Machine Learning

Clustering Algorithms

K-means clustering and anomaly detection for unsupervised learning.

K-Means Clustering

Objective Function

K-Means minimizes the sum of squared distances from each point to its assigned cluster centroid:

\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
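As a rough check of the formula, the sketch below computes J by hand and compares it with the `inertia_` attribute that scikit-learn reports. The toy blobs and the names `X_demo` and `km_demo` are assumptions made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed for illustration): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# J: sum over all points of the squared distance to the assigned centroid.
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
J = float(np.sum((X_demo - assigned_centroids) ** 2))

print(J, km_demo.inertia_)  # the two values should agree up to rounding
```

The inertia plotted by the elbow method is this same quantity, evaluated for each candidate K.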

Algorithm Steps

1. Initialize K centroids (randomly or with k-means++).
2. Assign each point to the nearest centroid.
3. Recompute each centroid as the mean of its assigned points.
4. Repeat steps 2–3 until convergence (assignments stop changing).
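The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: it uses simple random initialization, and the function name `kmeans_lloyd`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: use k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Usage on two well-separated toy blobs (assumed data).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=10.0, scale=0.3, size=(30, 2))
demo_centroids, demo_labels = kmeans_lloyd(np.vstack([blob_a, blob_b]), k=2)
```

Because the result depends on the starting centroids, practical implementations usually run several initializations and keep the one with the lowest objective value.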

K-Means with scikit-learn

Clustering with KMeans

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

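# Choose a reasonable K with the elbow method (a heuristic, not a guarantee).
# X is assumed to be a feature matrix of shape (n_samples, n_features).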
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

# Fit final model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)

Advanced Topics & Limitations

Initialization: use init="k-means++" (the default in scikit-learn) to reduce the chance of converging to a poor local minimum.
Scaling: scale features before K-Means so that each dimension contributes comparably to the distance.
Cluster shape: K-Means works best for roughly spherical, similarly sized clusters.
Outliers: K-Means is strongly affected by outliers; consider removing them or using a more robust alternative such as K-Medoids.

kmeans = KMeans(
    n_clusters=3,
    init="k-means++",
    n_init=10,
    max_iter=300,
    random_state=42
)
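The scaling advice can be illustrated with a pipeline that standardizes features before clustering. This is a sketch under assumptions: the synthetic data (one informative small-scale feature and one noisy large-scale feature) and the names `X_raw`, `group`, and `scaled_kmeans` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)  # hypothetical ground-truth groups
X_raw = np.column_stack([
    group * 4.0 + rng.normal(scale=0.5, size=200),  # informative, small scale
    rng.normal(scale=1000.0, size=200),             # uninformative, large scale
])

# Standardize each feature to zero mean and unit variance, then cluster.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42),
)
labels_scaled = scaled_kmeans.fit_predict(X_raw)

# Agreement with the true groups, up to a swap of the two cluster labels.
match = np.mean(labels_scaled == group)
agreement = float(max(match, 1.0 - match))
```

Without the scaler, the second feature would dominate the Euclidean distance simply because of its units.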

Anomaly Detection

Real-World Use Cases

Credit card fraud detection.
Network intrusion detection.
Industrial equipment fault monitoring.
Medical anomaly detection (rare diseases, unusual lab results).

Isolation Forest

Isolation Forest isolates anomalies by randomly partitioning the feature space; anomalies are easier to isolate and thus have shorter average path lengths in the trees.

IsolationForest with scikit-learn

from sklearn.ensemble import IsolationForest

iso = IsolationForest(
    n_estimators=200,
    contamination=0.02,
    random_state=42
)
iso.fit(X_train)

scores = iso.decision_function(X_test)  # lower = more anomalous; negative = flagged
labels = iso.predict(X_test)  # -1 = anomaly, 1 = normal
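To make the path-length intuition concrete, here is a self-contained sketch on synthetic data. The Gaussian inliers, the three hand-placed outliers, and the `*_demo` names are assumptions for illustration. `score_samples` returns lower values for points that are easier to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))         # assumed "normal" data
X_outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])    # hand-placed far points

iso_demo = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
iso_demo.fit(X_inliers)

# Lower score = shorter average path length = more anomalous.
inlier_scores = iso_demo.score_samples(X_inliers)
outlier_scores = iso_demo.score_samples(X_outliers)
outlier_labels = iso_demo.predict(X_outliers)

print("median inlier score:", np.median(inlier_scores))
print("outlier scores:", outlier_scores)
```

The `contamination` value only sets the threshold that `predict` applies to these scores; it does not change how the trees are built.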

One-Class SVM

One-Class SVM learns a decision boundary around the "normal" class and flags points that lie outside this region as anomalies.

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train_normal)

pred = ocsvm.predict(X_test)  # -1 anomaly, 1 normal
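A runnable variant of the snippet above, on synthetic data, may help show what the boundary does. The data and the `*_demo` names are assumptions. The `nu` parameter acts roughly as an upper bound on the fraction of training points that end up outside the boundary.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal_demo = rng.normal(size=(400, 2))      # assumed "normal" training data
X_new = np.array([[0.1, -0.2], [7.0, 7.0]])    # one typical point, one far away

ocsvm_demo = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm_demo.fit(X_normal_demo)

pred_new = ocsvm_demo.predict(X_new)  # -1 = anomaly, 1 = normal

# Fraction of the training data itself that falls outside the boundary.
train_flag_rate = float(np.mean(ocsvm_demo.predict(X_normal_demo) == -1))
```

Because the RBF kernel is distance-based, the scaling advice given for K-Means applies here as well.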

Evaluating Anomaly Detectors

Evaluation is tricky because anomalies are rare and labels may be incomplete.

Use precision-recall curves instead of accuracy for highly imbalanced data.
Work closely with domain experts to validate flagged anomalies.
Consider cost-sensitive metrics (false negatives are often more expensive than false positives).
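The first point can be sketched as follows, assuming ground-truth labels are available for an evaluation set. The synthetic data with 2% anomalies and all variable names are hypothetical. The sign of `score_samples` is flipped so that higher values mean "more anomalous", which is what the metric functions expect for the positive class.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(980, 2))                    # assumed normal points
X_anom = rng.normal(loc=6.0, scale=1.0, size=(20, 2))   # assumed anomalies (2%)
X_eval = np.vstack([X_normal, X_anom])
y_true = np.concatenate([np.zeros(980, dtype=int), np.ones(20, dtype=int)])

detector = IsolationForest(n_estimators=200, random_state=0).fit(X_eval)
anomaly_score = -detector.score_samples(X_eval)  # higher = more anomalous

precision, recall, thresholds = precision_recall_curve(y_true, anomaly_score)
avg_precision = average_precision_score(y_true, anomaly_score)

# A detector that never flags anything is 98% accurate on this data,
# which is why accuracy is a poor guide here.
baseline_accuracy = float(np.mean(y_true == 0))
```

A useful reference point for average precision is the anomaly rate itself (0.02 here), which is roughly what a random ranking would achieve.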
