Machine Learning

Clustering Algorithms

K-means clustering and anomaly detection for unsupervised learning.

K-Means Clustering

Objective Function

K-Means minimizes the within-cluster sum of squared distances from each point x to the centroid \mu_i (the mean) of its assigned cluster C_i:

\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
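To make the objective concrete, here is a minimal NumPy sketch that evaluates J for a given assignment. The function name `kmeans_objective` and the tiny one-dimensional dataset are illustrative assumptions, not part of any library.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]  # centroids[labels] picks each point's own centroid
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two clusters on a line
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])  # the means of each cluster

J = kmeans_objective(X, labels, centroids)
print(J)  # 4.0: every point is at distance 1 from its centroid
```

Moving the centroids away from the cluster means (for example to 0 and 10) raises J to 8.0, which is why step 3 of the algorithm uses the mean.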

Algorithm Steps

  1. Initialize K centroids (randomly or with k‑means++).
  2. Assign each point to the nearest centroid.
  3. Recompute centroids as the mean of assigned points.
  4. Repeat steps 2–3 until convergence (assignments stop changing) or a maximum number of iterations is reached.
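The steps above can be sketched in plain NumPy as follows. This is a minimal illustration, not scikit-learn's implementation: it assumes plain random initialization rather than k‑means++, and the function name `lloyd_kmeans`, the `max_iter` cap, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids (plain random init)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels, centroids = lloyd_kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated the loop should recover the two blobs within a few iterations; on harder data the result depends on the initial centroids, which is the motivation for k‑means++ and multiple restarts.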

K-Means with scikit-learn

Clustering with KMeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

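# X is assumed to be an (n_samples, n_features) array of already-scaled features.
# Elbow method: fit K-Means for K = 1..10 and look for the K where inertia stops dropping sharply.
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)  # inertia_ is the objective J of the fitted model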

plt.plot(K_range, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

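# Fit the final model. K=3 is illustrative; pick the K at the elbow of the plot above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)  # cluster index (0..K-1) for each row of X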

Advanced Topics & Limitations

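  • Initialization: use init="k-means++" (default in scikit‑learn) together with several restarts (n_init) to reduce the chance of ending in a poor local minimum.
  • Scaling: standardize features before K‑Means (for example with StandardScaler); otherwise features with large numeric ranges dominate the Euclidean distance.
  • Cluster shape: K‑Means works best for roughly spherical, similarly sized clusters and struggles with elongated or non‑convex ones.
  • Outliers: centroids are means, so outliers pull them strongly; consider removing outliers or using a more robust alternative such as K‑Medoids.

These settings can be made explicit when constructing the estimator: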
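kmeans = KMeans(
    n_clusters=3,
    init="k-means++",   # spread-out initial centroids
    n_init=10,          # run 10 initializations and keep the lowest-inertia result
    max_iter=300,       # cap on iterations per run
    random_state=42     # reproducible initialization
)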
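
To illustrate the scaling point, the sketch below compares K-Means on raw features with K-Means inside a StandardScaler pipeline. The data is synthetic and hypothetical: one informative feature with a small range and one pure-noise feature with a range in the thousands. The helper `agreement` is an assumption made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)  # true group of each sample (used only for checking)
X = np.column_stack([
    group + rng.normal(scale=0.05, size=200),  # informative feature, range about 1
    rng.normal(scale=1000.0, size=200),        # noise feature, range in the thousands
])

# Without scaling, the noise feature dominates the distance computation
raw = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# With scaling, both features contribute on the same footing
scaled = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
).fit_predict(X)

def agreement(labels, truth):
    """Fraction of points matching the true groups, ignoring cluster numbering."""
    acc = np.mean(labels == truth)
    return max(acc, 1 - acc)

print("raw:   ", agreement(raw, group))
print("scaled:", agreement(scaled, group))
```

On this data the unscaled model should do little better than chance, while the scaled pipeline should recover the groups almost perfectly. Keeping the scaler and the clusterer in one pipeline also ensures new data is transformed the same way before prediction.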

Anomaly Detection

Real-World Use Cases

  • Credit card fraud detection.
  • Network intrusion detection.
  • Industrial equipment fault monitoring.
  • Medical anomaly detection (rare diseases, unusual lab results).

Isolation Forest

Isolation Forest isolates anomalies by randomly partitioning the feature space; anomalies are easier to isolate and thus have shorter average path lengths in the trees.

IsolationForest with scikit-learn
from sklearn.ensemble import IsolationForest

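# X_train and X_test are assumed to be feature matrices with the same columns.
iso = IsolationForest(
    n_estimators=200,     # number of isolation trees
    contamination=0.02,   # expected fraction of anomalies; sets the decision threshold
    random_state=42
)
iso.fit(X_train)

scores = iso.decision_function(X_test)  # lower = more anomalous; negative = anomaly
labels = iso.predict(X_test)  # -1 = anomaly, 1 = normal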
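
For a version that runs end to end, here is a small sketch on synthetic data. The Gaussian training set and the two hand-picked test points are assumptions chosen so that one point is clearly typical and the other is clearly far from the bulk.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))          # hypothetical "normal" data
X_test = np.array([[0.0, 0.0], [8.0, 8.0]])  # one typical point, one far outlier

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
iso.fit(X_train)

scores = iso.decision_function(X_test)  # lower = more anomalous
labels = iso.predict(X_test)            # -1 = anomaly, 1 = normal

for point, score, label in zip(X_test, scores, labels):
    print(point, round(float(score), 3), label)
```

The far point is isolated after only a few random splits, so it should receive the lower score and be labeled -1, while the point at the center of the data should be labeled 1.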

One-Class SVM

One‑Class SVM learns a decision boundary around the "normal" class and flags points that lie outside this region as anomalies.

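OneClassSVM with scikit-learn
from sklearn.svm import OneClassSVM

# nu is an upper bound on the fraction of training points treated as outliers
# (and a lower bound on the fraction of support vectors).
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train_normal)  # fit on data assumed to contain only normal samples

pred = ocsvm.predict(X_test)  # -1 = anomaly, 1 = normal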
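
A self-contained sketch of the same idea on synthetic data follows. The training set, the two test points, and the variable `train_outlier_frac` are assumptions for illustration; the last line checks how many training points end up outside the learned boundary, which nu is meant to control.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train_normal = rng.normal(size=(400, 2))   # hypothetical normal-only training data
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])  # one typical point, one far outlier

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train_normal)

pred = ocsvm.predict(X_test)  # -1 = anomaly, 1 = normal
print(pred)

# Fraction of the training data that falls outside the boundary; expected to be near nu
train_outlier_frac = np.mean(ocsvm.predict(X_train_normal) == -1)
print(train_outlier_frac)
```

Raising nu tightens the boundary and flags more points; lowering it makes the model more permissive. Because the RBF kernel is distance-based, features on very different scales would usually be standardized first.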

Evaluating Anomaly Detectors

Evaluation is tricky because anomalies are rare and labels may be incomplete.

  • Use precision‑recall curves instead of accuracy for highly imbalanced data.
  • Work closely with domain experts to validate flagged anomalies.
  • Consider cost‑sensitive metrics (false negatives are often more expensive than false positives).
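
As a sketch of the first point, the example below scores an Isolation Forest with a precision-recall curve on synthetic data where labels happen to be known. The data, the 2% anomaly rate, and the names `anomaly_score` and `baseline_accuracy` are assumptions for illustration; real labels are rarely this clean.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(980, 2))
X_anom = rng.uniform(low=4.0, high=8.0, size=(20, 2))  # hypothetical anomalies far from the bulk
X = np.vstack([X_normal, X_anom])
y_true = np.concatenate([np.zeros(980, dtype=int), np.ones(20, dtype=int)])  # 1 = anomaly

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42).fit(X)

# decision_function is higher for normal points, so negate it to get an anomaly score
anomaly_score = -iso.decision_function(X)

precision, recall, thresholds = precision_recall_curve(y_true, anomaly_score)
ap = average_precision_score(y_true, anomaly_score)
print("average precision:", round(float(ap), 3))

# A detector that never flags anything is 98% accurate here yet finds no anomalies
baseline_accuracy = np.mean(y_true == 0)
print("accuracy of flagging nothing:", baseline_accuracy)
```

The baseline line shows why accuracy is misleading at this level of imbalance. The precision and recall arrays can be plotted against each other to choose a threshold that reflects the relative cost of false negatives and false positives.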