Machine Learning
Clustering Algorithms
K-means clustering and anomaly detection for unsupervised learning.
K-Means Clustering
Objective Function
K-Means minimizes the sum of squared distances from each point to its assigned cluster centroid:
\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
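As a quick check of the formula, the sketch below computes J by hand and compares it with the `inertia_` attribute that scikit-learn reports after fitting. The synthetic data and the names `X_demo` and `km_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# J = sum over clusters i of the sum over x in C_i of ||x - mu_i||^2
J = 0.0
for i, mu in enumerate(km_demo.cluster_centers_):
    members = X_demo[km_demo.labels_ == i]
    J += float(np.sum((members - mu) ** 2))

print("J computed by hand:", J)
print("km_demo.inertia_:  ", km_demo.inertia_)
```

The two printed values should agree up to floating-point error, which is why inertia is the quantity plotted in the elbow method.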
Algorithm Steps
1. Initialize K centroids (randomly or with k‑means++).
2. Assign each point to the nearest centroid.
3. Recompute centroids as the mean of assigned points.
4. Repeat steps 2–3 until convergence (assignments stop changing).
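The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: it uses random initialization rather than k-means++, and the function name `kmeans_lloyd`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Basic K-Means loop with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its points;
        # an empty cluster keeps its previous centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Toy data: two tight groups of three points each.
X_toy = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
toy_centroids, toy_labels = kmeans_lloyd(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.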
K-Means with scikit-learn
Clustering with KMeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Pick a reasonable K with the elbow method.
# X is assumed to be a numeric feature matrix of shape (n_samples, n_features).
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
# Fit final model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
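The snippet above assumes a feature matrix X already exists. As a self-contained sketch, the example below generates synthetic data with three blobs (an assumption, as are the names `X_blobs`, `inertia_by_k`, and `drops`) and prints how much inertia falls at each step, which is the numeric counterpart of reading the elbow off the plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X (assumption): three well-separated 2-D blobs.
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

ks = list(range(1, 7))
inertia_by_k = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs).inertia_
    for k in ks
]

# Reduction in inertia gained by going from K-1 to K clusters.
drops = -np.diff(inertia_by_k)
for k, drop in zip(ks[1:], drops):
    print(f"K={k}: inertia falls by {drop:.1f}")
```

On data like this, the reduction should be large up to K=3 and small afterwards. On real data the elbow is often much less pronounced, so treat it as a heuristic.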
Advanced Topics & Limitations
- Initialization: use init="k-means++" (default in scikit‑learn) to reduce the chance of poor local minima.
- Scaling: always scale features before K‑Means so that each dimension contributes fairly to the distance.
- Cluster shape: K‑Means works best for roughly spherical, equally sized clusters.
- Outliers: centroids are means, so a few extreme points can drag them away from the bulk of a cluster; consider removing outliers or using a robust alternative such as K‑Medoids.
kmeans = KMeans(
    n_clusters=3,
    init="k-means++",
    n_init=10,
    max_iter=300,
    random_state=42
)
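To illustrate the scaling point, the sketch below compares K-Means with and without StandardScaler on synthetic data where two small-scale features carry the group structure and a third, large-scale feature is pure noise. The data, the variable names, and the use of the adjusted Rand index as a yardstick are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): features 0 and 1 separate two groups on a
# small scale; feature 2 is noise on a much larger scale.
rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 200)
signal = rng.normal(loc=y_true[:, None].astype(float), scale=0.1, size=(400, 2))
noise = rng.normal(loc=0.0, scale=100.0, size=(400, 1))
X_mixed = np.hstack([signal, noise])

# Without scaling, distances are dominated by the noisy feature.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_mixed)

# With scaling, every feature has unit variance before clustering.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
scaled_labels = scaled_kmeans.fit_predict(X_mixed)

ari_raw = adjusted_rand_score(y_true, raw_labels)
ari_scaled = adjusted_rand_score(y_true, scaled_labels)
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Putting the scaler and the clusterer in one pipeline also ensures that new data passed to `predict` is scaled with the statistics learned during fitting.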
Anomaly Detection
Real-World Use Cases
- Credit card fraud detection.
- Network intrusion detection.
- Industrial equipment fault monitoring.
- Medical anomaly detection (rare diseases, unusual lab results).
Isolation Forest
Isolation Forest isolates anomalies by randomly partitioning the feature space; anomalies are easier to isolate and thus have shorter average path lengths in the trees.
IsolationForest with scikit-learn
from sklearn.ensemble import IsolationForest
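iso = IsolationForest(
    n_estimators=200,
    contamination=0.02,  # expected fraction of anomalies; sets the decision threshold
    random_state=42
)
iso.fit(X_train)
scores = iso.decision_function(X_test)  # lower = more anomalous; negative = anomaly
labels = iso.predict(X_test) # -1 = anomaly, 1 = normal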
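A small self-contained sketch can make the scoring behaviour concrete. It fits the forest on a synthetic Gaussian cloud and scores a few hand-picked probe points; the data and the names `X_normal`, `forest`, and `X_probe` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal" data (assumption): a standard Gaussian cloud in 2-D.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
forest.fit(X_normal)

X_probe = np.array([
    [0.0, 0.0],   # centre of the training cloud
    [0.5, -0.5],  # still well inside it
    [8.0, 8.0],   # far from every training point
    [-9.0, 7.0],  # far from every training point
])
probe_scores = forest.decision_function(X_probe)
probe_labels = forest.predict(X_probe)

for point, score, label in zip(X_probe, probe_scores, probe_labels):
    status = "anomaly" if label == -1 else "normal"
    print(point, f"score={score:+.3f}", status)
```

The far-away points are isolated after only a few random splits, so they should receive negative scores and the label -1, while the central points score positive.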
One-Class SVM
One‑Class SVM learns a decision boundary around the "normal" class and flags points that lie outside this region as anomalies.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train_normal)
pred = ocsvm.predict(X_test) # -1 anomaly, 1 normal
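The sketch below, on synthetic data, looks at what `nu` does in practice: it is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so roughly that share of the training set tends to fall outside the learned boundary. The data and the names `X_ok`, `svm_demo`, and `far_point` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "normal" training data (assumption): a Gaussian cloud in 2-D.
rng = np.random.default_rng(0)
X_ok = rng.normal(loc=0.0, scale=1.0, size=(400, 2))

svm_demo = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_ok)

# Share of the training points that end up outside the boundary.
train_flagged = float(np.mean(svm_demo.predict(X_ok) == -1))

# A point far away from all training data.
far_point = np.array([[8.0, 8.0]])
far_label = int(svm_demo.predict(far_point)[0])

print(f"fraction of training points flagged: {train_flagged:.3f}")
print("label for far point:", far_label)
```

Because the RBF kernel depends on distances, the same scaling advice as for K-Means applies; a StandardScaler in front of the model is a common choice.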
Evaluating Anomaly Detectors
Evaluation is tricky because anomalies are rare and labels may be incomplete.
- Use precision‑recall curves instead of accuracy for highly imbalanced data.
- Work closely with domain experts to validate flagged anomalies.
- Consider cost‑sensitive metrics (false negatives are often more expensive than false positives).
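When at least some labels are available, the points above can be sketched as follows. The example builds a precision-recall curve from Isolation Forest scores on synthetic data and then picks the threshold that minimizes a simple cost. The data, the variable names, and the 10:1 cost ratio are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic data (assumption): normal points around the origin, plus a
# small group of anomalies in the evaluation set only.
rng = np.random.default_rng(0)
X_fit = rng.normal(0.0, 1.0, size=(1000, 2))
X_eval = np.vstack([
    rng.normal(0.0, 1.0, size=(480, 2)),
    rng.normal(6.0, 0.5, size=(20, 2)),
])
y_eval = np.r_[np.zeros(480, dtype=int), np.ones(20, dtype=int)]  # 1 = anomaly

detector = IsolationForest(n_estimators=200, random_state=42).fit(X_fit)
# Negate so that higher values mean "more anomalous", as the metrics expect.
anomaly_score = -detector.decision_function(X_eval)

precision, recall, thresholds = precision_recall_curve(y_eval, anomaly_score)
ap = average_precision_score(y_eval, anomaly_score)
print(f"average precision: {ap:.3f} (anomaly rate: {y_eval.mean():.3f})")

# Hypothetical costs: a missed anomaly is 10x as costly as a false alarm.
COST_FN, COST_FP = 10.0, 1.0
costs = []
for t in thresholds:
    flagged = anomaly_score >= t
    fn = np.sum((y_eval == 1) & ~flagged)  # anomalies that were missed
    fp = np.sum((y_eval == 0) & flagged)   # normal points that were flagged
    costs.append(COST_FN * fn + COST_FP * fp)
best = int(np.argmin(costs))
print(f"lowest-cost threshold: {thresholds[best]:.3f}, cost: {costs[best]:.1f}")
```

Average precision is best read against the anomaly rate, which is what a detector that ranks points at random would score on average.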