K-Means Clustering
K-Means partitions data into K clusters by minimizing the distance between points and their assigned cluster centroids.
Objective Function
K-Means minimizes the sum of squared distances from each point to its assigned cluster centroid:
\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
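As a quick sanity check, J can be computed directly with NumPy. The data points and cluster assignments below are hypothetical, chosen only to illustrate the formula:

```python
import numpy as np

# Toy example: 6 points in two obvious clusters (values are hypothetical)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assignment of each point to a cluster
K = 2

# mu_i: mean of the points assigned to cluster i
centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])

# J: sum over clusters of squared distances to each cluster's centroid
J = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(K))
```

This is exactly the double sum in the objective: the outer loop runs over clusters, the inner `np.sum` accumulates the squared distances within each cluster.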
Algorithm Steps
1. Initialize K centroids (randomly or with k-means++).
2. Assign each point to the nearest centroid.
3. Recompute centroids as the mean of assigned points.
4. Repeat steps 2–3 until convergence (assignments stop changing).
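The steps above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm with plain random initialization, not the optimized scikit-learn implementation; function and variable names are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize by picking k distinct data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

At convergence each centroid is the mean of its assigned points, which is the fixed point the update in step 3 drives toward.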
K-Means with scikit-learn
Clustering with KMeans
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Synthetic example data; replace with your own feature matrix X
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Find a good K with the elbow method
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
# Fit final model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
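After fitting, the learned centroids and per-point labels can be inspected directly. A small sketch, using synthetic blobs since X is not defined in this excerpt:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data (2 features by default)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # one centroid per cluster per feature
print(np.bincount(clusters))          # number of points in each cluster
```

`fit_predict` is equivalent to calling `fit` followed by reading `kmeans.labels_`; `inertia_` on the fitted model gives the final value of the objective J.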
Advanced Topics & Limitations
- Initialization: use init="k-means++" (default in scikit-learn) to reduce the chance of poor local minima.
- Scaling: always scale features before K-Means so that each dimension contributes fairly to the distance.
- Cluster shape: K‑Means works best for roughly spherical, equally sized clusters.
- Outliers: strongly affected by outliers; consider removing or using robust alternatives like K‑Medoids.
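The scaling advice above can be sketched with a pipeline. The two-feature data here is hypothetical, with the second feature on a much larger scale than the first so that unscaled distances would be dominated by it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 2 has a far larger scale than feature 1
X = np.array([[1.0, 200.0], [2.0, 180.0], [1.5, 190.0],
              [8.0, 10.0], [9.0, 30.0], [8.5, 20.0]])

# StandardScaler standardizes each feature before K-Means sees it
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```

Bundling the scaler and the clusterer in one pipeline also ensures that any later data passed to `predict` is scaled with the statistics learned from the training data.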
kmeans = KMeans(
    n_clusters=3,      # number of clusters K
    init="k-means++",  # spread-out initial centroids (scikit-learn's default)
    n_init=10,         # best of 10 initializations (lowest inertia wins)
    max_iter=300,      # iteration cap per initialization
    random_state=42    # reproducible results
)