K-Means Clustering
K-Means partitions data into K clusters by minimizing the distance between points and their assigned cluster centroids.
Objective Function
K-Means minimizes the sum of squared distances from each point to its assigned cluster centroid:
\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
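As a quick sanity check, J can be computed directly with NumPy. The data points and cluster assignments below are hypothetical, chosen only to illustrate the formula:

```python
import numpy as np

# Toy example: 6 points in two obvious clusters (values are hypothetical)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assignment of each point to a cluster
K = 2

# mu_i: mean of the points assigned to cluster i
centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])

# J: sum over clusters of squared distances to each cluster's centroid
J = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(K))
```

This is exactly the double sum in the objective: the outer loop runs over clusters, the inner `np.sum` accumulates the squared distances within each cluster.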
Algorithm Steps
1. Initialize K centroids (randomly or with k-means++).
2. Assign each point to the nearest centroid.
3. Recompute centroids as the mean of assigned points.
4. Repeat steps 2–3 until convergence (assignments stop changing).
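The steps above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm with plain random initialization, not the optimized scikit-learn implementation; function and variable names are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize by picking k distinct data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

At convergence each centroid is the mean of its assigned points, which is the fixed point the update in step 3 drives toward.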
K-Means with scikit-learn
Clustering with KMeans
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Synthetic example data; replace with your own feature matrix X
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Find a good K with the elbow method
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
# Fit final model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
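After fitting, the learned centroids and per-point labels can be inspected directly. A small sketch, using synthetic blobs since X is not defined in this excerpt:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data (2 features by default)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # one centroid per cluster per feature
print(np.bincount(clusters))          # number of points in each cluster
```

`fit_predict` is equivalent to calling `fit` followed by reading `kmeans.labels_`; `inertia_` on the fitted model gives the final value of the objective J.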
Advanced Topics & Limitations
- Initialization: use init="k-means++" (default in scikit-learn) to reduce the chance of poor local minima.
- Scaling: always scale features before K-Means so that each dimension contributes fairly to the distance.
- Cluster shape: K‑Means works best for roughly spherical, equally sized clusters.
- Outliers: strongly affected by outliers; consider removing or using robust alternatives like K‑Medoids.
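The scaling advice above can be sketched with a pipeline. The two-feature data here is hypothetical, with the second feature on a much larger scale than the first so that unscaled distances would be dominated by it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 2 has a far larger scale than feature 1
X = np.array([[1.0, 200.0], [2.0, 180.0], [1.5, 190.0],
              [8.0, 10.0], [9.0, 30.0], [8.5, 20.0]])

# StandardScaler standardizes each feature before K-Means sees it
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```

Bundling the scaler and the clusterer in one pipeline also ensures that any later data passed to `predict` is scaled with the statistics learned from the training data.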
kmeans = KMeans(
    n_clusters=3,      # number of clusters K
    init="k-means++",  # spread-out initial centroids (scikit-learn's default)
    n_init=10,         # best of 10 initializations (lowest inertia wins)
    max_iter=300,      # iteration cap per initialization
    random_state=42    # reproducible results
)