K-Nearest Neighbours (KNN)
KNN is a simple, non-parametric algorithm that classifies a point based on the majority label among its K closest neighbours.
Intuition
- Store the entire training set in memory.
- To classify a new point, compute distances to all training points.
- Pick the top K closest points and vote their labels.
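The three steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration (brute-force distances, simple majority vote), not a production implementation; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 1–2: compute Euclidean distances from x to every stored training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 3: take the indices of the k closest points...
    nearest = np.argsort(dists)[:k]
    # ...and return the majority label among them.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny toy dataset: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # -> 0
```

Note there is no training phase at all: "fitting" is just storing the data, which is why KNN is called instance-based (or lazy) learning.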
KNN with scikit-learn
KNN Classifier
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load a sample dataset (Iris) so the snippet runs end to end.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
knn = KNeighborsClassifier(
n_neighbors=5,
metric="minkowski",
p=2 # Euclidean distance
)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
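Beyond hard class labels, a fitted KNeighborsClassifier also reports the fraction of the K neighbours that voted for each class via predict_proba. A small self-contained sketch (it reloads Iris so it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Each row: fraction of the 5 neighbours belonging to each of the 3 classes.
proba = knn.predict_proba(X[:1])
print(proba.shape)  # -> (1, 3)
```

These "probabilities" are just vote proportions (multiples of 1/K), so they are coarse for small K.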
Distance Metrics & Variants
KNN relies entirely on the notion of distance. Common choices include:
- Euclidean distance (\(p = 2\)) – default for continuous features.
- Manhattan distance (\(p = 1\)) – more robust to outliers.
- Cosine distance (1 − cosine similarity) – for high‑dimensional sparse data (e.g., text).
scikit‑learn exposes this via the metric parameter, and you can also weight neighbours by distance:
knn = KNeighborsClassifier(
n_neighbors=10,
weights="distance", # closer neighbours have more influence
metric="manhattan"
)
Choosing K & Practical Tips
- Small K (e.g. 1–3) can overfit noise.
- Large K oversmooths decision boundaries and may underfit.
- Always scale numeric features (e.g. with StandardScaler) so that one feature’s units do not dominate distance.
- KNN can be expensive for very large datasets because naive prediction is \(O(N)\) per query; tree-based indexes (KD-tree, Ball tree) can speed this up in low dimensions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
knn_clf = Pipeline(steps=[
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=7))
])
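Since the best K is data-dependent, a common approach is to cross-validate over a small grid of (odd) values. A sketch using GridSearchCV on the same scaler + KNN pipeline, with Iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Odd K values help avoid tied votes in two-class problems.
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Putting the scaler inside the pipeline matters here: it is re-fit on each cross-validation training fold, so no information leaks from the validation folds into the scaling.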