
K-Nearest Neighbours (KNN)

KNN is a simple, non-parametric, instance-based algorithm: it has no real training phase, and it classifies a new point by the majority label among its K closest neighbours in the training set.

Intuition

  • Store the entire training set in memory.
  • To classify a new point, compute distances to all training points.
  • Pick the top K closest points and vote their labels.
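The steps above can be sketched in a few lines of NumPy (a toy illustration only; the function name `knn_predict` and the tiny dataset are made up for this example, and scikit-learn's real implementation is far more optimised):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Compute Euclidean distances from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Take the indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # → 1
```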

KNN with scikit-learn

KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X and y are the feature matrix and label vector, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(
    n_neighbors=5,
    metric="minkowski",
    p=2  # Euclidean distance
)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

Distance Metrics & Variants

KNN relies entirely on the notion of distance. Common choices include:

  • Euclidean distance (\(p = 2\)) – default for continuous features.
  • Manhattan distance (\(p = 1\)) – more robust to outliers.
  • Cosine distance (1 − cosine similarity) – for high‑dimensional sparse data (e.g., text).
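To see how these metrics differ on the same pair of points, SciPy's standard distance functions give (the vectors `a` and `b` are arbitrary examples):

```python
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))  # sqrt(3^2 + 2^2 + 0^2) ≈ 3.606
print(distance.cityblock(a, b))  # |3| + |2| + |0| = 5.0 (Manhattan)
print(distance.cosine(a, b))     # 1 - cosine similarity ≈ 0.305
```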

scikit‑learn exposes this via the metric parameter, and you can also weight neighbours by distance:

knn = KNeighborsClassifier(
    n_neighbors=10,
    weights="distance",   # closer neighbours have more influence
    metric="manhattan"
)

Choosing K & Practical Tips

  • Small K (e.g. 1–3) can overfit noise.
  • Large K oversmooths decision boundaries and may underfit.
  • Always scale numeric features (e.g. with StandardScaler) so that one feature’s units do not dominate distance.
  • KNN can be expensive for very large datasets: brute-force prediction is \(O(N)\) per query, although tree-based indexes (KD-trees, ball trees, which scikit-learn selects automatically) help in low dimensions.

A scaling step is easy to bake in with a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

knn_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=7))
])
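A common way to choose K in practice is cross-validation; here is a sketch using GridSearchCV over a scaled pipeline (the Iris dataset stands in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # example dataset; substitute your own

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])

# Search over K and the weighting scheme with 5-fold cross-validation
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```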