K-Nearest Neighbours (KNN)
KNN is a simple, non-parametric algorithm that classifies a point based on the majority label among its K closest neighbours.
Intuition
- Store the entire training set in memory.
- To classify a new point, compute distances to all training points.
- Pick the top K closest points and vote their labels.
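The three steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration (brute-force distances, simple majority vote), not a production implementation; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 1–2: compute Euclidean distances from x to every stored training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 3: take the indices of the k closest points...
    nearest = np.argsort(dists)[:k]
    # ...and return the majority label among them.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny toy dataset: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # -> 0
```

Note there is no training phase at all: "fitting" is just storing the data, which is why KNN is called instance-based (or lazy) learning.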
KNN with scikit-learn
KNN Classifier
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load a sample dataset (Iris) so the snippet runs end to end.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
knn = KNeighborsClassifier(
n_neighbors=5,
metric="minkowski",
p=2 # Euclidean distance
)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
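Beyond hard class labels, a fitted KNeighborsClassifier also reports the fraction of the K neighbours that voted for each class via predict_proba. A small self-contained sketch (it reloads Iris so it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Each row: fraction of the 5 neighbours belonging to each of the 3 classes.
proba = knn.predict_proba(X[:1])
print(proba.shape)  # -> (1, 3)
```

These "probabilities" are just vote proportions (multiples of 1/K), so they are coarse for small K.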
Distance Metrics & Variants
KNN relies entirely on the notion of distance. Common choices include:
- Euclidean distance (\(p = 2\)) – default for continuous features.
- Manhattan distance (\(p = 1\)) – more robust to outliers.
- Cosine distance (1 − cosine similarity) – for high‑dimensional sparse data (e.g., text).
scikit‑learn exposes this via the metric parameter, and you can also weight neighbours by distance:
knn = KNeighborsClassifier(
n_neighbors=10,
weights="distance", # closer neighbours have more influence
metric="manhattan"
)
Choosing K & Practical Tips
- Small K (e.g. 1–3) can overfit noise.
- Large K oversmooths decision boundaries and may underfit.
- Always scale numeric features (e.g. with StandardScaler) so that one feature’s units do not dominate distance.
- KNN can be expensive for very large datasets because naive prediction is \(O(N)\) per query; tree-based indexes (KD-tree, Ball tree) can speed this up in low dimensions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
knn_clf = Pipeline(steps=[
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=7))
])
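Since the best K is data-dependent, a common approach is to cross-validate over a small grid of (odd) values. A sketch using GridSearchCV on the same scaler + KNN pipeline, with Iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Odd K values help avoid tied votes in two-class problems.
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Putting the scaler inside the pipeline matters here: it is re-fit on each cross-validation training fold, so no information leaks from the validation folds into the scaling.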