Dimensionality Reduction
Dimensionality reduction techniques simplify high‑dimensional data, speed up models, and reduce overfitting while preserving as much of the data's structure as possible.
Why Reduce Dimensions?
- High‑dimensional data can lead to the curse of dimensionality for distance‑based methods like KNN and clustering.
- Redundant or noisy features can harm model performance.
- Reduced dimensions make visualization and interpretation easier.
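The curse of dimensionality mentioned above can be made concrete with a small experiment on synthetic data (the data and the `distance_contrast` helper below are illustrative, not from the original): as the number of dimensions grows, pairwise Euclidean distances concentrate, so the gap between the nearest and farthest neighbour shrinks relative to the typical distance.

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(n_dims, n_points=500):
    """(max - min) pairwise distance divided by the mean distance
    from one query point to random points in the unit hypercube."""
    X = rng.random((n_points, n_dims))
    q = rng.random(n_dims)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.mean()

# Contrast shrinks as dimensionality grows, which is what hurts
# distance-based methods like KNN and clustering.
for dims in (2, 10, 100, 1000):
    print(f"{dims:>4} dims: contrast = {distance_contrast(dims):.3f}")
```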
PCA (Principal Component Analysis)
PCA is the most common linear dimensionality reduction method. See the dedicated PCA tutorial for more detail.
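As a quick orientation before the dedicated tutorial, here is a minimal PCA sketch with scikit-learn on synthetic data (the data is made up for illustration; one feature is deliberately redundant):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = X[:, 1] + 0.1 * rng.normal(size=200)  # redundant feature

# Standardize first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # fewer than 10 columns
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```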
t-SNE for Visualization
t‑SNE is a non‑linear technique mainly used for 2D/3D visualization of high‑dimensional data.
- Good at preserving local neighbourhood structure.
- Not ideal as a preprocessing step for supervised models (mainly for visualization).
from sklearn.manifold import TSNE

# X_scaled: standardized feature matrix (t-SNE, like PCA, benefits from scaling).
tsne = TSNE(
    n_components=2,    # embed into 2D for plotting
    perplexity=30,     # roughly the effective number of neighbours per point
    learning_rate=200,
    random_state=42,   # t-SNE is stochastic; fix the seed for reproducibility
)
X_tsne = tsne.fit_transform(X_scaled)
Feature Selection vs Feature Extraction
- Feature selection: keep or drop original features (e.g. using mutual information, model‑based importance).
- Feature extraction: create new features from combinations of the originals (e.g. PCA, autoencoders).
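The distinction above can be shown side by side on a small dataset (a hedged sketch using the Iris data; the choice of `k=2` and `n_components=2` is arbitrary, for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Selection: keeps 2 of the 4 original columns, values unchanged.
selector = SelectKBest(mutual_info_classif, k=2)
X_sel = selector.fit_transform(X, y)

# Extraction: builds 2 new columns as linear combinations of all 4.
X_ext = PCA(n_components=2).fit_transform(X)

print(X_sel.shape, X_ext.shape)  # both (150, 2)
print(selector.get_support())    # boolean mask over the original features
```

Selected columns remain interpretable as the original measurements; extracted components do not, which is the usual trade-off between the two approaches.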
Practical Workflow
- Always start with exploratory data analysis to understand feature distributions and correlations.
- Try simple feature selection (drop constant / duplicate / highly correlated features) before heavier methods.
- Use PCA or t‑SNE primarily for visualization and insight; validate any dimensionality reduction choice with downstream model performance.
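The simple feature-selection step above (dropping constant, duplicate, and highly correlated features) can be sketched with pandas; the toy DataFrame, column names, and the 0.95 correlation threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],  # exact duplicate of "a"
    "c": [5.0, 5.0, 5.0, 5.0],  # constant
    "d": [2.0, 0.5, 3.5, 1.0],
})

# 1. Drop constant columns (only one unique value).
df = df.loc[:, df.nunique() > 1]

# 2. Drop exact duplicate columns (keep the first occurrence).
df = df.T.drop_duplicates().T

# 3. Drop one column of each highly correlated pair (|r| > 0.95).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(list(df.columns))
```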