Machine Learning Dimensionality Reduction

Dimensionality Reduction

Dimensionality reduction techniques simplify high‑dimensional data, speed up models, and reduce overfitting, while preserving as much of the data's structure as possible.

Why Reduce Dimensions?

  • High‑dimensional data can lead to the curse of dimensionality for distance‑based methods like KNN and clustering.
  • Redundant or noisy features can harm model performance.
  • Reduced dimensions make visualization and interpretation easier.
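The first bullet can be made concrete: as dimensionality grows, pairwise Euclidean distances between random points concentrate, so "nearest" and "farthest" neighbours become nearly indistinguishable. A minimal sketch (the spread ratio and point counts are illustrative choices, not a standard metric):

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(n_dims, n_points=200):
    """Relative spread (max - min) / min of pairwise distances.

    Shrinks toward 0 as dimensionality grows, which is what hurts
    distance-based methods like KNN and clustering.
    """
    X = rng.random((n_points, n_dims))
    diffs = X[:, None, :] - X[None, :, :]          # all pairwise differences
    d = np.sqrt((diffs ** 2).sum(axis=-1))         # Euclidean distance matrix
    d = d[np.triu_indices(n_points, k=1)]          # unique pairs only
    return (d.max() - d.min()) / d.min()

for dims in (2, 10, 100, 1000):
    print(dims, round(distance_spread(dims), 3))   # spread shrinks as dims grow
```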

t-SNE for Visualization

t‑SNE is a non‑linear technique mainly used for 2D/3D visualization of high‑dimensional data.

  • Good at preserving local neighbourhood structure.
  • Not ideal as a preprocessing step for supervised models (mainly for visualization).

from sklearn.manifold import TSNE

# X_scaled: your standardized feature matrix (t-SNE is sensitive to feature scales)
tsne = TSNE(
    n_components=2,      # embed into 2D for plotting
    perplexity=30,       # roughly the effective neighbourhood size; must be < n_samples
    learning_rate=200,
    random_state=42      # t-SNE is stochastic; fix the seed for reproducible plots
)
X_tsne = tsne.fit_transform(X_scaled)

Feature Selection vs Feature Extraction

  • Feature selection: keep or drop original features (e.g. using mutual information, model‑based importance).
  • Feature extraction: create new features from combinations of the originals (e.g. PCA, autoencoders).
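To make the distinction concrete, here is a small scikit-learn sketch of both approaches side by side on the Iris data (the dataset and the choice of two output features are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original columns with highest mutual information.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
X_sel = selector.transform(X)    # columns are an unchanged subset of the originals
print("selected column indices:", selector.get_support(indices=True))

# Feature extraction: build 2 new columns as linear combinations of all originals.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)         # columns are new, derived features
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

Selected features stay interpretable (each is still an original measurement); extracted features can capture more variance but are harder to read.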

Practical Workflow

  • Always start with exploratory data analysis to understand feature distributions and correlations.
  • Try simple feature selection (drop constant / duplicate / highly correlated features) before heavier methods.
  • Use PCA or t‑SNE primarily for visualization and insight; validate any dimensionality reduction choice with downstream model performance.
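The "drop constant / duplicate / highly correlated features" step above can be sketched in pandas. The toy DataFrame and the 0.95 correlation threshold here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["const"] = 1.0                                            # constant feature
df["dup"] = df["a"]                                          # exact duplicate of "a"
df["corr"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"
df["b"] = rng.normal(size=100)                               # independent feature

# 1. Drop constant columns (only one unique value).
df = df.loc[:, df.nunique() > 1]

# 2. Drop exact duplicate columns.
df = df.loc[:, ~df.T.duplicated()]

# 3. Drop one column from each highly correlated pair (|r| > 0.95).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)

print(list(df.columns))
```

These cheap checks often shrink the feature set substantially before any PCA or model-based selection is needed.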