Machine Learning

Dimensionality Reduction

PCA, t-SNE, and techniques to reduce feature space for ML.

Principal Component Analysis (PCA)

Intuition

Find new axes (principal components) along which the projected data has maximum variance.
The component directions are mutually orthogonal, so the projected features are uncorrelated.
We keep only the top components, which together explain most of the variance.
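
The intuition above can be sketched with plain NumPy. This is a minimal illustration, not how scikit-learn must be used: the synthetic data and all variable names are assumptions made for the example. It centers the data, takes an SVD, and projects onto the top two axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two strongly
# correlated features plus one independent noise feature.
latent = rng.normal(size=(500, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 1)),
    2 * latent + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# Center the data, then take the SVD: the rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each axis, ordered from largest to smallest.
explained_variance = S**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the top 2 components.
scores = X_centered @ Vt[:2].T

print(explained_variance_ratio)
print(np.cov(scores.T))  # off-diagonal entries are numerically zero
```

The axes in Vt are orthonormal, and the covariance matrix of the projected scores is diagonal, which is what "uncorrelated" means here.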

Explained Variance

We often plot the cumulative explained variance ratio to decide how many components to keep.

PCA with scikit-learn

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

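# X_scaled is assumed to be the feature matrix after standardization
# (e.g. StandardScaler().fit_transform(X)); PCA is sensitive to feature scale.
pca = PCA()  # keep all components so the full variance curve can be inspected
pca.fit(X_scaled)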

cum_var = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.grid(True)
plt.show()
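
Instead of reading the cutoff from the plot, the number of components can also be computed. The sketch below assumes a 95% variance threshold and uses the wine dataset bundled with scikit-learn purely as stand-in data; both are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): any standardized numeric matrix works.
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cum_var, threshold) + 1)
print(n_keep, cum_var[n_keep - 1])

# Shortcut: a float n_components in (0, 1) asks PCA to pick the count itself.
pca_95 = PCA(n_components=threshold).fit(X_scaled)
print(pca_95.n_components_)
```

Both approaches should agree on the count; the threshold itself is a judgment call and is worth checking against downstream model performance.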

Projecting Data

Reduce to 2D for visualization

pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# Color each point by its label; y is assumed to hold one label per row
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection")
plt.show()

When to Use PCA

You have many correlated numeric features and want to reduce dimensionality.
You need to visualize high-dimensional data in 2D or 3D.
You want to speed up downstream models (e.g. KNN, clustering) by working in a lower-dimensional space.

PCA in an ML Pipeline

You can combine PCA with a classifier inside a scikit-learn Pipeline so scaling, PCA and modeling are applied consistently.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000))
])
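
To see the pipeline in use, here is a minimal sketch that evaluates it with cross-validation. The digits dataset and the candidate values for n_components are assumptions chosen for illustration; because the scaler and PCA live inside the pipeline, they are refit on each training fold only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): 64 pixel features per digit image.
X, y = load_digits(return_X_y=True)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Compare a few component counts by mean 5-fold accuracy.
results = {}
for n in (5, 10, 20):
    clf.set_params(pca__n_components=n)  # "<step name>__<parameter>" syntax
    results[n] = cross_val_score(clf, X, y, cv=5).mean()

for n, acc in results.items():
    print(f"n_components={n}: mean CV accuracy = {acc:.3f}")
```

The same step-name prefix (pca__n_components) can be used in a GridSearchCV parameter grid if a more systematic search is wanted.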

Dimensionality Reduction

Why Reduce Dimensions?

High-dimensional data can lead to the curse of dimensionality for distance-based methods like KNN and clustering.
Redundant or noisy features can harm model performance.
Reduced dimensions make visualization and interpretation easier.
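
One way to see why distance-based methods struggle is to measure how the nearest and farthest neighbours of a point compare as the dimension grows. The sketch below uses uniformly random points, which is an assumption for illustration; the helper relative_contrast is hypothetical and not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min distance from one query point to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as dimensionality grows: the nearest and farthest
# points end up at nearly the same distance from the query.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

When the contrast is small, "nearest neighbour" carries little information, which is one motivation for reducing dimensions before KNN or clustering.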

PCA (Principal Component Analysis)

PCA is the most common linear dimensionality reduction method. See the dedicated PCA tutorial for more detail.

t-SNE for Visualization

t-SNE is a non-linear technique mainly used for 2D/3D visualization of high-dimensional data.

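Good at preserving local neighbourhood structure; distances between far-apart clusters in the embedding are much less reliable.
Not ideal as a preprocessing step for supervised models: scikit-learn's TSNE has no transform method, so an embedding cannot be applied to new samples. Use it mainly for visualization.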

from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    random_state=42
)
X_tsne = tsne.fit_transform(X_scaled)
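
For an end-to-end run, the sketch below applies the same settings to a subset of the digits dataset. The dataset, the 500-sample subset, and init="pca" are assumptions made to keep the example small and reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): the first 500 digit images.
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:500], y[:500]
X_scaled = StandardScaler().fit_transform(X_small)

tsne = TSNE(
    n_components=2,
    perplexity=30,       # must be smaller than the number of samples
    learning_rate=200,
    init="pca",          # PCA initialization tends to give more stable layouts
    random_state=42,
)
X_tsne = tsne.fit_transform(X_scaled)

print(X_tsne.shape)                # one 2D point per input sample
print(hasattr(tsne, "transform"))  # TSNE offers fit_transform only
```

The resulting X_tsne can be plotted with the same scatter call used for the PCA projection, colored by y_small. Results vary with perplexity, so it is worth trying a few values before drawing conclusions from the picture.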

Feature Selection vs Feature Extraction

Feature selection: keep or drop original features (e.g. using mutual information, model-based importance).
Feature extraction: create new features from combinations of the originals (e.g. PCA, autoencoders).
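
The difference can be made concrete with a small side-by-side sketch. The wine dataset, k=3, and the fixed random_state are assumptions for illustration: selection returns a subset of the original columns, while extraction returns new columns that mix all of them.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep the 3 original features with the highest
# mutual information with the label (random_state fixed for repeatability).
selector = SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept_idx = selector.get_support(indices=True)
kept_names = [data.feature_names[i] for i in kept_idx]
print("Selected features:", kept_names)

# Feature extraction: build 3 new features, each a weighted combination
# of all 13 original features.
pca = PCA(n_components=3)
X_extracted = pca.fit_transform(X_scaled)
print("Weights per component:", pca.components_.shape)
```

Selected features keep their original names and meaning, which helps interpretation; extracted features usually do not map to a single original measurement.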

Practical Workflow

Always start with exploratory data analysis to understand feature distributions and correlations.
Try simple feature selection (drop constant / duplicate / highly correlated features) before heavier methods.
Use PCA or t-SNE primarily for visualization and insight; validate any dimensionality reduction choice with downstream model performance.
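
The simple feature selection step in the workflow could look roughly like this. The helper drop_simple_redundancy, the 0.95 correlation threshold, and the toy DataFrame are all hypothetical choices for illustration, not a standard API.

```python
import numpy as np
import pandas as pd

def drop_simple_redundancy(df, corr_threshold=0.95):
    """Drop constant, duplicate, and highly correlated numeric columns."""
    # 1. Constant columns carry no information.
    df = df.loc[:, df.nunique() > 1]
    # 2. Exact duplicate columns: keep the first occurrence.
    df = df.loc[:, ~df.T.duplicated()]
    # 3. Highly correlated pairs: look only at the upper triangle so that
    #    the earlier column of each pair is the one that is kept.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)

# Toy data (an assumption) containing each kind of redundancy.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a,                                   # exact duplicate of "a"
    "a_noisy": a + 0.01 * rng.normal(size=200),    # almost identical to "a"
    "b": rng.normal(size=200),                     # independent feature
    "const": 1.0,                                  # constant column
})

reduced = drop_simple_redundancy(df)
print(list(reduced.columns))
```

Which column of a correlated pair to keep is arbitrary in this sketch; in practice domain knowledge or missing-value rates may suggest a better choice.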
