Machine Learning
Dimensionality Reduction
PCA, t-SNE, and techniques to reduce feature space for ML.
Principal Component Analysis (PCA)
Intuition
- Find new axes (principal components) that maximize variance.
- Components are mutually orthogonal, so the projected features are uncorrelated.
- We keep only the top components that explain most of the variance.
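The three points above can be sketched with plain NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally: the toy data and the names `X`, `Xc`, `Z` and `k` are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, the first two strongly correlated
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Center the data and eigen-decompose its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort directions by decreasing variance: these are the principal components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance carried by each component
explained_ratio = eigvals / eigvals.sum()

# Keep only the top k components and project the data onto them
k = 2
Z = Xc @ eigvecs[:, :k]

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Correlation between PC1 and PC2 scores:", np.corrcoef(Z, rowvar=False)[0, 1])
```

The eigenvectors form an orthonormal basis, and the correlation between the projected columns comes out at zero up to floating-point error.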
Explained Variance
We often plot the cumulative explained variance ratio to decide how many components to keep.
PCA with scikit-learn
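from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# X_scaled is assumed to be the standardized feature matrix (e.g. the output of StandardScaler)
pca = PCA()  # no n_components: keep every component so the full variance curve is available
pca.fit(X_scaled)

# Cumulative share of variance explained by the first k components
cum_var = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.grid(True)
plt.show()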
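Instead of reading the number of components off the plot, the cutoff can also be computed. The sketch below assumes a 95% threshold and uses scikit-learn's bundled digits dataset as a stand-in for your own data. Both choices are illustrative, and `n_keep` and `pca_95` are names made up for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: 1797 samples, 64 pixel features
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cum_var = pca.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # assumed cutoff, adjust to your needs
n_keep = int(np.searchsorted(cum_var, threshold) + 1)
print("Components needed:", n_keep)

# Passing a float in (0, 1) as n_components asks PCA to pick the count itself
pca_95 = PCA(n_components=threshold).fit(X_scaled)
print("Components chosen by PCA:", pca_95.n_components_)
```

The two counts should normally agree, so the manual computation is mostly useful when you want to inspect the curve around the cutoff.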
Projecting Data
# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# y is assumed to hold one label per sample; it is only used to color the points
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection")
plt.show()
When to Use PCA
- You have many correlated numeric features and want to reduce dimensionality.
- You need to visualize high‑dimensional data in 2D or 3D.
- You want to speed up downstream models (e.g. KNN, clustering) by working in a lower‑dimensional space.
PCA in an ML Pipeline
You can combine PCA with a classifier inside a scikit‑learn Pipeline so scaling, PCA and modeling are applied consistently.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000))
])
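To show what "applied consistently" means in practice, the following sketch cross-validates the same pipeline. The digits dataset and the 5-fold setting are assumptions chosen for illustration. Inside each fold the scaler and PCA are fitted on the training part only, which avoids leaking information from the held-out part.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 64 pixel features, 10 classes
X, y = load_digits(return_X_y=True)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Each fold refits scaler, PCA and classifier on that fold's training split
scores = cross_val_score(clf, X, y, cv=5)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

Because the pipeline is a single estimator, `n_components` can also be tuned like any other hyperparameter, for example via the `pca__n_components` key in a grid search.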
Dimensionality Reduction
Why Reduce Dimensions?
- High‑dimensional data suffers from the curse of dimensionality: distances between points become less informative, which hurts distance‑based methods like KNN and clustering.
- Redundant or noisy features can harm model performance.
- Reduced dimensions make visualization and interpretation easier.
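The effect on distance-based methods can be made concrete with a small simulation. This is an illustrative sketch on uniformly random points; the helper `relative_contrast` is a hypothetical name introduced here. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min of distances from one query point to random points."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest neighbours end up at almost the same distance, so "nearest" carries less information.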
PCA (Principal Component Analysis)
PCA is the most common linear dimensionality reduction method. See the dedicated PCA tutorial for more detail.
t-SNE for Visualization
t‑SNE is a non‑linear technique mainly used for 2D/3D visualization of high‑dimensional data.
- Good at preserving local neighbourhood structure.
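- Not suited as a preprocessing step for supervised models: the embedding is fitted to one specific dataset and is not designed to map new, unseen samples, so use it mainly for visualization.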
from sklearn.manifold import TSNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    random_state=42
)
X_tsne = tsne.fit_transform(X_scaled)
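A self-contained variant of the snippet above, as a sketch: it assumes the digits dataset, restricted to 300 samples to keep the runtime short, and adds `init="pca"`, which is an optional setting not used in the original snippet. It also checks whether the fitted estimator offers a `transform` method for new data.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data; a small subset keeps the example fast
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X[:300])

tsne = TSNE(
    n_components=2,
    perplexity=30,       # must be smaller than the number of samples
    learning_rate=200,
    init="pca",          # assumption: PCA initialization instead of random
    random_state=42,
)
X_tsne = tsne.fit_transform(X_scaled)
print("Embedding shape:", X_tsne.shape)

# scikit-learn's TSNE only provides fit / fit_transform, not transform
has_transform = hasattr(tsne, "transform")
print("Can embed unseen samples with transform():", has_transform)
```

Since there is no `transform`, a train/test split cannot be embedded consistently, which is the practical reason t-SNE stays a visualization tool.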
Feature Selection vs Feature Extraction
- Feature selection: keep or drop original features (e.g. using mutual information, model‑based importance).
- Feature extraction: create new features from combinations of the originals (e.g. PCA, autoencoders).
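The difference between the two approaches shows up in what the output columns mean. The sketch below assumes the breast cancer dataset bundled with scikit-learn and an arbitrary target of 5 features. Both are illustrative choices, not recommendations.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target          # 569 samples, 30 original features
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep 5 of the original columns, ranked by mutual information
score_func = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
selector = SelectKBest(score_func=score_func, k=5)
X_selected = selector.fit_transform(X_scaled, y)
kept_names = data.feature_names[selector.get_support()]
print("Selected original features:", list(kept_names))

# Feature extraction: build 5 new columns, each a linear mix of all 30 originals
pca = PCA(n_components=5)
X_extracted = pca.fit_transform(X_scaled)
print("Weights of PC1 over the originals:", pca.components_[0].round(2))
```

Selected columns keep their original names and units, while each extracted column is a weighted combination of every input, which is harder to interpret but can capture shared structure.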
Practical Workflow
- Always start with exploratory data analysis to understand feature distributions and correlations.
- Try simple feature selection (drop constant / duplicate / highly correlated features) before heavier methods.
- Use t‑SNE for visualization and insight only; PCA can serve for visualization or as a preprocessing step. Validate any dimensionality reduction choice with downstream model performance.
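The simple feature selection step from the workflow can be sketched with pandas. The toy DataFrame, its column names and the 0.95 correlation threshold are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
base = rng.normal(size=n)

# Hypothetical toy table with one clean, one duplicate, one near-duplicate
# and one constant column next to an independent feature
df = pd.DataFrame({
    "a": base,
    "a_copy": base,                                # exact duplicate of "a"
    "a_noisy": base + 0.01 * rng.normal(size=n),   # highly correlated with "a"
    "b": rng.normal(size=n),
    "const": np.ones(n),                           # constant column
})

# 1. Drop constant columns (only one distinct value)
df = df.loc[:, df.nunique() > 1]

# 2. Drop exact duplicate columns, keeping the first occurrence
df = df.loc[:, ~df.T.duplicated()]

# 3. Drop one column from each highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print("Remaining features:", list(df_reduced.columns))
```

Looking only at the upper triangle of the correlation matrix means that, for each correlated pair, the column that appears later is dropped and the earlier one is kept.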