Machine Learning PCA
Dimensionality Reduction

Principal Component Analysis (PCA)

PCA transforms correlated features into a smaller set of uncorrelated components that capture most of the variance in the data.

Intuition

  • Find new axes (principal components) that maximize variance.
  • Components are orthogonal (uncorrelated).
  • We keep only the top components that explain most of the variance.
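The bullets above can be sketched directly in NumPy: center the data, eigendecompose the covariance matrix, and the eigenvectors are the principal components. This is a minimal illustration on a hypothetical correlated two-feature dataset, not how scikit-learn implements PCA internally (it uses SVD).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: two correlated features
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])

# Center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Sort components by descending explained variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The components are orthogonal: their dot product is ~0
print(np.dot(eigvecs[:, 0], eigvecs[:, 1]))

# Project the data onto the top component (dimensionality 2 -> 1)
X_proj = Xc @ eigvecs[:, [0]]
```

Keeping only the leading eigenvectors is exactly the "keep the top components" step: each discarded component accounts for the least remaining variance.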

Explained Variance

We often plot the cumulative explained variance ratio to decide how many components to keep.

PCA with scikit-learn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# PCA is scale-sensitive, so standardize the features first
# (X is your numeric feature matrix)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

# Cumulative fraction of variance explained by the first k components
cum_var = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.grid(True)
plt.show()
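Instead of reading the number of components off the plot, you can let scikit-learn pick it: passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. A short sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep the fewest components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)                        # number of components kept
print(pca.explained_variance_ratio_.sum())      # >= 0.95 by construction
```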

Projecting Data

# Reduce to 2D for visualization (y holds the class labels used for coloring)
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection")
plt.show()

When to Use PCA

  • You have many correlated numeric features and want to reduce dimensionality.
  • You need to visualize high‑dimensional data in 2D or 3D.
  • You want to speed up downstream models (e.g. KNN, clustering) by working in a lower‑dimensional space.

PCA in an ML Pipeline

You can combine PCA with a classifier inside a scikit-learn Pipeline so that scaling, PCA, and modeling are applied consistently, and are fitted only on the training data during cross-validation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
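The pipeline is used like any other estimator: fit, predict, or pass it to cross_val_score. A minimal end-to-end sketch, using a synthetic dataset from make_classification as a stand-in for real data (30 features so that n_components=10 is valid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 300 samples, 30 features, 10 informative
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=10, random_state=0)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Each fold refits scaler and PCA on the training split only,
# which avoids leaking test-set statistics into the transform
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Because the scaler and PCA live inside the pipeline, there is no risk of fitting them on data the classifier is later evaluated on.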