Dimensionality Reduction
Dimensionality reduction techniques simplify high‑dimensional data, speed up models, and reduce overfitting while preserving as much of the data's structure as possible.
Why Reduce Dimensions?
- High‑dimensional data can lead to the curse of dimensionality for distance‑based methods like KNN and clustering.
- Redundant or noisy features can harm model performance.
- Reduced dimensions make visualization and interpretation easier.
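The curse of dimensionality mentioned above can be made concrete with a small experiment on synthetic data (the data and the `distance_contrast` helper below are illustrative, not from the original): as the number of dimensions grows, pairwise Euclidean distances concentrate, so the gap between the nearest and farthest neighbour shrinks relative to the typical distance.

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(n_dims, n_points=500):
    """(max - min) pairwise distance divided by the mean distance
    from one query point to random points in the unit hypercube."""
    X = rng.random((n_points, n_dims))
    q = rng.random(n_dims)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.mean()

# Contrast shrinks as dimensionality grows, which is what hurts
# distance-based methods like KNN and clustering.
for dims in (2, 10, 100, 1000):
    print(f"{dims:>4} dims: contrast = {distance_contrast(dims):.3f}")
```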
PCA (Principal Component Analysis)
PCA is the most common linear dimensionality reduction method. See the dedicated PCA tutorial for more detail.
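As a quick orientation before the dedicated tutorial, here is a minimal PCA sketch with scikit-learn on synthetic data (the data is made up for illustration; one feature is deliberately redundant):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = X[:, 1] + 0.1 * rng.normal(size=200)  # redundant feature

# Standardize first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # fewer than 10 columns
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```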
t-SNE for Visualization
t‑SNE is a non‑linear technique mainly used for 2D/3D visualization of high‑dimensional data.
- Good at preserving local neighbourhood structure.
- Not ideal as a preprocessing step for supervised models (mainly for visualization).
from sklearn.manifold import TSNE

# X_scaled: standardized feature matrix (t-SNE, like PCA, benefits from scaling).
tsne = TSNE(
    n_components=2,    # embed into 2D for plotting
    perplexity=30,     # roughly the effective number of neighbours per point
    learning_rate=200,
    random_state=42,   # t-SNE is stochastic; fix the seed for reproducibility
)
X_tsne = tsne.fit_transform(X_scaled)
Feature Selection vs Feature Extraction
- Feature selection: keep or drop original features (e.g. using mutual information, model‑based importance).
- Feature extraction: create new features from combinations of the originals (e.g. PCA, autoencoders).
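The distinction above can be shown side by side on a small dataset (a hedged sketch using the Iris data; the choice of `k=2` and `n_components=2` is arbitrary, for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Selection: keeps 2 of the 4 original columns, values unchanged.
selector = SelectKBest(mutual_info_classif, k=2)
X_sel = selector.fit_transform(X, y)

# Extraction: builds 2 new columns as linear combinations of all 4.
X_ext = PCA(n_components=2).fit_transform(X)

print(X_sel.shape, X_ext.shape)  # both (150, 2)
print(selector.get_support())    # boolean mask over the original features
```

Selected columns remain interpretable as the original measurements; extracted components do not, which is the usual trade-off between the two approaches.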
Practical Workflow
- Always start with exploratory data analysis to understand feature distributions and correlations.
- Try simple feature selection (drop constant / duplicate / highly correlated features) before heavier methods.
- Use PCA or t‑SNE primarily for visualization and insight; validate any dimensionality reduction choice with downstream model performance.
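The simple feature-selection step above (dropping constant, duplicate, and highly correlated features) can be sketched with pandas; the toy DataFrame, column names, and the 0.95 correlation threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],  # exact duplicate of "a"
    "c": [5.0, 5.0, 5.0, 5.0],  # constant
    "d": [2.0, 0.5, 3.5, 1.0],
})

# 1. Drop constant columns (only one unique value).
df = df.loc[:, df.nunique() > 1]

# 2. Drop exact duplicate columns (keep the first occurrence).
df = df.T.drop_duplicates().T

# 3. Drop one column of each highly correlated pair (|r| > 0.95).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(list(df.columns))
```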