Machine Learning

Ensemble Methods

Bagging, boosting, gradient boosting, and ensemble learning strategies.

Ensemble Learning

Why Ensembles Work

  • Individual models make different errors; averaging or voting can cancel out some of this noise.
  • Ensembles reduce variance (bagging) or bias (boosting), depending on the method.
  • They are a standard tool in winning solutions to ML competitions.
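The error-cancelling effect can be checked with a small simulation. The sketch below assumes an idealized setting that the text does not state: each "model" predicts the true value plus independent, unit-variance noise. Under that assumption, averaging 10 models should cut the variance to roughly one tenth.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0     # hypothetical quantity every model tries to predict
n_models = 10        # size of the ensemble
n_trials = 20_000    # number of repeated experiments

# Each column is one model: truth plus independent unit-variance noise.
preds = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()           # variance of one model's prediction
ensemble_var = preds.mean(axis=1).var()  # variance of the averaged prediction

print(f"single model variance: {single_var:.3f}")
print(f"ensemble variance:     {ensemble_var:.3f}")
```

Real models trained on the same data make correlated errors, so the reduction in practice is smaller than this idealized 1/n.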

Bagging (Bootstrap Aggregating)

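Bagging trains multiple base learners independently, each on a different bootstrap sample of the training data (rows drawn with replacement), and then combines their predictions: averaging for regression, majority voting or probability averaging for classification.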

  • Reduces the variance of high‑variance models such as decision trees.
  • Random Forest is the most popular bagging‑based ensemble; it also samples a random subset of features at each split to decorrelate the trees.
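The procedure can be written out by hand in a few lines. This is a minimal sketch rather than a production implementation: the synthetic dataset, the 25-tree ensemble size, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: labels are 0/1, so the mean vote is the fraction voting 1.
votes = np.mean([t.predict(X_test) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
bagged_acc = np.mean(y_pred == y_test)
print(f"bagged accuracy: {bagged_acc:.3f}")
```

scikit-learn's BaggingClassifier and RandomForestClassifier package the same idea with more options.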

Boosting

Boosting trains base learners sequentially, where each new model focuses more on the mistakes of the previous ones.

  • Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
  • These methods often achieve state‑of‑the‑art results on tabular data.
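One concrete way to "focus on mistakes" is the reweighting scheme of discrete AdaBoost. The sketch below assumes binary labels recoded to -1/+1, decision stumps as base learners, and 20 rounds; all of these are illustrative choices, not something the text prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y = 2 * y01 - 1  # recode labels from {0, 1} to {-1, +1}

w = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
stumps, alphas = [], []
for _ in range(20):
    # Fit a depth-1 tree (stump) to the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the logarithm finite.
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this stump's vote strength

    # Increase weights of misclassified points, decrease the rest.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(score) == y)
print(f"training accuracy after 20 rounds: {train_acc:.3f}")
```

Gradient boosting, covered later, replaces the explicit reweighting with fitting each new learner to the gradient of a loss function.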

Stacking

Stacking combines the outputs of diverse base models (trees, linear models, neural nets) using a meta‑learner.

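from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Base models: deliberately different model families, so their errors differ.
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=5)),
    # probability=True enables predict_proba, which the stack uses as input.
    ("svm", SVC(probability=True))
]

# The meta-learner is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5  # 5-fold is also the default when cv is not given
)

# Fit and use it like any other estimator, e.g. stack.fit(X_train, y_train).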

Practical Tips for Ensembles

  • Start with simple ensembles like Random Forest before trying more complex stacks.
  • Use cross‑validation to generate out‑of‑fold predictions when stacking to avoid leakage.
  • Watch out for training time and memory usage, especially with many large base models.
  • On tabular data, tree‑based ensembles (Random Forest, Gradient Boosting) are usually a strong baseline and often hard to beat.
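The out-of-fold tip can be made concrete with cross_val_predict. This is a hedged sketch: the two base models, the synthetic data, and the names oof and meta are assumptions for illustration. Each row's meta-feature comes from a model that never saw that row during training, which is what prevents leakage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# One column per base model: out-of-fold probability of class 1.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-learner sees only out-of-fold predictions, never in-sample ones.
meta = LogisticRegression().fit(oof, y)
print(oof.shape, meta.coef_)
```

StackingClassifier performs this cross-validated step internally, so the manual version is mainly useful for custom stacking pipelines.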

Gradient Boosting

Intuition

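  • Start with a simple base prediction (e.g., the mean of the targets for regression).
  • Fit a new tree to the residuals (errors) of the current model; more generally, to the negative gradient of the loss, which equals the residuals for squared error.
  • Add this new tree to the ensemble, scaling its contribution by a learning rate.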
  • Repeat for many iterations to gradually minimize the loss function.
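The four steps above can be sketched for regression with squared-error loss, where the negative gradient is simply the residual. The dataset, the learning rate of 0.1, and the 50 rounds are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Step 1: start from a constant prediction, the mean of the targets.
pred = np.full(len(y), y.mean())
initial_mse = np.mean((y - pred) ** 2)

trees = []
for _ in range(n_rounds):
    # Step 2: fit a small tree to the current residuals.
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    # Step 3: add the tree's output, shrunk by the learning rate.
    pred += learning_rate * tree.predict(X)
    trees.append(tree)  # Step 4: repeat

final_mse = np.mean((y - pred) ** 2)
print(f"training MSE: {initial_mse:.1f} -> {final_mse:.1f}")
```

To predict on new data, sum the initial mean and the scaled outputs of all stored trees. The error reported here is training error only; held-out data is needed to judge generalization.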

GradientBoostingClassifier with scikit-learn

Basic Gradient Boosting example
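from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Stand-in data so the example runs end to end (an assumption, not part of
# the original example); replace with your own feature matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for testing; stratify keeps class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 200 shallow trees, each contributing a small step (learning_rate=0.05).
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)
gb.fit(X_train, y_train)

# Report per-class precision, recall, and F1 on the held-out set.
y_pred = gb.predict(X_test)
print(classification_report(y_test, y_pred))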

Advanced Gradient Boosting (XGBoost, LightGBM, CatBoost)

Modern gradient boosting libraries add powerful optimizations:

  • XGBoost: regularization, tree pruning, parallelization.
  • LightGBM: histogram‑based splits, leaf‑wise growth, very fast on large datasets.
  • CatBoost: strong support for categorical features.

Key Hyperparameters

  • n_estimators: number of boosting stages; too many can overfit, too few can underfit, and the right value depends on learning_rate.
  • learning_rate: how much each tree contributes; lower values often need more trees.
  • max_depth / max_leaf_nodes: control tree complexity.
  • subsample: fraction of training rows used to fit each tree; values below 1.0 add randomness (stochastic gradient boosting) and can improve generalization.
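One way to see how n_estimators interacts with the other settings is to score the model after every boosting stage on a validation set. The sketch below uses staged_predict for this; the synthetic data and the specific hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,   # each tree sees a random 80% of the training rows
    random_state=0,
)
gb.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_acc = [accuracy_score(y_val, p) for p in gb.staged_predict(X_val)]
best_n = int(np.argmax(val_acc)) + 1
print(f"best validation accuracy {max(val_acc):.3f} at {best_n} trees")
```

If the best stage comes well before the last one, a smaller n_estimators (or early stopping via n_iter_no_change) is likely sufficient for this learning rate.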