Machine Learning Ensemble Learning
From Basics to Advanced

Ensemble Learning

Ensemble methods combine multiple base models to build a stronger overall learner that is usually more accurate and robust than any single model.

Why Ensembles Work

  • Individual models make different errors; averaging or voting can cancel out some of this noise.
  • Ensembles reduce variance (bagging) or bias (boosting), depending on the method.
  • They are a standard tool in winning solutions to ML competitions.
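The variance-cancelling effect of voting can be illustrated with a small simulation. Assume (purely for illustration) independent classifiers that are each correct 60% of the time; a majority vote over many of them is correct far more often:

```python
import random

random.seed(0)

def noisy_classifier():
    # Each base model is correct with probability 0.6 (illustrative number).
    # Returns 1 for a correct prediction, 0 for a wrong one.
    return 1 if random.random() < 0.6 else 0

def majority_vote(n_models):
    # The ensemble is correct when more than half its members are correct.
    votes = [noisy_classifier() for _ in range(n_models)]
    return 1 if sum(votes) > n_models / 2 else 0

def accuracy(n_models, trials=10_000):
    return sum(majority_vote(n_models) for _ in range(trials)) / trials

print(accuracy(1))   # around 0.60
print(accuracy(25))  # noticeably higher, around 0.84
```

This only works because the models' errors are (assumed) independent; an ensemble of identical models gains nothing, which is why diversity among base learners matters.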

Bagging (Bootstrap Aggregating)

Bagging trains multiple base learners independently on different bootstrap samples of the training data and then averages their predictions.

  • Reduces variance of high‑variance models like Decision Trees.
  • Random Forest is the most popular bagging‑based ensemble.

Boosting

Boosting trains base learners sequentially, where each new model focuses more on the mistakes of the previous ones.

  • Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
  • Boosted tree ensembles often achieve state‑of‑the‑art results on tabular data.
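The sequential error-correcting behaviour can be observed with scikit-learn's staged predictions, which expose the ensemble's output after each boosting round (the dataset and hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new tree is fit to the gradient of the loss, i.e. the
# residual errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators
# boosting rounds, so we can watch accuracy improve as mistakes
# of earlier rounds are corrected.
for i, y_pred in enumerate(gb.staged_predict(X_te), start=1):
    if i in (1, 10, 200):
        acc = (y_pred == y_te).mean()
        print(f"{i:3d} trees: accuracy {acc:.3f}")
```

Held-out accuracy usually rises quickly over the first rounds and then flattens; monitoring a validation curve like this is the standard way to choose the number of boosting rounds.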

Stacking

Stacking combines the outputs of diverse base models (trees, linear models, neural nets) using a meta‑learner.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Diverse base models: a shallow tree and an SVM
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("svm", SVC(probability=True)),
]

# The meta-learner is trained on out-of-fold predictions of the
# base models; cv controls the internal cross-validation split.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)

X, y = make_classification(n_samples=200, random_state=0)
stack.fit(X, y)
print(stack.score(X, y))

Practical Tips for Ensembles

  • Start with simple ensembles like Random Forest before trying more complex stacks.
  • Use cross‑validation to generate out‑of‑fold predictions when stacking to avoid leakage (scikit‑learn's StackingClassifier does this internally via its cv parameter).
  • Watch out for training time and memory usage, especially with many large base models.
  • On tabular data, tree‑based ensembles (Random Forest, Gradient Boosting) are usually the strongest baseline.
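The out-of-fold tip above can also be sketched manually with cross_val_predict: each training row receives a prediction from a model that never saw that row, so the meta-learner's input features are leakage-free (the dataset and base models are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Out-of-fold class-1 probabilities: row i is predicted by a model
# trained on folds that exclude row i, preventing target leakage.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-learner is fit on the out-of-fold predictions, one
# column per base model.
meta = LogisticRegression().fit(oof, y)
print(meta.score(oof, y))
```

Fitting the meta-learner on in-fold (training) predictions instead would make it trust overfit base models far too much; the out-of-fold matrix is what makes stacking honest.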