Machine Learning
Ensemble Learning
From Basics to Advanced
Ensemble Learning
Ensemble methods combine multiple base models to build a stronger overall learner that is usually more accurate and robust than any single model.
Why Ensembles Work
- Individual models make different errors; averaging or voting can cancel out some of this noise.
- Ensembles reduce variance (bagging) or bias (boosting), depending on the method.
- They are a standard tool in winning solutions to ML competitions.
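The error-cancellation claim can be checked numerically. This is a toy sketch, assuming each "model" predicts a true value of 0 with independent Gaussian noise; the model count and noise scale are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 25 hypothetical models, each predicting the true value 0 with
# independent Gaussian noise (std = 1). Averaging uncorrelated errors
# shrinks the variance of the combined prediction by roughly 1/n.
n_models, n_trials = 25, 10_000
preds = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()           # variance of one model's error
ensemble_var = preds.mean(axis=1).var()  # variance of the averaged error

print(f"single model variance:   {single_var:.3f}")
print(f"ensemble (avg) variance: {ensemble_var:.3f}")
```

With 25 independent models the averaged prediction's variance drops to about 1/25 of a single model's; in practice real models' errors are correlated, so the gain is smaller but still real.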
Bagging (Bootstrap Aggregating)
Bagging trains multiple base learners independently on different bootstrap samples of the training data and then averages their predictions.
- Reduces variance of high‑variance models like Decision Trees.
- Random Forest is the most popular bagging‑based ensemble.
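A minimal Random Forest sketch with scikit-learn; the synthetic dataset and hyperparameters are illustrative assumptions, not from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is fit on a bootstrap sample of X_train;
# class predictions are combined by (soft) majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")
```

Random Forest additionally decorrelates the trees by considering a random subset of features at each split, which is why it usually beats plain bagged trees.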
Boosting
Boosting trains base learners sequentially, where each new model focuses more on the mistakes of the previous ones.
- Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
- Boosting methods often achieve state‑of‑the‑art results on tabular data.
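The sequential idea can be sketched with scikit-learn's gradient boosting; dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees are added one at a time; each new tree fits the gradient
# of the loss, i.e. it concentrates on what the current ensemble still
# gets wrong. learning_rate scales each tree's contribution.
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbt.fit(X_train, y_train)
print(f"test accuracy: {gbt.score(X_test, y_test):.3f}")
```

XGBoost, LightGBM, and CatBoost implement the same core idea with regularization and engineering optimizations on top.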
Stacking
Stacking combines the outputs of diverse base models (trees, linear models, neural nets) using a meta‑learner.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Diverse base models: a tree and an SVM. probability=True lets the
# SVM expose predict_proba, which the meta-learner consumes.
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("svm", SVC(probability=True)),
]

# The logistic regression meta-learner learns how to weight the
# base models' outputs.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
)
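To see the stacked classifier end to end, here is a self-contained run on a synthetic dataset; the data and the cv=5 setting are illustrative assumptions, not from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=5)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
print(f"stacked test accuracy: {stack.score(X_test, y_test):.3f}")
```

Note that StackingClassifier handles the out-of-fold bookkeeping internally via its cv parameter, so the meta-learner never sees predictions a base model made on its own training rows.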
Practical Tips for Ensembles
- Start with simple ensembles like Random Forest before trying more complex stacks.
- Use cross‑validation to generate out‑of‑fold predictions when stacking to avoid leakage.
- Watch out for training time and memory usage, especially with many large base models.
- On tabular data, tree‑based ensembles (Random Forest, Gradient Boosting) are usually the strongest baseline.
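The out-of-fold tip above can be sketched manually with cross_val_predict; the single base model and toy dataset are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)

# Out-of-fold probabilities: each row's prediction comes from a model
# that never saw that row during training, so the resulting
# meta-features are leakage-free.
oof = cross_val_predict(DecisionTreeClassifier(max_depth=4), X, y,
                        cv=5, method="predict_proba")[:, 1]
meta_X = oof.reshape(-1, 1)  # one meta-feature column per base model

# Meta-learner is fit on the out-of-fold features only.
meta = LogisticRegression().fit(meta_X, y)
print(f"meta-learner accuracy on OOF features: {meta.score(meta_X, y):.3f}")
```

With several base models you would stack one such column per model; fitting the meta-learner on in-sample predictions instead would inflate its apparent skill.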