Machine Learning
Ensemble Learning
From Basics to Advanced
Ensemble Learning
Ensemble methods combine multiple base models to build a stronger overall learner that is usually more accurate and robust than any single model.
Why Ensembles Work
- Individual models make different errors; averaging or voting can cancel out some of this noise.
- Ensembles reduce variance (bagging) or bias (boosting), depending on the method.
- They are a standard tool in winning solutions to ML competitions.
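The error-cancellation claim can be checked numerically. This is a toy sketch, assuming each "model" predicts a true value of 0 with independent Gaussian noise; the model count and noise scale are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 25 hypothetical models, each predicting the true value 0 with
# independent Gaussian noise (std = 1). Averaging uncorrelated errors
# shrinks the variance of the combined prediction by roughly 1/n.
n_models, n_trials = 25, 10_000
preds = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()           # variance of one model's error
ensemble_var = preds.mean(axis=1).var()  # variance of the averaged error

print(f"single model variance:   {single_var:.3f}")
print(f"ensemble (avg) variance: {ensemble_var:.3f}")
```

With 25 independent models the averaged prediction's variance drops to about 1/25 of a single model's; in practice real models' errors are correlated, so the gain is smaller but still real.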
Bagging (Bootstrap Aggregating)
Bagging trains multiple base learners independently on different bootstrap samples of the training data and then averages their predictions.
- Reduces variance of high‑variance models like Decision Trees.
- Random Forest is the most popular bagging‑based ensemble.
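A minimal Random Forest sketch with scikit-learn; the synthetic dataset and hyperparameters are illustrative assumptions, not from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is fit on a bootstrap sample of X_train;
# class predictions are combined by (soft) majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")
```

Random Forest additionally decorrelates the trees by considering a random subset of features at each split, which is why it usually beats plain bagged trees.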
Boosting
Boosting trains base learners sequentially, where each new model focuses more on the mistakes of the previous ones.
- Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
- Boosting methods often achieve state‑of‑the‑art results on tabular data.
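The sequential idea can be sketched with scikit-learn's gradient boosting; dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees are added one at a time; each new tree fits the gradient
# of the loss, i.e. it concentrates on what the current ensemble still
# gets wrong. learning_rate scales each tree's contribution.
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbt.fit(X_train, y_train)
print(f"test accuracy: {gbt.score(X_test, y_test):.3f}")
```

XGBoost, LightGBM, and CatBoost implement the same core idea with regularization and engineering optimizations on top.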
Stacking
Stacking combines the outputs of diverse base models (trees, linear models, neural nets) using a meta‑learner.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Diverse base models: a tree and an SVM. probability=True lets the
# SVM expose predict_proba, which the meta-learner consumes.
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("svm", SVC(probability=True)),
]

# The logistic regression meta-learner learns how to weight the
# base models' outputs.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
)
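To see the stacked classifier end to end, here is a self-contained run on a synthetic dataset; the data and the cv=5 setting are illustrative assumptions, not from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=5)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
print(f"stacked test accuracy: {stack.score(X_test, y_test):.3f}")
```

Note that StackingClassifier handles the out-of-fold bookkeeping internally via its cv parameter, so the meta-learner never sees predictions a base model made on its own training rows.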
Practical Tips for Ensembles
- Start with simple ensembles like Random Forest before trying more complex stacks.
- Use cross‑validation to generate out‑of‑fold predictions when stacking to avoid leakage.
- Watch out for training time and memory usage, especially with many large base models.
- On tabular data, tree‑based ensembles (Random Forest, Gradient Boosting) are usually the strongest baseline.
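The out-of-fold tip above can be sketched manually with cross_val_predict; the single base model and toy dataset are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)

# Out-of-fold probabilities: each row's prediction comes from a model
# that never saw that row during training, so the resulting
# meta-features are leakage-free.
oof = cross_val_predict(DecisionTreeClassifier(max_depth=4), X, y,
                        cv=5, method="predict_proba")[:, 1]
meta_X = oof.reshape(-1, 1)  # one meta-feature column per base model

# Meta-learner is fit on the out-of-fold features only.
meta = LogisticRegression().fit(meta_X, y)
print(f"meta-learner accuracy on OOF features: {meta.score(meta_X, y):.3f}")
```

With several base models you would stack one such column per model; fitting the meta-learner on in-sample predictions instead would inflate its apparent skill.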