Machine Learning
Ensemble Methods
Bagging, boosting, gradient boosting, and ensemble learning strategies.
Ensemble Learning
Why Ensembles Work
- Individual models make different errors; averaging or voting can cancel out some of this noise.
- Ensembles reduce variance (bagging) or bias (boosting), depending on the method.
- They are a standard tool in winning solutions to ML competitions.
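The error-cancelling argument above can be made concrete with a small calculation. The sketch below assumes an idealized setting that real models rarely meet exactly: every model is correct with the same probability p, and their errors are fully independent. The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p, independently of the
    others, and that n_models is odd so that ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Compare a single model with ensembles of 5 and 25 voters, each 60% accurate.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

In practice base models trained on the same data make correlated errors, so the gain is smaller than this calculation suggests; that is why bagging and random feature selection try to decorrelate them.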
Bagging (Bootstrap Aggregating)
Bagging trains multiple base learners independently on different bootstrap samples of the training data (drawn with replacement) and then combines their predictions: averaging for regression, majority voting for classification.
- Reduces variance of high‑variance models like Decision Trees.
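- Random Forest is the most popular bagging‑based ensemble; it also considers only a random subset of features at each split, which further decorrelates the trees.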
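A minimal sketch of how bagging might be compared against a single tree in scikit-learn. The dataset is synthetic (an assumption standing in for real tabular data), and the model settings are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): stands in for a real tabular dataset.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # BaggingClassifier uses a decision tree as its default base learner.
    "bagged trees": BaggingClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The exact numbers depend on the data and the scikit-learn version, so treat the comparison as something to run on your own dataset rather than a fixed result.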
Boosting
Boosting trains base learners sequentially, where each new model focuses more on the mistakes of the previous ones.
- Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
- These methods often achieve state‑of‑the‑art results on tabular data.
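To illustrate what "focusing on mistakes" can mean, here is a sketch of one round of the weight update used by discrete AdaBoost for binary labels. It is only one of the boosting schemes listed above (gradient boosting works with residuals instead of sample weights), and the function name `adaboost_reweight` and the toy data are hypothetical.

```python
import math


def adaboost_reweight(weights, is_wrong):
    """One round of the discrete AdaBoost sample-weight update.

    weights:  current sample weights, summing to 1.
    is_wrong: True where the current weak learner misclassified the sample.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # Vote strength of this learner in the final ensemble.
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of mistakes, decrease weights of correct samples.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_wrong)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Hypothetical round: 10 equally weighted samples, the learner misses 2 of them.
weights = [0.1] * 10
is_wrong = [True, True] + [False] * 8
new_weights, alpha = adaboost_reweight(weights, is_wrong)
print(new_weights[0], new_weights[2], alpha)
```

After the update the misclassified samples together carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.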
Stacking
Stacking combines the outputs of diverse base models (trees, linear models, neural nets) using a meta‑learner.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Diverse base models; their predictions become the meta-learner's inputs.
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("svm", SVC(probability=True)),
]

# The meta-learner (final_estimator) is trained on cross-validated base predictions.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
)
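The snippet above only builds the stack. A minimal sketch of fitting and scoring it might look like the following, assuming a synthetic dataset in place of real data; `cv=5` is spelled out to show where the internal cross-validation is configured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): replace with your own features and labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # folds used to produce the base predictions the meta-learner sees
)

# fit() trains the base models and then the meta-learner on their predictions.
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"stacked test accuracy: {test_accuracy:.3f}")
```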
Practical Tips for Ensembles
- Start with simple ensembles like Random Forest before trying more complex stacks.
- Use cross‑validation to generate out‑of‑fold predictions when stacking to avoid leakage.
- Watch out for training time and memory usage, especially with many large base models.
- On tabular data, tree‑based ensembles (Random Forest, Gradient Boosting) are usually the strongest baseline.
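The out-of-fold tip above can be sketched by hand with `cross_val_predict`. This is a simplified illustration on synthetic data with two arbitrarily chosen base models, not a full stacking pipeline; `StackingClassifier` performs a similar step internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# For every row, the predicted probability comes from a model fitted on the
# other folds, so the meta-learner never sees predictions a base model made
# on its own training rows.
oof_columns = [
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
]
meta_features = np.column_stack(oof_columns)

# The meta-learner is trained on the out-of-fold predictions only.
meta_learner = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)
```

To predict on new data, each base model would then be refit on the full training set and its predictions passed to `meta_learner`.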
Gradient Boosting
Intuition
- Start with a simple base prediction (e.g., mean of targets).
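- Fit a new tree to the residuals (errors) of the current model; more generally, to the negative gradient of the loss, which equals the residuals for squared error.
- Add this new tree to the ensemble, scaled by a learning rate.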
- Repeat for many iterations to gradually minimize the loss function.
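The four steps above can be sketched from scratch for regression with squared-error loss, where the negative gradient is simply the residual. This is a teaching sketch on synthetic data, not how scikit-learn implements gradient boosting internally; the helper `predict` and the settings are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption): noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Step 1: start from a constant prediction, the mean of the targets.
base_prediction = y.mean()
prediction = np.full_like(y, base_prediction)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # Step 2: fit a small tree to the residuals of the current model.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Step 3: add the tree's output, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    # Step 4: repeat, tracking the training loss.
    mse_history.append(np.mean((y - prediction) ** 2))


def predict(X_new):
    """Sum the base prediction and every tree's scaled contribution."""
    out = np.full(X_new.shape[0], base_prediction)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Training error keeps falling as trees are added; whether the held-out error also falls is what the learning rate and number of rounds need to be tuned for.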
GradientBoostingClassifier with scikit-learn
Basic Gradient Boosting example
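from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder data (an assumption, not part of the original example); replace with your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% for testing; stratify keeps class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 200 shallow trees, each contributing a small step (learning_rate=0.05).
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42,
)
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
print(classification_report(y_test, y_pred))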
Advanced Gradient Boosting (XGBoost, LightGBM, CatBoost)
Modern gradient boosting libraries add powerful optimizations:
- XGBoost: regularization, tree pruning, parallelization.
- LightGBM: histogram‑based splits, leaf‑wise growth, very fast on large datasets.
- CatBoost: strong support for categorical features.
Key Hyperparameters
- n_estimators: number of boosting stages (too low → underfitting; too high → possible overfitting and longer training).
- learning_rate: how much each tree contributes; lower values often need more trees.
- max_depth / max_leaf_nodes: control tree complexity.
- subsample: fraction of training rows sampled for each tree; values below 1.0 add randomness (stochastic gradient boosting) and can improve generalization.
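One way to see how `n_estimators` interacts with the other settings is to score the model after every boosting stage with `staged_predict`. The sketch below assumes synthetic data and illustrative hyperparameter values; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) split into training and validation sets.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,  # each tree sees a random 80% of the training rows
    random_state=0,
)
gb.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_accuracy = [
    accuracy_score(y_val, y_stage) for y_stage in gb.staged_predict(X_val)
]

# Number of trees with the highest validation accuracy (first one on ties).
best_n_trees = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1
print(f"best number of trees on validation data: {best_n_trees}")
```

Because `best_n_trees` is chosen on the validation set, the matching accuracy is an optimistic estimate; a separate test set is needed for an unbiased figure.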