Topic 08 · Phase 2 · Animated

Learn from Your Mistakes

Boosting — sequential learners that each correct the errors of the ones before them

1

The Problem

Random Forest builds its trees independently and in parallel. Can we do better by building trees sequentially, each one focused on fixing the mistakes of the ones before it?

Random Forest: parallel, independent trees
Boosting: sequential, error-correcting trees
"XGBoost wins more Kaggle competitions than any other algorithm — boosting done right is nearly unbeatable on structured data."
2

The Intuition

Each new tree is trained to predict the residuals (errors) of all previous trees combined. The final prediction sums up all the trees, each weighted by a learning rate.

1. Tree 1 fits the data → residuals are large
2. Tree 2 fits the residuals of Tree 1 → residuals shrink
N. Tree N fits tiny residuals → near-perfect fit
Final: ŷ = F₀ + η·f₁(x) + η·f₂(x) + … + η·fₙ(x), where η is the learning rate and F₀ is the initial prediction (e.g. the mean of y)
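The whole loop fits in a few lines. Below is a minimal sketch, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the function names and the toy data are illustrative only, not taken from any library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, eta=0.1, max_depth=2):
    """Return the initial prediction F0 and the list of fitted trees."""
    F0 = y.mean()                          # start from a constant prediction
    pred = np.full(len(y), F0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # each new tree fits the residuals
        pred += eta * tree.predict(X)      # shrink the update by the learning rate
        trees.append(tree)
    return F0, trees

def predict_boosted(X, F0, trees, eta=0.1):
    """ŷ = F0 + η·f1(x) + … + η·fN(x)"""
    return F0 + eta * sum(tree.predict(X) for tree in trees)

# Toy usage: fit y = x² with noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)
F0, trees = fit_boosted_trees(X, y)
print(predict_boosted(X[:5], F0, trees))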
3

The Math

Gradient Boosting Update
F_m(x) = F_{m-1}(x) + η · h_m(x)

h_m fits the negative gradient of the loss (= residuals for MSE loss)
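For example, with squared-error loss l(y, F) = ½·(y − F)², the negative gradient −∂l/∂F = y − F is exactly the residual, so "fit the gradient" and "fit the previous errors" are the same step.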

XGBoost Adds Regularisation
L = Σ l(yᵢ, ŷᵢ) + Σ Ω(fₖ)  where Ω(f) = γT + ½λ‖w‖² (T = number of leaves, w = leaf weights)
Key Hyperparameters

n_estimators — number of trees  |  learning_rate (η) — step size  |  max_depth — tree complexity
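As a point of reference, here is how those three knobs (plus the regularisation terms from the objective above) appear in xgboost's scikit-learn-style estimator. This is a sketch under the assumption that the xgboost package is installed; the values are illustrative, not recommendations.

from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

model = XGBRegressor(
    n_estimators=500,    # number of boosting rounds (trees)
    learning_rate=0.05,  # eta: shrinkage applied to each tree's contribution
    max_depth=4,         # depth of each individual tree
    reg_lambda=1.0,      # lambda: L2 penalty on leaf weights (the Ω term above)
    gamma=0.0,           # gamma: minimum loss reduction required to make a split
)
model.fit(X, y)

A lower learning_rate usually needs a larger n_estimators to reach the same training loss, which is why the two are normally tuned together.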

4

Assumptions & Pitfalls

Overfitting with too many trees. Unlike Random Forest, boosting can overfit if n_estimators is too large. Use early stopping (see the sketch after this list).
Sensitive to outliers. Boosting focuses on hard examples, so outliers get disproportionate weight. Use a robust loss such as Huber (also shown below).
Slow training. Sequential by nature. XGBoost and LightGBM use optimisations (histogram-based, GPU) to speed this up.
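Both mitigations above (early stopping and a robust loss) are available out of the box in scikit-learn's GradientBoostingRegressor. A minimal sketch on synthetic placeholder data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=15.0, random_state=0)

model = GradientBoostingRegressor(
    loss="huber",             # robust loss: outliers no longer dominate the residuals
    n_estimators=2000,        # deliberately large; early stopping decides when to stop
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.1,  # hold out 10% of the training data for the stopping check
    n_iter_no_change=20,      # stop after 20 rounds without validation improvement
    random_state=0,
)
model.fit(X, y)
print("trees actually used:", model.n_estimators_)

With n_iter_no_change set, training stops once the held-out validation score stops improving, so a deliberately large n_estimators is safe.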
5

When to Use

Strengths

  • State-of-the-art on structured/tabular data
  • Handles mixed feature types
  • Built-in regularisation (XGBoost)
  • Flexible loss functions

Limitations

  • Can overfit (needs early stopping)
  • More hyperparameters than Random Forest
  • Not interpretable
  • Slow on very large datasets (vs LightGBM)
Typical use cases: Kaggle competitions · click-through rate prediction · credit risk · ranking problems

Key Takeaways

Each Tree Corrects Errors

New trees focus on the hardest examples — the ones previous trees got wrong. Errors shrink with each round.

Learning Rate Matters

Lower η = more trees needed but better generalisation. Common practice: 0.01–0.1 with early stopping.

XGBoost / LightGBM

Modern boosting adds L1/L2 regularisation, histogram tricks, and GPU support. In practice, prefer XGBoost or LightGBM over a vanilla GBM implementation.