[Animation: coefficients β₁–β₅ shrinking to β₁'–β₅' under the penalty]

Shrink to Generalise

Regularisation: taming overfitting by penalising large coefficients

1

The Problem

Your linear regression model fits the training data perfectly — R² = 0.99 — but performs terribly on new data. The model has learned the noise, not the signal. This is overfitting.

Overfit Model:      Train R² 0.99 | Test R² 0.41
Regularised Model:  Train R² 0.91 | Test R² 0.87
"A model that memorises the training set has learned nothing useful."
2

The Intuition

Add a penalty term to the loss function that punishes large coefficients. The model now balances fitting the data and keeping the coefficients small.

Ridge (L2)
Adds sum of squared coefficients to loss. Shrinks all coefficients toward zero — but never exactly to zero. Handles multicollinearity well.
Lasso (L1)
Adds sum of absolute coefficients to loss. Can shrink coefficients exactly to zero — performs automatic feature selection.
ElasticNet
Combines the L1 and L2 penalties. One hyperparameter sets the overall strength and a second controls the L1/L2 mix. Best of both worlds when features are many and correlated; see the sketch below.
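
A sketch comparing the three penalties on the same synthetic data, again assuming scikit-learn; the alpha and l1_ratio values are illustrative rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Ridge (L2)": Ridge(alpha=1.0),                     # shrinks, never exactly zero
    "Lasso (L1)": Lasso(alpha=1.0),                     # can zero coefficients out
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),  # blend of the two penalties
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    zeroed = int(np.sum(model.coef_ == 0.0))
    print(f"{name:10s} test R² = {model.score(X_te, y_te):.2f}, "
          f"zeroed coefficients = {zeroed}")
```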
3

The Math

Ridge Loss
J = MSE + λ Σ βⱼ²
Lasso Loss
J = MSE + λ Σ |βⱼ|
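
Written out in full, with the MSE expanded over n training examples and p features (the intercept β₀ is conventionally left out of the penalty):

```latex
J_{\text{ridge}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}

J_{\text{lasso}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
```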
Key Insight

λ (alpha) is the regularisation strength. λ=0 → plain regression. λ→∞ → all coefficients → 0. Choose via cross-validation.
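
A sketch of that search, assuming scikit-learn's built-in cross-validated estimators; the alpha grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 25)   # candidate λ values, log-spaced
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best λ (ridge):", ridge.alpha_)
print("best λ (lasso):", lasso.alpha_)
```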

4

Assumptions & Pitfalls

Scale matters. Always standardise features before regularising. A feature measured in thousands will be penalised less than one measured in ones (see the pipeline sketch after this list).
λ too large = underfitting. If the penalty is too harsh, all useful coefficients get suppressed. Use cross-validation to find the Goldilocks λ.
Lasso instability. When features are highly correlated, Lasso arbitrarily picks one and zeroes the rest. ElasticNet is more stable.
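
A sketch of the "scale first" rule, assuming scikit-learn; wrapping the scaler and the model in a Pipeline also keeps test folds out of the scaler's fit during cross-validation. The exaggerated rescaling of the first feature is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X[:, 0] *= 1000.0   # one feature on a much larger scale, as in the pitfall above

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # scale, then penalise
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R²: {scores.mean():.2f}")
```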
5

When to Use

Strengths

  • Directly combats overfitting
  • Lasso gives automatic feature selection
  • Ridge handles multicollinearity
  • Interpretable — still a linear model

Limitations

  • Adds a hyperparameter (λ) to tune
  • Features must be scaled first
  • Doesn't help with non-linearity
  • Lasso unstable with correlated features
Typical use cases: high-dimensional regression · feature selection (Lasso) · multicollinear data (Ridge)

Key Takeaways

Penalty Buys Generalisation

Accepting slightly worse training performance in exchange for much better test performance is the whole point.


Lasso = Sparse Models

Lasso's L1 penalty can zero out irrelevant features entirely — useful when you have hundreds of features and want interpretability.

λ Controls the Tradeoff

Find optimal λ with cross-validation. Plot a regularisation path — coefficient values vs log(λ) — to understand what's being penalised.
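
A sketch of that plot, assuming scikit-learn and matplotlib; it traces each Lasso coefficient as λ grows, so you can see which features survive the penalty.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# One coefficient trajectory per feature, evaluated along a grid of λ values.
alphas, coefs, _ = lasso_path(X, y, alphas=np.logspace(-2, 2, 50))

plt.plot(np.log10(alphas), coefs.T)
plt.xlabel("log10(λ)")
plt.ylabel("coefficient value")
plt.title("Lasso regularisation path")
plt.show()
```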