[Animation: coefficients β₁–β₅ shrinking to β₁'–β₅' under the penalty]

Shrink to Generalise

Regularisation: taming overfitting by penalising large coefficients

1

The Problem

Your linear regression model fits the training data perfectly — R² = 0.99 — but performs terribly on new data. The model has learned the noise, not the signal. This is overfitting.

Overfit Model:      Train R² 0.99 | Test R² 0.41
Regularised Model:  Train R² 0.91 | Test R² 0.87
"A model that memorises the training set has learned nothing useful."
2

The Intuition

Add a penalty term to the loss function that punishes large coefficients. The model now balances fitting the data and keeping the coefficients small.

Ridge (L2)
Adds sum of squared coefficients to loss. Shrinks all coefficients toward zero — but never exactly to zero. Handles multicollinearity well.
Lasso (L1)
Adds sum of absolute coefficients to loss. Can shrink coefficients exactly to zero — performs automatic feature selection.
ElasticNet
Combines the L1 and L2 penalties. One hyperparameter sets the overall strength and a second controls the L1/L2 mix. Best of both worlds when features are many and correlated; see the sketch below.
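
A sketch comparing the three penalties on the same synthetic data, again assuming scikit-learn; the alpha and l1_ratio values are illustrative rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Ridge (L2)": Ridge(alpha=1.0),                     # shrinks, never exactly zero
    "Lasso (L1)": Lasso(alpha=1.0),                     # can zero coefficients out
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),  # blend of the two penalties
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    zeroed = int(np.sum(model.coef_ == 0.0))
    print(f"{name:10s} test R² = {model.score(X_te, y_te):.2f}, "
          f"zeroed coefficients = {zeroed}")
```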
3

The Math

Ridge Loss
J = MSE + λ Σ βⱼ²
Lasso Loss
J = MSE + λ Σ |βⱼ|
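
Written out in full, with the MSE expanded over n training examples and p features (the intercept β₀ is conventionally left out of the penalty):

```latex
J_{\text{ridge}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}

J_{\text{lasso}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
```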
Key Insight

λ (alpha) is the regularisation strength. λ=0 → plain regression. λ→∞ → all coefficients → 0. Choose via cross-validation.
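
A sketch of that search, assuming scikit-learn's built-in cross-validated estimators; the alpha grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 25)   # candidate λ values, log-spaced
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best λ (ridge):", ridge.alpha_)
print("best λ (lasso):", lasso.alpha_)
```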

4

Assumptions & Pitfalls

Scale matters. Always standardise features before regularising. A feature measured in thousands will be penalised less than one measured in ones (see the pipeline sketch after this list).
λ too large = underfitting. If the penalty is too harsh, all useful coefficients get suppressed. Use cross-validation to find the Goldilocks λ.
Lasso instability. When features are highly correlated, Lasso arbitrarily picks one and zeroes the rest. ElasticNet is more stable.
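
A sketch of the "scale first" rule, assuming scikit-learn; wrapping the scaler and the model in a Pipeline also keeps test folds out of the scaler's fit during cross-validation. The exaggerated rescaling of the first feature is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X[:, 0] *= 1000.0   # one feature on a much larger scale, as in the pitfall above

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # scale, then penalise
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R²: {scores.mean():.2f}")
```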
5

When to Use

Strengths

  • Directly combats overfitting
  • Lasso gives automatic feature selection
  • Ridge handles multicollinearity
  • Interpretable — still a linear model

Limitations

  • Adds a hyperparameter (λ) to tune
  • Features must be scaled first
  • Doesn't help with non-linearity
  • Lasso unstable with correlated features
Typical use cases: high-dimensional regression · feature selection (Lasso) · multicollinear data (Ridge)

Key Takeaways

Penalty Buys Generalisation

Accepting slightly worse training performance in exchange for much better test performance is the whole point.


Lasso = Sparse Models

Lasso's L1 penalty can zero out irrelevant features entirely — useful when you have hundreds of features and want interpretability.

λ Controls the Tradeoff

Find optimal λ with cross-validation. Plot a regularisation path — coefficient values vs log(λ) — to understand what's being penalised.
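
A sketch of that plot, assuming scikit-learn and matplotlib; it traces each Lasso coefficient as λ grows, so you can see which features survive the penalty.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# One coefficient trajectory per feature, evaluated along a grid of λ values.
alphas, coefs, _ = lasso_path(X, y, alphas=np.logspace(-2, 2, 50))

plt.plot(np.log10(alphas), coefs.T)
plt.xlabel("log10(λ)")
plt.ylabel("coefficient value")
plt.title("Lasso regularisation path")
plt.show()
```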