Topic 13 · Phase 3

Which Model Should You Trust?

Model Selection — cross-validation, information criteria, and hyperparameter tuning

1 · The Problem

You've trained a linear regression, a Random Forest, and an XGBoost on the same dataset. They all look good on training data. Which one will generalise to unseen data? You can't tell without a rigorous evaluation framework.

Linear:   Train 0.81 · Test ???
RF:       Train 0.97 · Test ???
XGBoost:  Train 0.95 · Test ???
"Training performance tells you about the past. Cross-validation tells you about the future."

2 · The Intuition

K-Fold cross-validation splits data into K equal folds. Train on K-1 folds, test on the remaining fold. Rotate until every fold has been the test set. Average the K scores. This uses all your data for both training and testing.

5-Fold CV: each fold takes a turn as test set
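
Below is a minimal sketch of this rotation with scikit-learn's KFold and cross_val_score. The synthetic dataset, the linear model, and the R² metric are placeholder choices, not the ones from the comparison above.

```python
# A sketch of 5-fold CV on synthetic placeholder data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# shuffle=True randomises which rows land in each fold; fix random_state for repeatability
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For each of the 5 folds: train on the other 4, score on the held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("per-fold R²:", scores.round(3))
print("mean ± std :", f"{scores.mean():.3f} ± {scores.std():.3f}")
```

The mean of the five scores is the CV estimate; the spread across folds hints at how stable that estimate is.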

3 · The Math

K-Fold CV Score
CV = (1/K) Σₖ score(model trained on D\Dₖ, evaluated on Dₖ)
AIC / BIC (penalise complexity)
AIC = 2k − 2ln(L̂)  |  BIC = k·ln(n) − 2ln(L̂)

k = number of parameters, L̂ = maximised likelihood, n = number of observations. Lower is better.
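
As a quick check of the formulas, here is a sketch that computes AIC and BIC by hand for an ordinary least squares fit and compares them with statsmodels' built-in values. The data are synthetic placeholders.

```python
# AIC/BIC by hand for an OLS fit, checked against statsmodels (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

fit = sm.OLS(y, sm.add_constant(X)).fit()

k = fit.df_model + 1   # fitted coefficients incl. intercept (statsmodels' convention for OLS)
n = fit.nobs
log_L = fit.llf        # maximised log-likelihood ln(L̂)

aic = 2 * k - 2 * log_L
bic = k * np.log(n) - 2 * log_L
print(f"AIC by hand {aic:.1f}  | statsmodels {fit.aic:.1f}")
print(f"BIC by hand {bic:.1f}  | statsmodels {fit.bic:.1f}")
```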

Hyperparameter Search

GridSearchCV — exhaustive search over a grid (slow). RandomizedSearchCV — random sample of the grid (faster). Optuna/Hyperopt — Bayesian optimisation (most sample-efficient).
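
A sketch of the middle option: RandomizedSearchCV over a Random Forest. The parameter ranges and the 20-candidate budget are illustrative choices, not tuned recommendations.

```python
# Random search: sample 20 hyperparameter combinations, score each with 5-fold CV.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,        # 20 random candidates instead of the full grid
    cv=5,             # each candidate scored by 5-fold CV
    scoring="r2",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV R² :", round(search.best_score_, 3))
```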

4 · Assumptions & Pitfalls

Data leakage. Preprocessing (scaling, feature selection) must happen inside the CV loop, not before. Otherwise you have already peeked at the test folds (see the Pipeline sketch below).
Test set must be held out until the end. Once you use the test set to tune hyperparameters, it's no longer a test set — it becomes validation data.
Time series: no random splits. For temporal data, always use TimeSeriesSplit — train on past, test on future. Never shuffle.
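
A sketch of both fixes on synthetic placeholder data: the scaler sits inside a Pipeline, so cross_val_score re-fits it on each training split only, and TimeSeriesSplit replaces shuffled K-fold when rows are ordered in time.

```python
# Leakage-safe CV: preprocessing inside the Pipeline, splitter matched to the data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# The scaler is fitted on the training folds only, then applied to the test fold
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# i.i.d. data: shuffled K-fold is fine
iid_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Temporal data: train on the past, test on the future, never shuffle
# (the synthetic rows here have no real time order; shown for the API only)
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

print("KFold R²          :", iid_scores.round(3))
print("TimeSeriesSplit R²:", ts_scores.round(3))
```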

5 · When to Use

Strengths

  • Nearly unbiased estimate of generalisation error
  • Uses all data efficiently
  • Detects data leakage and overfitting
  • Enables fair model comparisons

Limitations

  • K × training time cost
  • High variance on small datasets
  • Wrong for time series without care
  • Doesn't replace a true held-out test set
Use for: any model comparison, hyperparameter tuning, feature selection.
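
For the comparison to be fair, every candidate must be scored on the same folds. Below is a sketch of that setup; GradientBoostingRegressor stands in for XGBoost, and the data are synthetic placeholders.

```python
# Fair comparison: identical CV splits for every model.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # same splits for all candidates

models = {
    "Linear": LinearRegression(),
    "RF": RandomForestRegressor(random_state=0),
    "Boosting": GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name:8s} CV R² = {scores.mean():.3f} ± {scores.std():.3f}")
```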

Key Takeaways

Cross-Validate Everything

Never compare models on the training set. 5-fold CV is the minimum; 10-fold for smaller datasets.

Avoid Data Leakage

Put all preprocessing inside a Pipeline and inside the CV loop. Even a small leak can produce optimistically biased results.

One True Test Set

Hold out a final test set. Touch it exactly once — after all tuning is done. Its result is your honest performance estimate.
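
A sketch of the full protocol with placeholder data and model: split once, tune only on the training portion, then score the held-out test set a single time at the end.

```python
# One split, CV-based tuning on the training part, a single final test-set score.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# 1. The split happens before any tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. All tuning uses CV on the training portion only
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 3. The test set is touched exactly once, for the honest final estimate
print("final test R²:", round(search.score(X_test, y_test), 3))
```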