Model Selection — cross-validation, information criteria, and hyperparameter tuning
You've trained a linear regression, a Random Forest, and an XGBoost model on the same dataset. They all look good on training data. Which one will generalise to unseen data? You can't tell without a rigorous evaluation framework.
K-Fold cross-validation splits data into K equal folds. Train on K-1 folds, test on the remaining fold. Rotate until every fold has been the test set. Average the K scores. This uses all your data for both training and testing.
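A minimal sketch of this in scikit-learn, using a synthetic regression dataset purely for illustration — `cross_val_score` handles the train/test rotation and returns one score per fold:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, just to have something to split.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Trains on K-1 folds, scores on the held-out fold, K times.
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print("R² per fold:", np.round(scores, 3))
print(f"Mean CV R²: {scores.mean():.3f}")
```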
CV = (1/K) Σₖ score(model trained on D \ Dₖ, evaluated on Dₖ)
AIC = 2k − 2 ln(L̂)   |   BIC = k·ln(n) − 2 ln(L̂)
where k = number of parameters, L̂ = maximised likelihood, n = number of observations. Lower is better.
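A hedged sketch of computing AIC and BIC by hand for an ordinary least squares fit, using the Gaussian log-likelihood. Conventions for counting k vary (the error variance may or may not be included); this shows one common choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
n = len(y)

model = LinearRegression().fit(X, y)
rss = np.sum((y - model.predict(X)) ** 2)

# Maximised Gaussian log-likelihood for OLS: ln L̂ = -n/2 * (ln(2π) + ln(RSS/n) + 1)
log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
k = X.shape[1] + 2  # coefficients + intercept + error variance (one convention)

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")  # lower is better
```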
GridSearchCV — exhaustive search over every combination (slow). RandomizedSearchCV — samples random configurations (faster). Optuna/Hyperopt — Bayesian optimisation (most sample-efficient).
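A minimal sketch of randomised search; the parameter ranges and dataset are illustrative, not a recommendation:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,        # sample 20 random configurations instead of the full grid
    cv=5,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```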
Never compare models on the training set. 5-fold CV is the minimum; 10-fold for smaller datasets.
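To make the comparison fair, score every candidate on the same CV splits. A minimal sketch, again on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical splits for every model

for name, model in [("linear", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name:>13}: mean R² = {scores.mean():.3f} ± {scores.std():.3f}")
```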
Put all preprocessing inside a Pipeline so it runs inside the CV loop. Preprocessing fit on the full dataset leaks information from the test folds and inflates your scores.
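A minimal sketch of leakage-safe preprocessing: because the scaler lives inside the Pipeline, it is refit on each training fold only and then applied to that fold's test data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),  # fit on training folds only
    ("model", Ridge(alpha=1.0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"Mean CV R²: {scores.mean():.3f}")
```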
Hold out a final test set. Touch it exactly once — after all tuning is done. Its result is your honest performance estimate.
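One way to enforce this, sketched with an assumed Ridge model and a small illustrative alpha grid: split the test set off first, do all tuning with CV on the remainder, and score the chosen model exactly once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_train, y_train)  # all tuning happens on the training portion

print(f"Final test R²: {search.score(X_test, y_test):.3f}")  # touched exactly once
```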