Topic 07 · Phase 2 · Animated

Wisdom of Many Trees

Random Forest — bagging and random feature subsets make weak learners powerful

1 · The Problem

A single decision tree overfits easily and changes dramatically with small data variations. Can we combine many imperfect trees to get a stable, accurate model?

One Tree: high variance · overfits easily · unstable
100 Trees → Vote: low variance · robust · accurate
"A forest of weak learners, each seeing a different random slice of the data, collectively becomes a strong learner."
2 · The Intuition

Two sources of randomness reduce the correlation between trees, making their errors more nearly independent and therefore likelier to cancel when aggregated (a from-scratch sketch follows this list):

1. Bootstrap Sampling (Bagging)
Each tree is trained on a different bootstrap sample — sample n rows with replacement from the original data. ~63% unique rows per tree.
2. Random Feature Subsets
At each split, only consider √p (classification) or p/3 (regression) randomly chosen features. Prevents all trees from relying on the same dominant feature.
3. Majority Vote / Average
Classification: take the mode of all tree predictions. Regression: take the mean. Individual errors cancel out.
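A minimal from-scratch sketch of these three steps, assuming scikit-learn is available and that X and y are a preloaded feature matrix and integer label vector (fit_forest / predict_forest are illustrative names, not a standard API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Illustrative random forest: bootstrapped rows + random feature subsets per split."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # 1. bootstrap: n rows drawn with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",                    # 2. only √p candidate features at each split
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (B, n_samples)
    # 3. majority vote per sample (mode of the B predictions)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```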
3 · The Math

Bagging (Bootstrap Aggregating)
ŷ(x) = (1/B) Σᵦ fᵦ(x), where fᵦ is tree b and B is the number of trees (regression; classification replaces the average with a majority vote)
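As a made-up worked instance of the formula with B = 3 trees:

```python
import numpy as np

tree_preds = np.array([4.2, 3.8, 4.6])   # hypothetical f₁(x), f₂(x), f₃(x)
y_hat = tree_preds.mean()                # (4.2 + 3.8 + 4.6) / 3 = 4.2
```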
OOB Error (Out-of-Bag)

The ~37% of rows left out of each bootstrap sample act as a free validation set. OOB error is a nearly unbiased estimate of generalisation error, so no separate test split is needed during training.
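A sketch of reading this estimate straight from scikit-learn, assuming X and y are already loaded:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")   # generalisation estimate, no held-out split used
```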

Feature Importance
FI(j) = (1/B) Σᵦ Σ_{splits on j in tree b} (weighted impurity decrease), typically normalised so that all importances sum to 1
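Continuing the OOB sketch above, scikit-learn exposes these impurity-based importances already normalised to sum to 1; feature_names is a hypothetical list of column names:

```python
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))   # top features by total impurity decrease
```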
4 · Assumptions & Pitfalls

Not interpretable. 500 trees can't be written as human-readable rules. Feature importance is a proxy, not a full explanation.
Slow prediction. Every prediction runs through all B trees. For real-time, latency-sensitive systems, consider a smaller B or a gradient-boosted alternative such as XGBoost.
Still needs informative features. If no feature carries signal, even 1,000 trees won't help. Feature engineering still matters.
5 · When to Use

Strengths

  • High accuracy with minimal tuning
  • Free OOB validation
  • Built-in feature importance
  • Robust to outliers; missing-value handling depends on the implementation

Limitations

  • Not interpretable
  • Slow prediction at inference
  • Memory-intensive for large forests
  • Often outperformed by gradient boosting on structured/tabular data
Typical use cases: telecom churn · fraud detection · genomics feature selection · general tabular classification

Key Takeaways

Randomness Reduces Correlation

Bootstrap sampling and random feature subsets ensure trees make different errors — errors that cancel when averaged.

OOB = Free Validation

No need for a dedicated test split during training — out-of-bag samples provide an unbiased estimate of generalisation error.

Strong Baseline

Random Forest with default hyperparameters often outperforms carefully tuned linear models. It's the first non-linear model to try.
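A minimal baseline sketch of this habit, assuming a preloaded feature matrix X and labels y: fit a default-parameter forest next to a plain logistic regression and compare cross-validated accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rf_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
lr_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"Random Forest (defaults): {rf_acc:.3f} | Logistic Regression: {lr_acc:.3f}")
```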