[Diagram: example decision tree splitting on Age ≤ 35?, Income ≤ 50k?, and Usage ≤ 10?, with leaves predicting Stay or Churn]
Topic 06 · Phase 2 · Interactive

Twenty Questions for Data

Decision Trees — recursive splitting that mirrors human decision-making

1

The Problem

A hospital wants to predict heart disease risk from patient data (age, chest pain, cholesterol). Doctors need an explainable model — not a black box. Decision trees produce human-readable rules.

What we need

  • Classify patients into risk categories
  • Handle numerical and categorical features
  • Produce rules a doctor can explain
  • No feature scaling required
Typical rule generated:
if age > 55 AND
chest_pain = typical AND
chol > 240:
  → HIGH RISK
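In scikit-learn, a sketch of how such rules come out of a fitted tree (the toy numbers and feature names below are made up for illustration): DecisionTreeClassifier learns the splits and export_text prints them as nested if/then rules.

# Sketch: fit a small tree and print its human-readable rules (scikit-learn).
# The patient data is a hypothetical toy sample, not a real dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: age, chest_pain (1 = typical, 0 = other), cholesterol
X = np.array([[63, 1, 280], [45, 0, 200], [58, 1, 250],
              [39, 0, 180], [67, 1, 300], [50, 0, 210]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = high risk, 0 = low risk

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "chest_pain", "chol"]))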
"A decision tree is essentially the game of 20 Questions played with features."
2

The Intuition

At each node, the tree asks: "Which feature and threshold, if I split on it right now, creates the purest possible child groups?" It greedily picks the best split, then recurses on each child.

Impure Node
Mixed classes — hard to make a decision
Pure Leaf
All examples of one class — decision made

Interactive Demo — Classify a Patient

Answer each question to traverse the decision tree.

3

The Math

Gini Impurity
Gini = 1 − Σ pᵢ²

0 = perfectly pure, 0.5 = maximally impure (2 classes)
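A quick check of the formula in plain Python, estimating the pᵢ from class counts:

# Gini impurity: 1 minus the sum of squared class proportions.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["low"] * 10))                 # 0.0  -> perfectly pure
print(gini(["low"] * 5 + ["high"] * 5))   # 0.5  -> maximally impure (2 classes)
print(gini(["low"] * 8 + ["high"] * 2))   # 0.32 -> mostly pure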

Information Gain (Entropy)
H = − Σ pᵢ log₂ pᵢ
IG = H(parent) − Σ (|child|/|parent|) · H(child)
Split Selection

At each node, try all features and all thresholds. Pick the split that maximises Information Gain or minimises Gini. Greedy — no backtracking.
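A sketch of that search on a single numeric feature, using entropy-based information gain (swapping in the Gini function above changes nothing structurally; the cholesterol numbers are hypothetical):

# Greedy split selection: scan every candidate threshold and keep the one
# with the highest information gain. Illustrative sketch, any label set works.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

def best_threshold(values, labels):
    best_gain, best_t = 0.0, None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

chol = [180, 200, 210, 250, 280, 300]          # hypothetical cholesterol values
risk = ["low", "low", "low", "high", "high", "high"]
print(best_threshold(chol, risk))              # (210, 1.0): a perfectly clean split

A full tree builder repeats this scan over every feature at every node, then recurses on the two children.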

4

Assumptions & Pitfalls

Overfitting. Deep trees memorise training data. Control with max_depth, min_samples_leaf, and pruning.
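One way to apply those controls in scikit-learn (a sketch on synthetic data; ccp_alpha turns on cost-complexity pruning):

# Sketch: compare an unconstrained tree with a depth- and leaf-limited,
# cost-complexity-pruned tree by cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0)   # grows until leaves are pure
limited = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                 ccp_alpha=0.01, random_state=0)

for name, model in [("unpruned", unpruned), ("limited", limited)]:
    print(name, "mean CV accuracy:", round(cross_val_score(model, X, y, cv=5).mean(), 3))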
Greedy splits are locally optimal. A split that looks bad now might enable better splits later. Trees can miss global optima.
Unstable to small changes. Slight data changes can yield very different trees. Random Forest solves this.
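The instability is easy to see by refitting on bootstrap resamples and checking the root split each time; a Random Forest is this resampling done many times over, with the trees' votes averaged (sketch, scikit-learn):

# Sketch: the root split can change entirely under small resampling of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

for trial in range(3):
    idx = rng.choice(len(X), size=len(X), replace=True)      # bootstrap resample
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(f"trial {trial}: root feature {tree.tree_.feature[0]}, "
          f"threshold {tree.tree_.threshold[0]:.2f}")

# Bagging 100 such trees and averaging their votes smooths out this variance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)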
5

When to Use

Strengths

  • Fully interpretable — generates rules
  • No feature scaling needed
  • Handles mixed data types
  • Captures non-linear boundaries

Limitations

  • Overfits easily without pruning
  • High variance — small data changes → big tree changes
  • Not great for regression (step function output)
  • Biased towards high-cardinality features
Typical uses: medical diagnosis rules · credit scoring · fraud rules · base learner for ensembles

Key Takeaways

Splits by Purity

Every split is chosen to maximise information gain or minimise Gini impurity — a measure of how mixed the classes are.

Prune to Generalise

max_depth and min_samples_leaf are the most important hyperparameters. Start shallow and add depth only if validation improves.
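A sketch of that procedure: hold out a validation split and increase max_depth only while the validation score keeps improving (synthetic data, scikit-learn):

# Sketch: pick max_depth by validation score rather than training fit.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_score = None, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5, random_state=0)
    score = model.fit(X_tr, y_tr).score(X_val, y_val)
    print(f"max_depth={depth:2d}  validation accuracy={score:.3f}")
    if score > best_score:
        best_depth, best_score = depth, score

print("chosen depth:", best_depth)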

Building Block of Ensembles

A single tree is weak but interpretable. Combine 100 trees in Random Forest or Boosting for state-of-the-art accuracy.
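As a rough sketch of that trade-off on synthetic data (scikit-learn; exact scores will vary), compare a single shallow tree against 100-tree forest and boosting ensembles:

# Sketch: single tree vs. 100-tree ensembles, scored by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    print(f"{name:18s} mean CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")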