Topic 09 · Phase 2

Maximum Margin Classifier

Support Vector Machines — find the hyperplane with the widest possible margin

1 · The Problem

An email spam classifier needs to separate spam from ham in a high-dimensional feature space (word frequencies). We want a boundary that generalises well — not just one that fits training data.

Narrow margin — close to the data; risky generalisation
Maximum margin — furthest from both classes; best generalisation
"SVM doesn't just find any boundary — it finds the widest road between the classes."
2 · The Intuition

Imagine the two classes as two groups of points. Draw parallel lanes (margins) around the dividing hyperplane. Widen the lanes as much as possible while keeping them empty. The points on the lane edges are the support vectors.

Support Vectors
Points on the margin boundary — all others don't matter
Hyperplane
Decision boundary w·x + b = 0
Kernel Trick
Compute in higher dimensions implicitly — the feature map is never materialised
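The three ideas above can be seen directly in code. A minimal sketch, assuming scikit-learn is installed, fitting a hard-margin linear SVM on a hypothetical toy dataset and inspecting the hyperplane and support vectors:

```python
# Minimal sketch: fit a linear SVM on a toy 2-D dataset and inspect
# the hyperplane (w, b) and the support vectors.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ≈ hard margin
clf.fit(X, y)

# The decision boundary is w·x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
# Only a few training points end up on the margin boundary
print("support vectors:\n", clf.support_vectors_)
```

Note that only a subset of the six training points appear in `support_vectors_` — the rest could be deleted without changing the boundary.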
3 · The Math

Optimisation Objective
Maximise 2/‖w‖ ⟺ Minimise ½‖w‖²

Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i

Soft Margin (C parameter)
Min ½‖w‖² + C Σᵢ ξᵢ

Subject to: yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i

C controls the trade-off between margin width and misclassification penalty. High C = narrow margin, fewer training errors (risk of overfitting); low C = wider margin, more slack.
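The effect of C can be measured: the margin width is 2/‖w‖, so refitting with different C values shows the trade-off. A sketch on hypothetical overlapping Gaussian clusters, assuming scikit-learn:

```python
# Sketch: margin width 2/‖w‖ shrinks as C grows, because slack
# violations are penalised more heavily.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping clusters, so some slack is unavoidable
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

margins = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)  # margin width
    print(f"C={C}: margin width = {margins[C]:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

Low C produces a wide margin with many margin violations; high C produces a narrow margin that fits the training data more tightly.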

Common Kernels

Linear — K(x,z) = xᵀz  |  RBF — exp(−γ‖x−z‖²)  |  Polynomial — (xᵀz + c)ᵈ
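Each kernel in the table is a one-line formula; a sketch writing them directly in NumPy and cross-checking against scikit-learn's pairwise implementations (γ, c, d values are arbitrary illustrations):

```python
# Sketch: the three kernels from the table, written out in NumPy.
import numpy as np

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 0.5]])
gamma, c, d = 0.5, 1.0, 3  # illustrative hyperparameters

lin = (x @ z.T).item()                        # K(x,z) = xᵀz
rbf = np.exp(-gamma * np.sum((x - z) ** 2))   # exp(−γ‖x−z‖²)
poly = ((x @ z.T).item() + c) ** d            # (xᵀz + c)ᵈ

print(lin, rbf, poly)
```

Here xᵀz = 4, so the linear kernel gives 4 and the polynomial kernel (4 + 1)³ = 125; these match `sklearn.metrics.pairwise.linear_kernel`, `rbf_kernel`, and `polynomial_kernel` with the same parameters.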

4 · Assumptions & Pitfalls

Slow on large datasets. SVMs scale O(n²–n³) with training size. Not suitable for millions of examples.
Feature scaling is essential. RBF kernel is distance-based — always standardise features before fitting.
Kernel choice and C/γ tuning. RBF with default C=1, γ='scale' is a reasonable start. Grid-search C and γ on validation data.
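The scaling and tuning advice above combines into one workflow: put the scaler and the SVM in a pipeline so the scaler is fit only on each training fold, then grid-search C and γ. A sketch assuming scikit-learn, with the built-in breast-cancer dataset as a stand-in for the reader's own data:

```python
# Sketch: standardise features, then grid-search C and gamma for an
# RBF SVM. The pipeline ensures scaling is learned per CV fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100],      # margin/penalty trade-off
                "svc__gamma": ["scale", 0.01, 0.1]},  # RBF width
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print(f"cv accuracy: {grid.best_score_:.3f}")
```

The grid here is deliberately coarse; in practice C and γ are usually searched on log-spaced grids.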
5 · When to Use

Strengths

  • Effective in high-dimensional spaces
  • Works well with small datasets
  • Kernel trick handles non-linear boundaries
  • Memory-efficient — only support vectors stored

Limitations

  • Slow on large training sets
  • No probability outputs by default
  • Kernel and C/γ selection non-trivial
  • Multiclass needs one-vs-one or one-vs-rest decomposition

Typical applications: text classification, image recognition (small datasets), bioinformatics, anomaly detection.

Key Takeaways


Maximise the Margin

SVM's entire objective is to find the boundary with the widest gap between classes — a wider margin means better generalisation.

Kernel Trick is Free

The kernel computes dot products in high-dimensional space without ever materialising that space — computationally cheap.

Only Support Vectors Matter

Once trained, predictions depend only on the support vectors — the boundary points. All other training examples are irrelevant.
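This takeaway can be verified numerically: the trained decision function is f(x) = Σᵢ αᵢyᵢ K(xᵢ, x) + b, summed over support vectors only. A sketch assuming scikit-learn, where `dual_coef_` holds the αᵢyᵢ products:

```python
# Sketch: rebuild the decision function from support vectors alone
# and check it against scikit-learn's decision_function.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

x_new = np.array([[0.0, 0.0]])
# K(xᵢ, x) for the support vectors only — no other training point appears
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.1)
# dual_coef_ stores αᵢyᵢ; add the bias b (intercept_)
f_manual = (clf.dual_coef_ @ K)[0, 0] + clf.intercept_[0]

print(f_manual, clf.decision_function(x_new)[0])
```

The two values agree to floating-point precision, confirming that every non-support-vector training example is irrelevant at prediction time.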