Topic 09 · Phase 2

Maximum Margin Classifier

Support Vector Machines — find the hyperplane with the widest possible margin

1 · The Problem

An email spam classifier needs to separate spam from ham in a high-dimensional feature space (word frequencies). We want a boundary that generalises well — not just one that fits training data.

Narrow margin — close to the data; risky generalisation
Maximum margin — furthest from both classes; best generalisation
"SVM doesn't just find any boundary — it finds the widest road between the classes."
2 · The Intuition

Imagine the two classes as two groups of points. Draw parallel lanes (margins) around the dividing hyperplane. Widen the lanes as much as possible while keeping them empty. The points on the lane edges are the support vectors.

Support Vectors
Points on the margin boundary — all others don't matter
Hyperplane
Decision boundary w·x + b = 0
Kernel Trick
Compute in higher dimensions implicitly — the feature map is never materialised
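The three ideas above can be seen directly in code. A minimal sketch, assuming scikit-learn is installed, fitting a hard-margin linear SVM on a hypothetical toy dataset and inspecting the hyperplane and support vectors:

```python
# Minimal sketch: fit a linear SVM on a toy 2-D dataset and inspect
# the hyperplane (w, b) and the support vectors.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ≈ hard margin
clf.fit(X, y)

# The decision boundary is w·x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
# Only a few training points end up on the margin boundary
print("support vectors:\n", clf.support_vectors_)
```

Note that only a subset of the six training points appear in `support_vectors_` — the rest could be deleted without changing the boundary.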
3 · The Math

Optimisation Objective
Maximise 2/‖w‖ ⟺ Minimise ½‖w‖²

Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i

Soft Margin (C parameter)
Min ½‖w‖² + C Σᵢ ξᵢ

Subject to: yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i

C controls the trade-off between margin width and misclassification penalty. High C = narrow margin, fewer training errors (risk of overfitting); low C = wider margin, more slack.
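The effect of C can be measured: the margin width is 2/‖w‖, so refitting with different C values shows the trade-off. A sketch on hypothetical overlapping Gaussian clusters, assuming scikit-learn:

```python
# Sketch: margin width 2/‖w‖ shrinks as C grows, because slack
# violations are penalised more heavily.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping clusters, so some slack is unavoidable
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

margins = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)  # margin width
    print(f"C={C}: margin width = {margins[C]:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```

Low C produces a wide margin with many margin violations; high C produces a narrow margin that fits the training data more tightly.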

Common Kernels

Linear — K(x,z) = xᵀz  |  RBF — exp(−γ‖x−z‖²)  |  Polynomial — (xᵀz + c)ᵈ
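Each kernel in the table is a one-line formula; a sketch writing them directly in NumPy and cross-checking against scikit-learn's pairwise implementations (γ, c, d values are arbitrary illustrations):

```python
# Sketch: the three kernels from the table, written out in NumPy.
import numpy as np

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 0.5]])
gamma, c, d = 0.5, 1.0, 3  # illustrative hyperparameters

lin = (x @ z.T).item()                        # K(x,z) = xᵀz
rbf = np.exp(-gamma * np.sum((x - z) ** 2))   # exp(−γ‖x−z‖²)
poly = ((x @ z.T).item() + c) ** d            # (xᵀz + c)ᵈ

print(lin, rbf, poly)
```

Here xᵀz = 4, so the linear kernel gives 4 and the polynomial kernel (4 + 1)³ = 125; these match `sklearn.metrics.pairwise.linear_kernel`, `rbf_kernel`, and `polynomial_kernel` with the same parameters.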

4 · Assumptions & Pitfalls

Slow on large datasets. SVMs scale O(n²–n³) with training size. Not suitable for millions of examples.
Feature scaling is essential. RBF kernel is distance-based — always standardise features before fitting.
Kernel choice and C/γ tuning. RBF with default C=1, γ='scale' is a reasonable start. Grid-search C and γ on validation data.
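The scaling and tuning advice above combines into one workflow: put the scaler and the SVM in a pipeline so the scaler is fit only on each training fold, then grid-search C and γ. A sketch assuming scikit-learn, with the built-in breast-cancer dataset as a stand-in for the reader's own data:

```python
# Sketch: standardise features, then grid-search C and gamma for an
# RBF SVM. The pipeline ensures scaling is learned per CV fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10, 100],      # margin/penalty trade-off
                "svc__gamma": ["scale", 0.01, 0.1]},  # RBF width
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print(f"cv accuracy: {grid.best_score_:.3f}")
```

The grid here is deliberately coarse; in practice C and γ are usually searched on log-spaced grids.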
5 · When to Use

Strengths

  • Effective in high-dimensional spaces
  • Works well with small datasets
  • Kernel trick handles non-linear boundaries
  • Memory-efficient — only support vectors stored

Limitations

  • Slow on large training sets
  • No probability outputs by default
  • Kernel and C/γ selection non-trivial
  • Multiclass needs one-vs-one or one-vs-rest decomposition

Typical applications: text classification, image recognition (small datasets), bioinformatics, anomaly detection.

Key Takeaways


Maximise the Margin

SVM's entire objective is to find the boundary with the widest gap between classes — a wider margin means better generalisation.

Kernel Trick is Free

The kernel computes dot products in high-dimensional space without ever materialising that space — computationally cheap.

Only Support Vectors Matter

Once trained, predictions depend only on the support vectors — the boundary points. All other training examples are irrelevant.
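This takeaway can be verified numerically: the trained decision function is f(x) = Σᵢ αᵢyᵢ K(xᵢ, x) + b, summed over support vectors only. A sketch assuming scikit-learn, where `dual_coef_` holds the αᵢyᵢ products:

```python
# Sketch: rebuild the decision function from support vectors alone
# and check it against scikit-learn's decision_function.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

x_new = np.array([[0.0, 0.0]])
# K(xᵢ, x) for the support vectors only — no other training point appears
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.1)
# dual_coef_ stores αᵢyᵢ; add the bias b (intercept_)
f_manual = (clf.dual_coef_ @ K)[0, 0] + clf.intercept_[0]

print(f_manual, clf.decision_function(x_new)[0])
```

The two values agree to floating-point precision, confirming that every non-support-vector training example is irrelevant at prediction time.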