Topic 10 · Phase 2 · Animated

Update Your Beliefs

Naive Bayes — probabilistic classification using Bayes' theorem

1 · The Problem

A news aggregator wants to classify articles into categories (politics, sports, tech) based on the words they contain. The feature space is huge (a 50,000-word vocabulary), and most models are too slow to deliver the instant predictions required.

The challenge

  • 50,000-dimensional feature space
  • Need instant predictions
  • Limited training data
Why Naive Bayes works here:
It assumes word independence, a heroic but useful simplification. Training reduces to counting word frequencies, which is O(n) in the number of tokens, and prediction is effectively instant (see the counting sketch below the quote).
"The 'naive' part means we ignore correlations between features. The model is wrong in theory but surprisingly right in practice."
2 · The Intuition

Start with a prior belief (base rate of each class). Then update that belief as you observe each word in the document. Each word shifts the probability up or down. After all words, pick the most probable class.

Live probability update (spam example): start with prior P(Spam) = 0.30; the word "free" has likelihoods P("free" | Spam) = 0.80 and P("free" | Ham) = 0.10. Observing "free" shifts the belief to

P(Spam | "free") = (0.80 × 0.30) / (0.80 × 0.30 + 0.10 × 0.70) = 0.24 / 0.31 ≈ 0.774

One word moves the posterior from 30% to roughly 77%.
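
The same update in code, a minimal sketch using only the numbers above (variable names are illustrative):

    # Single-word Bayes update for the spam example above.
    p_spam = 0.30                     # prior P(Spam); P(Ham) = 1 - p_spam = 0.70
    p_free_given_spam = 0.80          # P("free" | Spam)
    p_free_given_ham = 0.10           # P("free" | Ham)

    numerator = p_free_given_spam * p_spam
    denominator = numerator + p_free_given_ham * (1 - p_spam)
    posterior = numerator / denominator
    print(round(posterior, 3))        # 0.774
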
3 · The Math

Bayes' Theorem
P(C|X) = P(X|C) · P(C) / P(X)
Naive Independence Assumption
P(X|C) = Π P(xᵢ|C)  (features are conditionally independent given the class)
Decision Rule (log-space for stability)
ŷ = argmax_c [ log P(C) + Σᵢ log P(xᵢ|C) ]
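
A minimal sketch of this decision rule in code, assuming the log-priors and per-class log-likelihood tables have already been estimated (the names log_prior and log_likelihood are illustrative; unseen words are handled properly by the smoothing described in the next section, so the tiny floor below is only a placeholder):

    import math

    def predict(words, classes, log_prior, log_likelihood):
        """Return the class with the highest log-posterior score."""
        scores = {}
        for c in classes:
            score = log_prior[c]                                    # log P(C)
            for w in words:
                score += log_likelihood[c].get(w, math.log(1e-9))   # log P(w | C); placeholder floor for unseen words
            scores[c] = score
        return max(scores, key=scores.get)                          # argmax over classes
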
4 · Assumptions & Pitfalls

Independence assumption is almost never true. In "free money", the words "free" and "money" are correlated, yet the model treats them as independent. It works anyway because classification only needs the ranking of classes, not accurate probabilities.
Zero-frequency problem. A word never seen in training for a class gets P = 0, which zeroes out the entire product. Fix with Laplace (add-1) smoothing, sketched below.
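
A minimal sketch of add-1 smoothing, assuming word_counts[c] maps each word to its count in class c (as produced by a counting pass like the one in section 1; names are illustrative):

    import math

    def smoothed_log_likelihoods(word_counts, vocab):
        """Laplace (add-1) estimate: P(w | C) = (count(w, C) + 1) / (total(C) + |V|), stored as logs."""
        log_likelihood = {}
        for c, counts in word_counts.items():
            total = sum(counts.values())
            log_likelihood[c] = {
                w: math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
                for w in vocab
            }
        return log_likelihood
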
5 · When to Use

Strengths

  • Extremely fast to train and predict
  • Works well with small data
  • Scales to huge feature spaces
  • Handles multiclass naturally

Limitations

  • Independence assumption violated in practice
  • Probabilities are poorly calibrated
  • Continuous features need a distributional assumption (e.g. Gaussian)
  • Outperformed by modern NLP (transformers)
Typical applications: spam filtering, news classification, sentiment analysis, real-time text categorisation.
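
In practice a library implementation is usually the starting point. A minimal sketch using scikit-learn's MultinomialNB (the two training documents and labels are purely illustrative; alpha=1.0 is the add-1 smoothing from section 4):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["free money now", "quarterly earnings report call"]   # illustrative training texts
    labels = ["spam", "ham"]

    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(docs, labels)
    print(model.predict(["free cash offer"]))
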

Key Takeaways

Prior × Likelihood → Posterior

Bayes' theorem turns base rates and observed evidence into an updated belief. Each word is evidence that shifts the probability.

Surprisingly Effective Despite Being "Naive"

The independence assumption is wrong — but for many text tasks, Naive Bayes rivals much more complex models.

Always Use Laplace Smoothing

Add-1 smoothing prevents zero-probability collapse for unseen words. A tiny constant that makes the model robust.