Topic 10 · Phase 2 · Animated

Update Your Beliefs

Naive Bayes — probabilistic classification using Bayes' theorem

1 · The Problem

A news aggregator wants to classify articles into categories (politics, sports, tech) based on the words they contain. The feature space is huge (a 50,000-word vocabulary), and most models are too slow to deliver the instant predictions required.

The challenge

  • 50,000-dimensional feature space
  • Need instant predictions
  • Limited training data
Why Naive Bayes works here:
It assumes word independence, a heroic but useful simplification. Training reduces to counting word frequencies, which is O(n) in the number of tokens, and prediction is effectively instant (see the counting sketch below the quote).
"The 'naive' part means we ignore correlations between features. The model is wrong in theory but surprisingly right in practice."
2 · The Intuition

Start with a prior belief (base rate of each class). Then update that belief as you observe each word in the document. Each word shifts the probability up or down. After all words, pick the most probable class.

Live probability update (spam example): start with prior P(Spam) = 0.30; the word "free" has likelihoods P("free" | Spam) = 0.80 and P("free" | Ham) = 0.10. Observing "free" shifts the belief to

P(Spam | "free") = (0.80 × 0.30) / (0.80 × 0.30 + 0.10 × 0.70) = 0.24 / 0.31 ≈ 0.774

One word moves the posterior from 30% to roughly 77%.
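
The same update in code, a minimal sketch using only the numbers above (variable names are illustrative):

    # Single-word Bayes update for the spam example above.
    p_spam = 0.30                     # prior P(Spam); P(Ham) = 1 - p_spam = 0.70
    p_free_given_spam = 0.80          # P("free" | Spam)
    p_free_given_ham = 0.10           # P("free" | Ham)

    numerator = p_free_given_spam * p_spam
    denominator = numerator + p_free_given_ham * (1 - p_spam)
    posterior = numerator / denominator
    print(round(posterior, 3))        # 0.774
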
3 · The Math

Bayes' Theorem
P(C|X) = P(X|C) · P(C) / P(X)
Naive Independence Assumption
P(X|C) = Π P(xᵢ|C)  (features are conditionally independent given the class)
Decision Rule (log-space for stability)
ŷ = argmax_c [ log P(C) + Σᵢ log P(xᵢ|C) ]
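
A minimal sketch of this decision rule in code, assuming the log-priors and per-class log-likelihood tables have already been estimated (the names log_prior and log_likelihood are illustrative; unseen words are handled properly by the smoothing described in the next section, so the tiny floor below is only a placeholder):

    import math

    def predict(words, classes, log_prior, log_likelihood):
        """Return the class with the highest log-posterior score."""
        scores = {}
        for c in classes:
            score = log_prior[c]                                    # log P(C)
            for w in words:
                score += log_likelihood[c].get(w, math.log(1e-9))   # log P(w | C); placeholder floor for unseen words
            scores[c] = score
        return max(scores, key=scores.get)                          # argmax over classes
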
4 · Assumptions & Pitfalls

Independence assumption is almost never true. In "free money", the words "free" and "money" are correlated, yet the model treats them as independent. It works anyway because classification only needs the ranking of classes, not accurate probabilities.
Zero-frequency problem. A word never seen in training for a class gets P = 0, which zeroes out the entire product. Fix with Laplace (add-1) smoothing, sketched below.
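
A minimal sketch of add-1 smoothing, assuming word_counts[c] maps each word to its count in class c (as produced by a counting pass like the one in section 1; names are illustrative):

    import math

    def smoothed_log_likelihoods(word_counts, vocab):
        """Laplace (add-1) estimate: P(w | C) = (count(w, C) + 1) / (total(C) + |V|), stored as logs."""
        log_likelihood = {}
        for c, counts in word_counts.items():
            total = sum(counts.values())
            log_likelihood[c] = {
                w: math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
                for w in vocab
            }
        return log_likelihood
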
5 · When to Use

Strengths

  • Extremely fast to train and predict
  • Works well with small data
  • Scales to huge feature spaces
  • Handles multiclass naturally

Limitations

  • Independence assumption violated in practice
  • Probabilities are poorly calibrated
  • Continuous features need a distributional assumption (e.g. Gaussian)
  • Outperformed by modern NLP (transformers)
Typical applications: spam filtering, news classification, sentiment analysis, real-time text categorisation.
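
In practice a library implementation is usually the starting point. A minimal sketch using scikit-learn's MultinomialNB (the two training documents and labels are purely illustrative; alpha=1.0 is the add-1 smoothing from section 4):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["free money now", "quarterly earnings report call"]   # illustrative training texts
    labels = ["spam", "ham"]

    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(docs, labels)
    print(model.predict(["free cash offer"]))
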

Key Takeaways

Prior × Likelihood → Posterior

Bayes' theorem turns base rates and observed evidence into an updated belief. Each word is evidence that shifts the probability.

Surprisingly Effective Despite Being "Naive"

The independence assumption is wrong — but for many text tasks, Naive Bayes rivals much more complex models.

Always Use Laplace Smoothing

Add-1 smoothing prevents zero-probability collapse for unseen words. A tiny constant that makes the model robust.