Topic 11 · Phase 3 · Interactive

Find the Natural Groups

Clustering — unsupervised discovery of structure in unlabelled data

1

The Problem

An e-commerce company has 1 million customers but no labels. Can we find natural groups of similar customers — high-value loyal buyers, one-time discount hunters — without anyone telling us the groups exist?

Unsupervised challenge

  • No labels — only features
  • Unknown number of groups
  • No "correct" answer to evaluate against
Use RFM features: Recency, Frequency, Monetary value — classic clustering features for retail segmentation.
"Clustering reveals the hidden architecture of your data — groups that didn't have names until the algorithm found them."
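The RFM features mentioned above can be derived from a raw transaction log in a few lines of Python. This is a minimal sketch; the customer IDs, dates, and amounts are invented for illustration:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, order_date, order_value)
transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 1), 80.0),
    ("c2", date(2023, 11, 20), 25.0),
]
today = date(2024, 4, 1)

def rfm_features(transactions, today):
    """Aggregate raw transactions into (recency, frequency, monetary) per customer."""
    per_customer = {}
    for cid, d, value in transactions:
        dates, values = per_customer.setdefault(cid, ([], []))
        dates.append(d)
        values.append(value)
    return {
        cid: (
            (today - max(dates)).days,  # Recency: days since the last order
            len(dates),                 # Frequency: number of orders
            sum(values),                # Monetary: total spend
        )
        for cid, (dates, values) in per_customer.items()
    }

features = rfm_features(transactions, today)
# features["c1"] → (31, 2, 200.0)
```

The resulting three-column matrix (one row per customer) is what gets fed to the clustering algorithm.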
2

The Intuition

K-Means: place K centroids randomly, assign each point to nearest centroid, move each centroid to the mean of its points. Repeat until nothing moves. Simple, fast, effective.

1
Place K centroids at random positions
2
Assign each point to nearest centroid (Euclidean distance)
3
Move each centroid to the mean of its assigned points
4
Repeat 2–3 until centroids stop moving (convergence)
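The four steps above can be sketched as a small, dependency-free Python function (Lloyd's algorithm on 2-D points; the two-blob dataset is invented for illustration):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 2-D points: assign to nearest centroid, then re-centre."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - centroids[i][0])**2 + (y - centroids[i][1])**2)
            clusters[j].append((x, y))
        # step 3: move each centroid to the mean of its assigned points
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c
            else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:                   # step 4: converged, nothing moved
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; the centroids should land on the blob means
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

On this toy data the loop converges in a handful of iterations to centroids at roughly (0.33, 0.33) and (10.33, 10.33), the means of the two blobs.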
3

The Math

K-Means Objective (minimise within-cluster variance)
J = Σₖ Σₓ∈Cₖ ‖x − μₖ‖²
Silhouette Score (−1 to +1)
s = (b − a) / max(a, b)

a = mean intra-cluster distance, b = mean nearest-cluster distance
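The silhouette formula can be computed by hand for a single point. The two toy clusters below are assumptions for illustration, not data from the text:

```python
from math import dist

# Two hand-labelled clusters of 2-D points (toy data)
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(5.0, 5.0), (6.0, 5.0)],
}

def silhouette(point, own, clusters):
    """s = (b - a) / max(a, b) for a single point."""
    # a: mean distance to the other members of the point's own cluster
    others = [p for p in clusters[own] if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # b: mean distance to the nearest *other* cluster
    b = min(
        sum(dist(point, p) for p in pts) / len(pts)
        for label, pts in clusters.items() if label != own
    )
    return (b - a) / max(a, b)

s = silhouette((0.0, 0.0), "A", clusters)   # well inside cluster A, so s is close to +1
```

Averaging s over every point gives the overall silhouette score for a clustering.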

Elbow Method for K

Plot inertia (J) against K. The "elbow", the point where adding more clusters stops producing large drops in inertia, suggests a good K. It is a heuristic, not a guarantee.
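Assuming scikit-learn is available, the elbow sweep is a short loop over candidate K values; the three synthetic blobs below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight synthetic blobs of 50 points each
X = np.vstack([np.random.default_rng(0).normal(loc, 0.3, (50, 2))
               for loc in ((0, 0), (5, 5), (0, 5))])

# Inertia (the K-Means objective J) for each candidate K
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}
# Plotting inertias vs k should show a sharp bend (the "elbow") at k = 3 here
```

Inertia always decreases as K grows, which is why the bend, rather than the minimum, is the signal to look for.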

4

Assumptions & Pitfalls

K-Means assumes spherical clusters of roughly equal size. It performs poorly on elongated or non-convex clusters; use DBSCAN or a Gaussian mixture model (GMM) instead.
Sensitive to initialisation. Always use K-Means++ initialisation — it places centroids far apart initially, reducing bad local optima.
Scale your features. K-Means uses Euclidean distance. A feature measured in thousands will dominate. Always standardise first.
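Both fixes, standardisation and K-Means++ initialisation, can be combined in one pipeline. This is a sketch assuming scikit-learn, with a made-up RFM matrix whose monetary column would otherwise dominate the distance calculation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical RFM rows: [recency_days, frequency, monetary].
# Unscaled, the monetary column (tens to thousands) would swamp the others.
X = np.array([
    [10, 5, 2400.0],    # loyal buyer
    [200, 1, 35.0],     # discount hunter
    [15, 4, 1900.0],    # loyal buyer
    [180, 1, 60.0],     # discount hunter
])

model = make_pipeline(
    StandardScaler(),   # zero mean, unit variance per feature
    KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0),
)
labels = model.fit_predict(X)
# The two loyal buyers and the two discount hunters land in separate clusters
```

Putting the scaler inside the pipeline also guarantees new customers are scaled with the same statistics at prediction time.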
5

When to Use

Strengths

  • Fast and scalable (O(nkd) per iteration)
  • Simple and interpretable
  • Works well with compact spherical clusters
  • Good starting point for exploration

Limitations

  • Must specify K in advance
  • Assumes spherical clusters
  • Sensitive to outliers (they shift centroids)
  • Non-deterministic without a fixed seed
Typical applications: customer segmentation, document clustering, image compression, anomaly detection (as a pre-step).

Key Takeaways

Assign, Then Update

K-Means alternates between two simple steps until convergence. Neither step can increase the objective, so convergence is guaranteed, though only to a local optimum, not necessarily the global one.

Use Silhouette to Validate K

Silhouette score near +1 means well-separated clusters. Plot it vs K alongside the elbow curve before deciding.

DBSCAN for Arbitrary Shapes

When clusters aren't spherical, DBSCAN finds density-based clusters and labels outliers — no K needed.
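A minimal sketch of that behaviour, assuming scikit-learn and a synthetic ring-plus-outlier dataset (a shape K-Means cannot represent with one centroid):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense ring of 60 points plus one far-away outlier
theta = np.linspace(0, 2 * np.pi, 60, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]           # non-spherical cluster
X = np.vstack([ring, [[10.0, 10.0]]])

# eps: neighbourhood radius; min_samples: points needed to form a dense core
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
# All ring points share one cluster label; the outlier gets the noise label -1
```

Note that no K was supplied: DBSCAN inferred one cluster from density alone and flagged the stray point as noise.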