Topic 11 · Phase 3 · Interactive

Find the Natural Groups

Clustering — unsupervised discovery of structure in unlabelled data

1

The Problem

An e-commerce company has 1 million customers but no labels. Can we find natural groups of similar customers — high-value loyal buyers, one-time discount hunters — without anyone telling us the groups exist?

Unsupervised challenge

  • No labels — only features
  • Unknown number of groups
  • No "correct" answer to evaluate against
Use RFM features: Recency, Frequency, Monetary value — classic clustering features for retail segmentation.
"Clustering reveals the hidden architecture of your data — groups that didn't have names until the algorithm found them."
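The RFM features mentioned above can be derived from a raw transaction log in a few lines of Python. This is a minimal sketch; the customer IDs, dates, and amounts are invented for illustration:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, order_date, order_value)
transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 1), 80.0),
    ("c2", date(2023, 11, 20), 25.0),
]
today = date(2024, 4, 1)

def rfm_features(transactions, today):
    """Aggregate raw transactions into (recency, frequency, monetary) per customer."""
    per_customer = {}
    for cid, d, value in transactions:
        dates, values = per_customer.setdefault(cid, ([], []))
        dates.append(d)
        values.append(value)
    return {
        cid: (
            (today - max(dates)).days,  # Recency: days since the last order
            len(dates),                 # Frequency: number of orders
            sum(values),                # Monetary: total spend
        )
        for cid, (dates, values) in per_customer.items()
    }

features = rfm_features(transactions, today)
# features["c1"] → (31, 2, 200.0)
```

The resulting three-column matrix (one row per customer) is what gets fed to the clustering algorithm.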
2

The Intuition

K-Means: place K centroids randomly, assign each point to nearest centroid, move each centroid to the mean of its points. Repeat until nothing moves. Simple, fast, effective.

1
Place K centroids at random positions
2
Assign each point to nearest centroid (Euclidean distance)
3
Move each centroid to the mean of its assigned points
4
Repeat 2–3 until centroids stop moving (convergence)
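The four steps above can be sketched as a small, dependency-free Python function (Lloyd's algorithm on 2-D points; the two-blob dataset is invented for illustration):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 2-D points: assign to nearest centroid, then re-centre."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - centroids[i][0])**2 + (y - centroids[i][1])**2)
            clusters[j].append((x, y))
        # step 3: move each centroid to the mean of its assigned points
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c
            else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:                   # step 4: converged, nothing moved
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; the centroids should land on the blob means
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

On this toy data the loop converges in a handful of iterations to centroids at roughly (0.33, 0.33) and (10.33, 10.33), the means of the two blobs.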
3

The Math

K-Means Objective (minimise within-cluster variance)
J = Σₖ Σₓ∈Cₖ ‖x − μₖ‖²
Silhouette Score (−1 to +1)
s = (b − a) / max(a, b)

a = mean intra-cluster distance, b = mean nearest-cluster distance
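The silhouette formula can be computed by hand for a single point. The two toy clusters below are assumptions for illustration, not data from the text:

```python
from math import dist

# Two hand-labelled clusters of 2-D points (toy data)
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(5.0, 5.0), (6.0, 5.0)],
}

def silhouette(point, own, clusters):
    """s = (b - a) / max(a, b) for a single point."""
    # a: mean distance to the other members of the point's own cluster
    others = [p for p in clusters[own] if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # b: mean distance to the nearest *other* cluster
    b = min(
        sum(dist(point, p) for p in pts) / len(pts)
        for label, pts in clusters.items() if label != own
    )
    return (b - a) / max(a, b)

s = silhouette((0.0, 0.0), "A", clusters)   # well inside cluster A, so s is close to +1
```

Averaging s over every point gives the overall silhouette score for a clustering.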

Elbow Method for K

Plot inertia (J) against K. The "elbow", the point where adding more clusters stops producing large drops in inertia, suggests a good K. It is a heuristic, not a guarantee.
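Assuming scikit-learn is available, the elbow sweep is a short loop over candidate K values; the three synthetic blobs below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight synthetic blobs of 50 points each
X = np.vstack([np.random.default_rng(0).normal(loc, 0.3, (50, 2))
               for loc in ((0, 0), (5, 5), (0, 5))])

# Inertia (the K-Means objective J) for each candidate K
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}
# Plotting inertias vs k should show a sharp bend (the "elbow") at k = 3 here
```

Inertia always decreases as K grows, which is why the bend, rather than the minimum, is the signal to look for.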

4

Assumptions & Pitfalls

K-Means assumes spherical clusters of roughly equal size. It performs poorly on elongated or non-convex clusters; use DBSCAN or a Gaussian mixture model (GMM) instead.
Sensitive to initialisation. Always use K-Means++ initialisation — it places centroids far apart initially, reducing bad local optima.
Scale your features. K-Means uses Euclidean distance. A feature measured in thousands will dominate. Always standardise first.
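Both fixes, standardisation and K-Means++ initialisation, can be combined in one pipeline. This is a sketch assuming scikit-learn, with a made-up RFM matrix whose monetary column would otherwise dominate the distance calculation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical RFM rows: [recency_days, frequency, monetary].
# Unscaled, the monetary column (tens to thousands) would swamp the others.
X = np.array([
    [10, 5, 2400.0],    # loyal buyer
    [200, 1, 35.0],     # discount hunter
    [15, 4, 1900.0],    # loyal buyer
    [180, 1, 60.0],     # discount hunter
])

model = make_pipeline(
    StandardScaler(),   # zero mean, unit variance per feature
    KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0),
)
labels = model.fit_predict(X)
# The two loyal buyers and the two discount hunters land in separate clusters
```

Putting the scaler inside the pipeline also guarantees new customers are scaled with the same statistics at prediction time.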
5

When to Use

Strengths

  • Fast and scalable (O(nkd) per iteration)
  • Simple and interpretable
  • Works well with compact spherical clusters
  • Good starting point for exploration

Limitations

  • Must specify K in advance
  • Assumes spherical clusters
  • Sensitive to outliers (they shift centroids)
  • Non-deterministic without a fixed seed
Typical applications: customer segmentation, document clustering, image compression, anomaly detection (as a pre-step).

Key Takeaways

Assign, Then Update

K-Means alternates between two simple steps until convergence. Neither step can increase the objective, so convergence is guaranteed, though only to a local optimum, not necessarily the global one.

Use Silhouette to Validate K

Silhouette score near +1 means well-separated clusters. Plot it vs K alongside the elbow curve before deciding.

DBSCAN for Arbitrary Shapes

When clusters aren't spherical, DBSCAN finds density-based clusters and labels outliers — no K needed.
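A minimal sketch of that behaviour, assuming scikit-learn and a synthetic ring-plus-outlier dataset (a shape K-Means cannot represent with one centroid):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense ring of 60 points plus one far-away outlier
theta = np.linspace(0, 2 * np.pi, 60, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]           # non-spherical cluster
X = np.vstack([ring, [[10.0, 10.0]]])

# eps: neighbourhood radius; min_samples: points needed to form a dense core
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
# All ring points share one cluster label; the outlier gets the noise label -1
```

Note that no K was supplied: DBSCAN inferred one cluster from density alone and flagged the stray point as noise.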