Clustering — unsupervised discovery of structure in unlabelled data
An e-commerce company has 1 million customers but no labels. Can we find natural groups of similar customers — high-value loyal buyers, one-time discount hunters — without anyone telling us the groups exist?
K-Means: place K centroids at random, assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Repeat until no assignment changes. Simple, fast, and often effective.
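The loop described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production implementation: the names `kmeans` and `squared_distance`, the `seed` and `max_iter` parameters, the choice to initialise from randomly picked data points, and the toy `points` are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Run the assign/update loop; return (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialise by picking k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data: two small groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The result depends on the random starting centroids, so different `seed` values can give different clusterings of the same data.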
Inertia: J = Σₖ Σₓ∈Cₖ ‖x − μₖ‖²
Silhouette: s = (b − a) / max(a, b)
a = mean intra-cluster distance, b = mean nearest-cluster distance
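Both quantities can be computed directly from their definitions. The sketch below assumes Euclidean distance, takes b as the smallest mean distance to the points of any other cluster, and gives a point that is alone in its cluster a score of 0 (a common convention, not something the formulas above specify). The function names and the four 1-D points are made up so the numbers are easy to check by hand.

```python
import math

def inertia(points, labels, centroids):
    """J: sum of squared distances from each point to its own centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

def silhouette(i, points, labels):
    """s = (b - a) / max(a, b) for the point at index i (needs >= 2 clusters)."""
    def mean_dist_to(cluster):
        # Mean distance from point i to the other members of `cluster`.
        others = [q for j, (q, lab) in enumerate(zip(points, labels))
                  if lab == cluster and j != i]
        if not others:
            return None
        return sum(math.dist(points[i], q) for q in others) / len(others)

    a = mean_dist_to(labels[i])
    if a is None:  # assumption: a singleton cluster scores 0
        return 0.0
    b = min(mean_dist_to(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Hand-checkable 1-D example: two clusters, {0, 2} and {10, 12}.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
centroids = [(1.0,), (11.0,)]

J = inertia(points, labels, centroids)
mean_s = sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)
print(J, mean_s)
```

For the first point, a = 2 and b = (10 + 12) / 2 = 11, so s = 9/11. Averaging s over all points gives the single score that is usually plotted against K.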
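Plot inertia (J) vs K. Inertia tends to keep falling as K grows, so the lowest value is not the goal. Look for the "elbow", where the curve bends and extra clusters stop buying much improvement. It is a heuristic for a reasonable K, not a guarantee of the optimal one.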
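K-Means alternates between two simple steps, assignment and update, until convergence. Neither step can increase the objective J, so the algorithm is guaranteed to converge in a finite number of iterations, though only to a local minimum that depends on the initial centroids.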
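One way to watch this happen is to record J after every assignment step and every update step. The sketch below does that on eight toy points from two hand-picked starting positions. `lloyd_trace`, `sq_dist`, `objective`, and both sets of starting centroids are illustrative choices for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def objective(points, labels, centroids):
    """J for the current assignment and centroids."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def lloyd_trace(points, init_centroids, max_iter=50):
    """Run K-Means from the given centroids; return J after each half-step."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels, history = None, []
    for _ in range(max_iter):
        # Assignment step (ties go to the lower cluster index).
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        history.append(objective(points, labels, centroids))  # after assignment
        # Update step.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        history.append(objective(points, labels, centroids))  # after update
    return history

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Both starts place the two centroids inside the same group of points.
good_history = lloyd_trace(points, [(0, 0), (1, 1)])
bad_history = lloyd_trace(points, [(0, 1), (1, 0)])
print(good_history)
print(bad_history)
```

Each list is non-increasing, but the two runs stop at different values of J. The second start settles on a clustering that mixes the two groups, which is why K-Means is commonly run several times from different starting centroids.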
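The silhouette score ranges from −1 to +1. Values near +1 mean well-separated clusters, values near 0 mean overlapping clusters, and negative values suggest points assigned to the wrong cluster. Plot the mean score vs K alongside the elbow curve before deciding.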
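When clusters aren't spherical, DBSCAN finds density-based clusters of arbitrary shape and labels outliers as noise. No K is needed, but you do have to choose a neighbourhood radius (ε) and a minimum number of points per neighbourhood (MinPts).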
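A compact sketch of the DBSCAN idea follows, assuming Euclidean distance and a brute-force neighbour search that is only suitable for small inputs. The names `dbscan`, `eps`, `min_samples`, and `NOISE` are chosen for this example, and the toy data (an elongated line, a small square, and one far-away point) is invented for illustration.

```python
import math

NOISE = -1  # label used for outliers

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be reclaimed later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:  # j is also a core point: keep growing
                queue.extend(nbrs)
    return labels

line = [(float(x), 0.0) for x in range(10)]                  # elongated cluster
square = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0)]
outlier = [(50.0, 50.0)]
labels = dbscan(line + square + outlier, eps=1.5, min_samples=3)
print(labels)
```

The line is recovered as one cluster even though it is far from spherical, and the isolated point is labelled as noise rather than forced into a cluster. Results are sensitive to `eps` and `min_samples`, so those values usually need tuning for real data.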