Advantages and disadvantages of k-means

K-means is useful and efficient in many machine learning contexts, but has some distinct weaknesses.

Advantages of k-means

Relatively simple to implement.

Scales to large data sets.

Always converges.

Allows warm-starting the positions of centroids (see the sketch after this list).

Smoothly adapts to new examples.

Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
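
As a rough illustration of the first and fourth advantages, the sketch below fits k-means and then warm-starts the centroid positions from the previous run. Using scikit-learn's KMeans and NumPy here is an assumption for illustration; the course doesn't prescribe a library.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: three well-separated blobs of 100 examples each.
rng = np.random.default_rng(seed=0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# First run: standard k-means with several random restarts.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Warm start: pass the previous centroids as the initial positions, so a
# later run (for example, on refreshed data) needs only one initialization.
warm = KMeans(n_clusters=3, init=kmeans.cluster_centers_, n_init=1).fit(X)
print(warm.cluster_centers_)
```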

Generalizing k-means

A straightforward implementation of k-means can struggle with clusters of different densities and sizes. The left side of Figure 1 shows the clusters we'd expect to see, while the right side shows the clusters proposed by k-means.

Two graphs side by side: the first shows a dataset with somewhat obvious clusters; the second shows the odd grouping of examples produced by running k-means on that data.
Figure 1: Ungeneralized k-means example.

For better performance on imbalanced clusters like the ones shown in Figure 1, you can generalize, that is, adapt, k-means. Figure 2 shows three different datasets clustered with two different generalizations. The first plot shows k-means without generalization, while the second and third allow for clusters to vary in width.

Three graphs showing k-means without generalization, then k-means allowing for varying widths, then k-means allowing for varying widths across dimensions.
Figure 2: k-means clustering with and without generalization.

This course doesn't cover how to generalize k-means, but those interested should see Clustering – k-means Gaussian mixture models by Carlos Guestrin from Carnegie Mellon University.
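
For a concrete sense of what "allowing clusters to vary in width" can look like in practice, the sketch below fits a Gaussian mixture model alongside plain k-means. Using scikit-learn's GaussianMixture, and these particular covariance_type values, is my assumption rather than something the lecture prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# One tight cluster and one wide, elongated cluster.
rng = np.random.default_rng(seed=1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=(0.3, 0.3), size=(200, 2)),
    rng.normal(loc=(4, 0), scale=(2.0, 0.4), size=(200, 2)),
])

# Plain k-means: every cluster implicitly has the same width.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Generalization 1: each cluster gets its own single width (spherical covariance).
gmm_spherical = GaussianMixture(
    n_components=2, covariance_type="spherical", random_state=1).fit(X)

# Generalization 2: each cluster gets its own width per dimension (diagonal covariance).
gmm_diag = GaussianMixture(
    n_components=2, covariance_type="diag", random_state=1).fit(X)

print(gmm_diag.predict(X)[:10])
```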

Disadvantages of k-means

\(k\) must be chosen manually.

Results depend on initial values.

For low \(k\), you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As \(k\) increases, you need k-means seeding to pick better initial centroids. For a full discussion of k-means seeding, see "A Comparative Study of Efficient Initialization Methods for the K-means Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.
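
The snippet below sketches both mitigations: several random initializations with the lowest-inertia result kept, and k-means++ seeding. The scikit-learn names and parameters are my assumption; they aren't taken from the paper cited above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known number of clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=2)

# Low k: run k-means several times with random initial centroids and keep
# the run with the lowest inertia (sum of squared distances to centroids).
best = min(
    (KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
     for seed in range(10)),
    key=lambda model: model.inertia_,
)

# Larger k: k-means++ seeding spreads the initial centroids apart.
seeded = KMeans(n_clusters=4, init="k-means++", n_init=1, random_state=2).fit(X)

print(best.inertia_, seeded.inertia_)
```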

Difficulty clustering data of varying sizes and densities without generalization.

Difficulty clustering outliers.

Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
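
As a minimal sketch of the removing-or-clipping advice, the snippet below drops or clips extreme values before clustering. The z-score threshold of 3 and the 1st-99th percentile bounds are arbitrary illustrative choices, not recommendations from the source.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=3)
X = rng.normal(size=(500, 2))
X[:5] += 25  # A handful of extreme outliers.

# Option 1: remove examples more than 3 standard deviations from the mean
# in any feature.
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
X_filtered = X[(z_scores < 3).all(axis=1)]

# Option 2: clip each feature to its 1st-99th percentile range instead of
# dropping rows.
low, high = np.percentile(X, [1, 99], axis=0)
X_clipped = np.clip(X, low, high)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X_filtered)
print(kmeans.cluster_centers_)
```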

Difficulty scaling with number of dimensions.

As the number of dimensions in the data increases, a distance-based similarity measure converges to a constant value for any given pair of examples. Reduce dimensionality either by applying PCA to the feature data or by using spectral clustering to modify the clustering algorithm.

Curse of dimensionality and spectral clustering

In the three plots of Figure 3, notice how, as the number of dimensions increases, the standard deviation of the distance between examples shrinks relative to the mean distance between examples. This convergence means that k-means becomes less effective at distinguishing between examples as the dimensionality of the data increases. This is referred to as the curse of dimensionality.

Three plots that show how the standard deviation of distance between examples decreases as the number of dimensions increases
Figure 3: A demonstration of the curse of dimensionality. Each plot shows the pairwise distances between 200 random points.
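
The effect in Figure 3 is easy to reproduce numerically. The sketch below is my own construction, not the code behind the figure: it draws 200 random points at several dimensionalities and prints the ratio of the standard deviation to the mean of the pairwise distances, which shrinks as the number of dimensions grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(seed=4)

for dims in (2, 20, 200):
    points = rng.uniform(size=(200, dims))      # 200 random points in `dims` dimensions.
    distances = pdist(points)                   # All pairwise Euclidean distances.
    ratio = distances.std() / distances.mean()  # Spread relative to the typical distance.
    print(f"{dims:>4} dims: std/mean of pairwise distance = {ratio:.3f}")
```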

You can avoid this drop in performance by using spectral clustering, which adds pre-clustering steps to the algorithm. To perform spectral clustering:

  1. Reduce the dimensionality of feature data by using PCA.
  2. Project all data points into the lower-dimensional subspace.
  3. Cluster the data in this subspace using your chosen algorithm.
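
A minimal sketch of those three steps, assuming scikit-learn's PCA and KMeans (the course doesn't prescribe a library, and PCA's fit_transform covers both the reduction and the projection):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# High-dimensional data: 50 features, but the cluster structure is low-dimensional.
X, _ = make_blobs(n_samples=600, n_features=50, centers=5, random_state=5)

# Steps 1 and 2: reduce dimensionality with PCA and project the data into
# the lower-dimensional subspace.
X_reduced = PCA(n_components=5, random_state=5).fit_transform(X)

# Step 3: cluster in the reduced subspace using your chosen algorithm.
labels = KMeans(n_clusters=5, n_init=10, random_state=5).fit_predict(X_reduced)
print(labels[:20])
```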

See A Tutorial on Spectral Clustering by Ulrike von Luxburg for more information on spectral clustering.