Evaluating results

Because clustering is unsupervised, no ground truth is available to verify results, which makes it harder to assess quality. Moreover, real-world datasets rarely contain obvious clusters of examples like those shown in Figure 1.

Figure 1: An ideal data plot with three clear groups of points. Real-world data rarely looks like this.

Instead, real-world data often looks more like Figure 2, making it difficult to visually assess clustering quality.

Figure 2: A more realistic data plot of random-looking points.

However, there are heuristics and best practices that you can iteratively apply to improve the quality of your clustering. The following flowchart gives an overview of how to evaluate your clustering results. We'll expand upon each step.

Flowchart of the clustering verification process.

Step 1: Assess quality of clustering

First check that the clusters look as you'd expect, and that examples you consider similar to each other appear in the same cluster.

Then check these commonly used metrics (not an exhaustive list):

  • Cluster cardinality
  • Cluster magnitude
  • Downstream performance

Cluster cardinality

Cluster cardinality is the number of examples per cluster. Plot the cardinality of all clusters and investigate any that are major outliers. In Figure 3, this would be cluster 5.

Figure 3: Cardinality of several clusters. Cluster 5 is smaller than the rest.
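
As a concrete illustration, here's a minimal sketch of a cardinality plot using NumPy, Matplotlib, and scikit-learn's KMeans. The synthetic `X` and the choice of six clusters are placeholder assumptions, not part of the original example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder data; substitute your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=6, random_state=0)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

# Cardinality: the number of examples assigned to each cluster.
clusters, cardinality = np.unique(kmeans.labels_, return_counts=True)

plt.bar(clusters, cardinality)
plt.xlabel("Cluster")
plt.ylabel("Cardinality (number of examples)")
plt.show()
```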

Cluster magnitude

Cluster magnitude is the sum of distances from all examples in a cluster to the cluster's centroid. Plot cluster magnitude for all clusters and investigate outliers. In Figure 4, cluster 0 is an outlier.

Also consider looking at the maximum or average distance of examples from centroids, by cluster, to find outliers.

Figure 4: Magnitude of several clusters. Cluster 0 is much larger than the others.
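
Continuing the sketch above (and its assumed `X` and `kmeans`), one way to compute magnitude with NumPy is:

```python
import numpy as np
import matplotlib.pyplot as plt

# Distance from each example to its assigned cluster's centroid.
dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Magnitude: sum of those distances per cluster.
magnitude = np.bincount(kmeans.labels_, weights=dists)

plt.bar(np.arange(len(magnitude)), magnitude)
plt.xlabel("Cluster")
plt.ylabel("Magnitude (sum of distances to centroid)")
plt.show()

# The per-cluster average distance also surfaces outliers.
avg_dist = magnitude / np.bincount(kmeans.labels_)
```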

Magnitude versus cardinality

You may have noticed that higher cluster cardinality tends to correspond to higher cluster magnitude, which makes intuitive sense: the more points in a cluster (cardinality), the greater the probable sum of distances from those points to the centroid (magnitude). You can also identify anomalous clusters by looking for ones where the relationship between cardinality and magnitude is very different from that of the other clusters. In Figure 5, fitting a line to the plot of cardinality versus magnitude suggests that cluster 0 is anomalous. (Cluster 5 is also far from the line, but if cluster 0 were omitted, the new fitted line would pass much closer to cluster 5.)

Figure 5: Cardinality versus magnitude for the previously shown clusters. Cluster 0 is an outlier.
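
Continuing the same sketch, one way to flag clusters that deviate from the cardinality-magnitude trend is to fit a least-squares line through the origin and inspect the residuals:

```python
import numpy as np
import matplotlib.pyplot as plt

# Least-squares slope for a line through the origin.
slope = (cardinality @ magnitude) / (cardinality @ cardinality)
residuals = magnitude - slope * cardinality

plt.scatter(cardinality, magnitude)
plt.plot(cardinality, slope * cardinality)
plt.xlabel("Cardinality")
plt.ylabel("Magnitude")
plt.show()

# Clusters with the largest residuals deviate most from the overall trend.
print(np.argsort(-np.abs(residuals)))  # cluster indices, most anomalous first
```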

Downstream performance

Since clustering outputs are often used in downstream ML systems, see if downstream model performance improves when your clustering process changes. This offers a real-world evaluation of the quality of your clustering results, although it can be complex and expensive to conduct this kind of test.

Step 2: Reassess your similarity measure

Your clustering algorithm is only as good as your similarity measure. Make sure your similarity measure returns sensible results. A quick check is to identify pairs of examples known to be more or less similar. Calculate the similarity measure for each pair of examples, and compare your results to your knowledge: pairs of similar examples should have a higher similarity measure than pairs of dissimilar examples.
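
As a sketch of such a spot check, assuming the `X` from the earlier examples, cosine similarity as the measure, and hand-picked index pairs (all illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    # One possible measure; substitute whatever measure your pipeline uses.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical index pairs: choose examples you know to be alike or different.
known_similar = [(0, 1), (10, 11)]
known_dissimilar = [(0, 250), (10, 499)]

for i, j in known_similar + known_dissimilar:
    print(f"({i}, {j}): {cosine_similarity(X[i], X[j]):.3f}")
# Expect similar pairs to score consistently higher than dissimilar pairs.
```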

The examples you use to spot-check your similarity measure should be representative of the dataset, so you can be confident that your similarity measure holds for all your examples. The performance of your similarity measure, whether manual or supervised, must be consistent across your dataset. If your similarity measure is inconsistent for some examples, those examples won't be clustered with similar examples.

If you find examples with inaccurate similarity scores, then your similarity measure probably doesn't fully capture the feature data that distinguishes those examples. Experiment with your similarity measure until it returns more accurate and consistent results.

Step 3: Find the optimal number of clusters

k-means requires you to decide the number of clusters \(k\) beforehand. How do you determine an optimal \(k\)? Try running the algorithm with increasing values of \(k\) and note the sum of all cluster magnitudes. As \(k\) increases, clusters become smaller, and the total distance of points from centroids decreases. We can treat this total distance as a loss. Plot this distance against the number of clusters.
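
A minimal sketch of this sweep with scikit-learn, reusing the assumed `X` from the earlier examples. Note that `inertia_` sums *squared* distances to centroids, a standard stand-in for the loss described here:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 20)
losses = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: sum of squared distances of examples to their centroids,
    # closely related to the total cluster magnitude.
    losses.append(km.inertia_)

plt.plot(ks, losses, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Loss (total within-cluster distance)")
plt.show()
```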

As shown in Figure 6, above a certain \(k\) the reduction in loss becomes marginal as \(k\) increases. Consider using the \(k\) where the slope changes most sharply, a heuristic known as the elbow method. For the plot shown, the optimal \(k\) is approximately 11. If you prefer more granular clusters, use this plot to choose a higher \(k\).

Figure 6: Loss versus number of clusters. Loss decreases as the number of clusters increases until it levels off around 10 clusters.

Troubleshooting questions

If you discover issues in the course of your evaluation, reassess your data preparation steps and chosen similarity measure. Ask:

  • Is your data appropriately scaled? (See the sketch after this list.)
  • Is your similarity measure correct?
  • Is your algorithm performing semantically meaningful operations on the data?
  • Do your algorithm's assumptions match the data?
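
For the first question, here's a minimal scaling check with scikit-learn, assuming the feature matrix `X` from the earlier sketches:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Inspect per-feature ranges; wildly different scales let one feature
# dominate any distance-based similarity measure.
print("min:", X.min(axis=0), "max:", X.max(axis=0))

# One common fix: standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```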