Manual similarity measure

As just shown, k-means assigns points to their closest centroid. But what does "closest" mean?

To apply k-means to feature data, you will need to define a measure of similarity that combines all the feature data into a single numeric value, called a manual similarity measure.

Consider a shoe dataset. If that dataset has shoe size as its only feature, you can define the similarity of two shoes in terms of the difference between their sizes. The smaller the numerical difference between sizes, the greater the similarity between shoes.

If the shoe dataset has two numeric features, size and price, you can combine them into a single number representing similarity. First, scale the data so that both features are comparable:

  • Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.
  • Price (p): Price probably follows a Poisson distribution. Confirm this. If you have enough data, convert the data to quantiles and scale to \([0,1]\).
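The two scaling steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the function names `z_score` and `quantile_scale` and the sample values are hypothetical, and the quantile step is approximated by each value's empirical quantile (the fraction of values less than or equal to it), which lies in \((0,1]\).

```python
import statistics
from bisect import bisect_right

def z_score(values):
    """Normalize roughly Gaussian data to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

def quantile_scale(values):
    """Map each value to its empirical quantile: the fraction of values <= it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

# Hypothetical sample data, for illustration only.
sizes = [7, 8, 9, 10, 11]
prices = [40, 60, 80, 120, 400]

scaled_sizes = z_score(sizes)
scaled_prices = quantile_scale(prices)
print(scaled_sizes)
print(scaled_prices)
```

Note that the outlier price of 400 maps to 1.0 and does not compress the other prices toward 0, which is one reason quantiles are attractive for skewed data.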

Next, combine the two features by calculating the root mean squared error (RMSE). This rough measure of similarity is given by \(\sqrt{\frac{(s_i - s_j)^2+(p_i - p_j)^2}{2}}\).

For a simple example, calculate similarity for two shoes with US sizes 8 and 11, and prices 120 and 150. Since we don't have enough data to understand the distribution, we'll scale the data without normalizing or using quantiles.

| Action | Method |
| --- | --- |
| Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55. |
| Scale the price. | Divide 120 and 150 by the maximum price 150 to get 0.8 and 1. |
| Find the difference in size. | \(0.55 - 0.4 = 0.15\) |
| Find the difference in price. | \(1 - 0.8 = 0.2\) |
| Calculate the RMSE. | \(\sqrt{\frac{0.2^2+0.15^2}{2}} \approx 0.18\) |

Intuitively, your similarity measure should increase as examples become more similar. RMSE does the opposite: it decreases as examples become more similar. Make the measure follow your intuition by subtracting the RMSE from 1. Because both features are scaled to \([0,1]\), the RMSE also lies in \([0,1]\), so the result stays between 0 and 1.

\[\text{Similarity} = 1 - 0.18 = 0.82\]
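The worked example can be reproduced with a few lines of Python. This is a sketch under the example's own assumptions (a maximum shoe size of 20 and a maximum price of 150). The helper name `rmse` and the constant names are chosen here for illustration.

```python
import math

def rmse(a, b):
    """Root mean squared difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

MAX_SIZE = 20    # assumed maximum possible shoe size, as in the example
MAX_PRICE = 150  # maximum price among the two shoes

# Each shoe is a (scaled size, scaled price) pair.
shoe_a = (8 / MAX_SIZE, 120 / MAX_PRICE)    # (0.4, 0.8)
shoe_b = (11 / MAX_SIZE, 150 / MAX_PRICE)   # (0.55, 1.0)

distance = rmse(shoe_a, shoe_b)
similarity = 1 - distance
print(round(distance, 3), round(similarity, 3))  # prints: 0.177 0.823
```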

In general, you can prepare numerical data as described in Prepare data, then combine the data by using Euclidean distance.
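RMSE and Euclidean distance differ only by a constant factor, so they rank pairs of examples identically. For two examples \(x_i\) and \(x_j\) with \(n\) scaled features each (the feature index \(k\) is notation introduced here):

\[\text{RMSE}(x_i, x_j) = \sqrt{\frac{1}{n}\sum_{k=1}^{n}(x_{ik} - x_{jk})^2} = \frac{1}{\sqrt{n}}\sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}\]

The second square root is the Euclidean distance between \(x_i\) and \(x_j\). With \(n = 2\), \(x_{i1} = s_i\), and \(x_{i2} = p_i\), this reduces to the size and price formula used in the shoe example.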

What if that dataset included both shoe size and shoe color? Color is categorical data, which Machine Learning Crash Course covers in Working with categorical data. Categorical data is harder to combine with the numerical size data. It can be:

  • Single-valued (univalent), such as a car's color ("white" or "blue" but never both)
  • Multi-valued (multivalent), such as a movie's genre (a movie can be both "action" and "comedy," or only "action")

If univalent data matches, for example in the case of two pairs of blue shoes, the similarity between the examples is 1. Otherwise, similarity is 0.
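As a minimal sketch, the univalent rule is an exact-match test. The function name `univalent_similarity` is hypothetical, and the comparison assumes the category labels are already cleaned so that, for example, "Blue" and "blue" are not recorded as different values.

```python
def univalent_similarity(a, b):
    """Return 1.0 if two single-valued categories match, otherwise 0.0."""
    return 1.0 if a == b else 0.0

print(univalent_similarity("blue", "blue"))   # prints: 1.0
print(univalent_similarity("blue", "white"))  # prints: 0.0
```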

Multivalent data, like movie genres, is harder to work with. If there is a fixed set of movie genres, you can calculate similarity as the number of values the two examples share divided by the number of distinct values across both examples. This ratio is called Jaccard similarity. Example calculations of Jaccard similarity:

  • ["comedy", "action"] and ["comedy", "action"] = 1
  • ["comedy", "action"] and ["action"] = ½
  • ["comedy", "action"] and ["action", "drama"] = ⅓
  • ["comedy", "action"] and ["non-fiction", "biographical"] = 0
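The calculations above can be checked with a short Python sketch that treats each list of genres as a set. The function name `jaccard_similarity` is chosen for illustration, and returning 1.0 for two empty sets is an assumption made here, since the ratio is otherwise undefined in that case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(set_a & set_b) / len(union)

print(jaccard_similarity(["comedy", "action"], ["comedy", "action"]))  # 1.0
print(jaccard_similarity(["comedy", "action"], ["action"]))            # 0.5
print(jaccard_similarity(["comedy", "action"], ["action", "drama"]))   # one third
print(jaccard_similarity(["comedy", "action"],
                         ["non-fiction", "biographical"]))             # 0.0
```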

Jaccard similarity is not the only possible manual similarity measure for categorical data. Two other examples:

  • Postal codes can be converted into latitude and longitude before calculating Euclidean distance between them.
  • Color can be converted into numeric RGB values, with differences in values combined into Euclidean distance.
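The color case might look like the following sketch. It assumes 8-bit RGB triples, and the function name `color_similarity` and the specific color values are illustrative. Dividing by the largest possible distance (black to white) keeps the result in \([0,1]\), mirroring the "subtract from 1" step used for the shoe example. Euclidean distance in RGB space does not match human color perception exactly, so treat this as a rough measure.

```python
import math

# Largest possible distance between two 8-bit RGB colors (black to white).
MAX_RGB_DISTANCE = math.sqrt(3 * 255 ** 2)

def color_similarity(rgb_a, rgb_b):
    """Similarity in [0, 1] based on Euclidean distance between RGB triples."""
    distance = math.dist(rgb_a, rgb_b)  # requires Python 3.8 or later
    return 1 - distance / MAX_RGB_DISTANCE

blue = (0, 0, 255)
navy = (0, 0, 128)
white = (255, 255, 255)

print(round(color_similarity(blue, navy), 3))   # prints: 0.712
print(round(color_similarity(blue, white), 3))  # prints: 0.184
```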

See Working with categorical data for more.

In general, a manual similarity measure must directly correspond to actual similarity. If your chosen metric does not, then it isn't encoding the information you want it to encode.

Pre-process your data carefully before calculating a similarity measure. The examples on this page are simplified. Most real-world datasets are large and complex. As previously mentioned, quantiles are a good default choice for processing numeric data.

As the complexity of data increases, it becomes harder to create a manual similarity measure. In that situation, switch to a supervised similarity measure, where a supervised machine learning model calculates similarity. This will be discussed in more detail later.