Supervised similarity measure

Instead of comparing manually combined feature data, you can reduce the feature data to representations called embeddings, and then compare the embeddings. Embeddings are generated by training a supervised deep neural network (DNN) on the feature data itself. The embeddings map each example's feature data to a vector in an embedding space that typically has fewer dimensions than the feature data. Embeddings are discussed in the Embeddings module of Machine Learning Crash Course, and neural networks in the Neural nets module. Embedding vectors for similar examples, such as YouTube videos on similar topics watched by the same users, end up close together in the embedding space. A supervised similarity measure uses this "closeness" to quantify the similarity of pairs of examples.
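To make "closeness" concrete, here's a minimal sketch of scoring the similarity of embedding vectors with cosine similarity. The four-dimensional vectors are made up purely for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Returns 1.0 for vectors pointing the same way, -1.0 for opposite ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for three YouTube videos.
video_a = np.array([0.9, 0.1, 0.3, 0.4])
video_b = np.array([0.8, 0.2, 0.3, 0.5])    # similar topic to video_a
video_c = np.array([-0.1, 0.9, -0.4, 0.1])  # unrelated topic

print(cosine_similarity(video_a, video_b))  # ~0.99: embeddings are close
print(cosine_similarity(video_a, video_c))  # ~-0.08: embeddings are far apart
```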

Remember, we're discussing supervised learning only to create our similarity measure. The similarity measure, whether manual or supervised, is then used by an algorithm to perform unsupervised clustering.

Comparison of manual and supervised measures

This table describes when to use a manual or supervised similarity measure depending on your requirements.

Requirement | Manual | Supervised
Eliminates redundant information in correlated features? | No; you need to investigate any correlations between features yourself. | Yes; the DNN eliminates redundant information.
Gives insight into calculated similarities? | Yes. | No; embeddings cannot be deciphered.
Suitable for small datasets with few features? | Yes. | No; small datasets don't provide enough training data for a DNN.
Suitable for large datasets with many features? | No; manually eliminating redundant information from multiple features and then combining them is very difficult. | Yes; the DNN automatically eliminates redundant information and combines features.

Creating a supervised similarity measure

Here's an overview of the process to create a supervised similarity measure:

Figure 1: Steps to create a supervised similarity measure: input feature data; choose a DNN (autoencoder or predictor); extract embeddings; choose a measurement (dot product, cosine, or Euclidean distance).

This page discusses DNNs, while the following pages cover the remaining steps.

Choose DNN based on training labels

Reduce your feature data to lower-dimensional embeddings by training a DNN that uses the same feature data both as input and as the labels. For example, in the case of house data, the DNN would use the features—such as price, size, and postal code—to predict those features themselves.

Autoencoder

A DNN that learns embeddings of input data by predicting the input data itself is called an autoencoder. Because an autoencoder's hidden layers are smaller than the input and output layers, the autoencoder is forced to learn a compressed representation of the input feature data. Once the DNN is trained, extract the embeddings from the smallest hidden layer to calculate similarity.

Figure 2: Autoencoder architecture. The identical input and output layers are compressed, across five hidden layers, to three nodes in the middle layer.
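As an illustration, here's a minimal Keras sketch of an autoencoder. The random 10-feature matrix stands in for your preprocessed (scaled) feature data, and the layer sizes are assumptions chosen for the example, not recommendations:

```python
import numpy as np
from tensorflow import keras

# Toy stand-in for preprocessed feature data: 1,000 examples, 10 features.
features = np.random.rand(1000, 10).astype("float32")

# The encoder compresses the input to a 3-dimensional embedding...
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(8, activation="relu")(inputs)
embedding = keras.layers.Dense(3, activation="relu", name="embedding")(x)
# ...and the decoder reconstructs the original features from it.
x = keras.layers.Dense(8, activation="relu")(embedding)
outputs = keras.layers.Dense(10)(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# The feature data doubles as the labels: the DNN predicts its own input.
autoencoder.fit(features, features, epochs=10, batch_size=32, verbose=0)

# After training, extract embeddings from the smallest hidden layer.
encoder = keras.Model(inputs, autoencoder.get_layer("embedding").output)
embeddings = encoder.predict(features, verbose=0)  # shape: (1000, 3)
```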

Predictor

An autoencoder is the simplest choice for generating embeddings. However, an autoencoder isn't the optimal choice when certain features could be more important than others in determining similarity. For example, in house data, assume price is more important than postal code. In such cases, use only the important feature as the training label for the DNN. Because this DNN predicts a specific input feature instead of predicting all input features, it is called a predictor DNN. Extract the embeddings from the last hidden layer, just before the output layer, as in the sketch after Figure 3.

Figure 3: Predictor architecture. The input vector is reduced across three hidden layers to a three-node embedding layer, from which the embeddings are extracted; the final output layer is the predicted label value.
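Here's the corresponding sketch for a predictor, again on toy data. Price is held out as the label and, per the list below, removed from the inputs; the nine remaining features and the layer sizes are assumptions for illustration:

```python
import numpy as np
from tensorflow import keras

# Toy stand-in for house data: price is the label, so it is removed from
# the inputs (see the label-leakage note below); 9 other features remain.
house_features = np.random.rand(1000, 9).astype("float32")
prices = np.random.rand(1000, 1).astype("float32")

inputs = keras.Input(shape=(9,))
x = keras.layers.Dense(16, activation="relu")(inputs)
x = keras.layers.Dense(8, activation="relu")(x)
embedding = keras.layers.Dense(3, activation="relu", name="embedding")(x)
output = keras.layers.Dense(1)(embedding)  # the predicted price

predictor = keras.Model(inputs, output)
predictor.compile(optimizer="adam", loss="mse")
predictor.fit(house_features, prices, epochs=10, batch_size=32, verbose=0)

# Extract embeddings from the last hidden layer before the output.
embedder = keras.Model(inputs, predictor.get_layer("embedding").output)
embeddings = embedder.predict(house_features, verbose=0)  # shape: (1000, 3)
```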

When choosing a feature to be the label:

  • Prefer numerical to categorical features, because loss is easier to calculate and interpret for numerical features.

  • Remove the feature that you use as the label from the input to the DNN, or else the DNN will use that feature to perfectly predict the output. (This is an extreme example of label leakage.)

Depending on your choice of labels, the resulting DNN is either an autoencoder or a predictor.
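
Either way, once the embeddings are extracted, quantifying similarity (covered on the following pages) reduces to comparing vectors. As a sketch, a pairwise cosine-similarity matrix computed with NumPy could then be handed to a clustering algorithm:

```python
import numpy as np

def pairwise_cosine_similarity(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for a (num_examples, dim) embedding matrix."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# e.g., the embeddings extracted from the autoencoder or predictor above.
embeddings = np.random.rand(5, 3).astype("float32")
similarity_matrix = pairwise_cosine_similarity(embeddings)  # shape: (5, 5)
```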