This section focuses on two common techniques to get an embedding:
- Dimensionality reduction
- Extracting an embedding from a larger neural net model
Dimensionality reduction techniques
There exist many mathematical techniques for capturing the important structure of a high-dimensional space in a low-dimensional space. In theory, any of these techniques can be used to create an embedding for a machine learning system.
For example, principal component analysis (PCA) has been used to create word embeddings. Given a set of instances such as bag-of-words vectors, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
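As a minimal sketch of this idea, the following pure-Python example finds the first principal component of a tiny set of bag-of-words count vectors, using power iteration on the covariance matrix. The vocabulary, the counts, and the helper names are assumptions made up for illustration; in practice a linear algebra library would normally do this work.

```python
import math

# Hypothetical bag-of-words counts over the vocabulary ["dog", "puppy", "stock"].
# The "dog" and "puppy" columns are perfectly correlated in this toy data.
DOCS = [
    [2, 2, 0],
    [4, 4, 1],
    [6, 6, 0],
    [8, 8, 1],
]

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    dims = range(len(centered[0]))
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in dims]
            for i in dims]

def first_component(cov, iterations=100):
    """Approximate the top eigenvector of cov by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

centered = center(DOCS)
component = first_component(covariance(centered))

# Project each document onto the component: three counts become one number.
embedding_1d = [sum(x * c for x, c in zip(row, component)) for row in centered]
```

In this toy data, the two correlated count columns receive equal weight in the component, so they are effectively merged into a single axis, while the uncorrelated "stock" column contributes very little.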
Training an embedding as part of a neural network
You can create an embedding while training a neural network for your target task. This approach yields an embedding tailored to your particular system, but it may take longer than training the embedding separately.
In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space. This embedding layer can be combined with any other features and hidden layers. As in any deep neural network, the parameters will be optimized during training to minimize loss on the nodes in the network's output layer.
Returning to our food recommendation example, our goal is to predict new meals a user will like based on their current favorite meals. First, we can compile additional data on our users' top five favorite foods. Then, we can model this task as a supervised learning problem. We set four of these top five foods to be feature data, and then randomly set aside the fifth food as the positive label that our model aims to predict, optimizing the model's predictions using a softmax loss.
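The setup described above can be sketched in a few lines of plain Python. The food vocabulary, the user's list, the logits, and the function names are hypothetical; the softmax loss shown is the standard cross-entropy over a softmax, computed here for a single example.

```python
import math
import random

# Hypothetical vocabulary of meals and one user's top five favorites.
VOCAB = ["hot dog", "pizza", "salad", "shawarma", "borscht", "apple pie"]
TOP_FIVE = ["pizza", "hot dog", "shawarma", "apple pie", "salad"]

def make_example(top_five, rng):
    """Hold out one random favorite as the label; the other four are features."""
    label = rng.choice(top_five)
    features = [food for food in top_five if food != label]
    return features, label

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, label_index):
    """Cross-entropy: negative log probability assigned to the true label."""
    return -math.log(softmax(logits)[label_index])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features, label = make_example(TOP_FIVE, rng)

# Made-up model scores, one per meal in VOCAB, standing in for a real model.
logits = [0.2, 1.5, -0.3, 0.8, -1.0, 0.4]
loss = softmax_loss(logits, VOCAB.index(label))
```

During training, a real model would produce the logits from the four feature foods and adjust its weights to reduce this loss across many users.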
During training, the neural network model will learn the optimal weights for the nodes in the first hidden layer, which serves as the embedding layer. For example, if the model contains three nodes in the first hidden layer, it might determine that the three most relevant dimensions of food items are sandwichness, dessertness, and liquidness. Figure 12 shows the one-hot encoded input value for "hot dog" transformed into a three-dimensional vector.
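To make the transformation concrete, here is a small sketch showing that multiplying a one-hot input by the embedding layer's weight matrix simply selects one row of that matrix. The vocabulary and the weight values are invented for illustration and are not the values from the figure.

```python
# Hypothetical vocabulary and embedding weights (one row per food item).
# Columns loosely stand for "sandwichness", "dessertness", "liquidness".
VOCAB = ["hot dog", "apple pie", "borscht", "salad"]
WEIGHTS = [
    [0.9, 0.1, 0.0],   # hot dog
    [0.1, 0.9, 0.1],   # apple pie
    [0.0, 0.1, 0.9],   # borscht
    [0.2, 0.0, 0.1],   # salad
]

def one_hot(item, vocab):
    """Vector of zeros with a single 1 at the item's index."""
    return [1.0 if word == item else 0.0 for word in vocab]

def embed(one_hot_vector, weights):
    """Vector-matrix product of the one-hot input and the weight matrix."""
    dims = len(weights[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vector, weights))
            for j in range(dims)]

hot_dog = embed(one_hot("hot dog", VOCAB), WEIGHTS)
```

Because the product only picks out a row, implementations typically store the weights as a lookup table indexed by item ID rather than performing the full multiplication.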
In the course of training, the weights of the embedding layer will be optimized so that the embedding vectors for similar examples are closer to each other. The individual dimensions of the embedding layer (what each node in the embedding layer represents) are rarely as understandable as "dessertness" or "liquidness." Sometimes what they "mean" can be inferred, but this is not always the case.
Embeddings will usually be specific to the task, and will differ from each other when the task differs. For example, the embeddings generated by a vegetarian vs. non-vegetarian classification model might have two dimensions: meat content and dairy content. Meanwhile, the embeddings generated by a breakfast vs. dinner classifier for American cuisine might have slightly different dimensions: calorie content, grain content, and meat content. "Cereal" and "egg and bacon sandwich" might be close together in the embedding space of a breakfast vs. dinner classifier but far apart in the embedding space of a vegetarian vs. non-vegetarian classifier.
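This contrast can be illustrated with made-up numbers. In the sketch below, the two embedding tables and all of their coordinates are hypothetical, chosen only to mirror the dimensions named above; Euclidean distance is used as one reasonable measure of closeness.

```python
import math

# Hypothetical embeddings from a breakfast vs. dinner classifier:
# (calorie content, grain content, meat content)
MEAL_TIME = {
    "cereal": [0.3, 0.9, 0.0],
    "egg and bacon sandwich": [0.5, 0.7, 0.3],
    "steak": [0.9, 0.0, 1.0],
}

# Hypothetical embeddings from a vegetarian vs. non-vegetarian classifier:
# (meat content, dairy content)
VEGETARIAN = {
    "cereal": [0.0, 0.6],
    "egg and bacon sandwich": [0.9, 0.2],
    "steak": [1.0, 0.0],
}

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = "cereal", "egg and bacon sandwich"
d_meal_time = distance(MEAL_TIME[a], MEAL_TIME[b])
d_vegetarian = distance(VEGETARIAN[a], VEGETARIAN[b])
```

With these invented values, the two foods sit much closer together in the meal-time space than in the vegetarian space, even though the items themselves are the same.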
Training a word embedding
In the previous section, you explored a visualization of semantic relationships in the word2vec embedding space.
Word2vec is one of many algorithms used for training word embeddings. It relies on the distributional hypothesis to map semantically similar words to geometrically close embedding vectors. The distributional hypothesis states that words that often share the same neighboring words tend to be semantically similar. Both "dog" and "cat" frequently appear close to the word "veterinarian," and this fact reflects their semantic similarity. As the linguist John Firth put it in 1957, "You shall know a word by the company it keeps."
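The distributional hypothesis itself can be illustrated without word2vec. The sketch below is not the word2vec algorithm; it simply counts neighboring words within a fixed window and compares the resulting count vectors with cosine similarity. The corpus, the window size, and the function names are assumptions made for this example.

```python
import math
from collections import Counter

# Tiny made-up corpus; each sentence is tokenized by whitespace.
CORPUS = [
    "the dog visited the veterinarian today",
    "the cat visited the veterinarian today",
    "the stock fell on the market today",
]

def context_counts(target, sentences, window=3):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

dog = context_counts("dog", CORPUS)
cat = context_counts("cat", CORPUS)
stock = context_counts("stock", CORPUS)

similar = cosine(dog, cat)      # share all neighbors in this corpus
dissimilar = cosine(dog, stock) # share only the common word "the"
```

Even in this tiny corpus, "dog" and "cat" end up with more similar context vectors than "dog" and "stock" because they keep the same company, including "veterinarian."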
The following video explains another method of creating a word embedding as part of the process of training a neural network, using a simpler model:
Static vs. contextual embeddings
One limitation of word embeddings like the one discussed in the video above is that they are static. Each word is represented by a single point in vector space, even though it may have a variety of different meanings, depending on how it is used in a sentence. In the last exercise, you discovered the difficulty of mapping semantic similarities for the word orange, which can signify either a color or a type of fruit.
Contextual embeddings were developed to address these shortcomings. Contextual embeddings allow for multiple representations of the same word, each incorporating information about the context in which the word is used. In a contextual embedding, the word orange might have two separate representations: one capturing the "color" usage of the word, as in sentences like "My favorite sweater has orange stripes," and one capturing the "fruit" usage of the word, as in sentences like "The orange was plucked from the tree before it had fully ripened."
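The difference can be shown with a deliberately crude toy. In the sketch below, the static table and its two-dimensional values are invented, and the "contextual" vector is just the average of a word's static vector with those of the other known words in the sentence. Real contextual models learn how to mix in context with a neural network; the averaging here is only an assumption used to show that the same word can receive different vectors in different sentences.

```python
# Invented static embeddings: (color-ness, fruit-ness).
STATIC = {
    "orange": [0.5, 0.5],
    "stripes": [0.9, 0.0],
    "sweater": [0.8, 0.0],
    "plucked": [0.0, 0.8],
    "tree": [0.1, 0.9],
    "ripened": [0.0, 0.9],
}

def static_embedding(word):
    """A static embedding ignores the sentence entirely."""
    return STATIC[word]

def toy_contextual_embedding(word, sentence):
    """Average the word's vector with those of other known words nearby.

    This is a stand-in for a learned contextual model, not a real one.
    """
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    vectors = [STATIC[word]] + [STATIC[t] for t in tokens
                                if t in STATIC and t != word]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

color_sentence = "My favorite sweater has orange stripes."
fruit_sentence = "The orange was plucked from the tree before it had fully ripened."

as_color = toy_contextual_embedding("orange", color_sentence)
as_fruit = toy_contextual_embedding("orange", fruit_sentence)
```

The static lookup returns the same vector for "orange" in both sentences, while the toy contextual version leans toward the color axis in the first sentence and toward the fruit axis in the second.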