[null,null,["最后更新时间 (UTC):2025-05-20。"],[[["\u003cp\u003eEmbeddings can be created using dimensionality reduction techniques like PCA or by training them as part of a neural network.\u003c/p\u003e\n"],["\u003cp\u003eTraining an embedding within a neural network allows customization for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space.\u003c/p\u003e\n"],["\u003cp\u003eWord embeddings, like word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors.\u003c/p\u003e\n"],["\u003cp\u003eStatic word embeddings have limitations as they assign a single representation per word, while contextual embeddings offer multiple representations based on context.\u003c/p\u003e\n"]]],[],null,["# Embeddings: Obtaining embeddings\n\nThis section covers several means of obtaining embeddings, as well as how\nto transform static embeddings into contextual embeddings.\n\nDimensionality reduction techniques\n-----------------------------------\n\nThere are many mathematical techniques that capture the important\nstructures of a high-dimensional space in a low-dimensional space. In theory,\nany of these techniques can be used to create an embedding for a machine\nlearning system.\n\nFor example, [principal component analysis](https://wikipedia.org/wiki/Principal_component_analysis) (PCA)\nhas been used to create word embeddings. Given a set of instances like\n[**bag of words**](/machine-learning/glossary#bag-of-words) vectors, PCA tries\nto find highly correlated dimensions that can be collapsed into a single\ndimension.\n\nTraining an embedding as part of a neural network\n-------------------------------------------------\n\nYou can create an embedding while training a\n[**neural network**](/machine-learning/crash-course/neural-networks) for\nyour target task. This approach gets you an embedding well customized for your\nparticular system, but may take longer than training the embedding separately.\n\nIn general, you can create a hidden layer of size *d* in your neural\nnetwork that is designated as the\n[**embedding layer**](/machine-learning/glossary#embedding-layer), where *d*\nrepresents both the number of nodes in the hidden layer and the number\nof dimensions in the embedding space. This embedding layer can be combined with\nany other features and hidden layers. As in any deep neural network, the\nparameters will be optimized during training to minimize loss on the nodes in\nthe network's output layer.\n\nReturning to our [food recommendation example](/machine-learning/crash-course/embeddings), our goal is\nto predict new meals a user will like based on their current favorite\nmeals. First, we can compile additional data on our users' top five favorite\nfoods. Then, we can model this task as a supervised learning problem. We set\nfour of these top five foods to be feature data, and then randomly set aside the\nfifth food as the positive label that our model aims to predict, optimizing the\nmodel's predictions using a [**softmax**](/machine-learning/glossary#softmax)\nloss.\n\nDuring training, the neural network model will learn the optimal weights for\nthe nodes in the first hidden layer, which serves as the embedding layer.\nFor example, if the model contains three nodes in the first hidden layer,\nit might determine that the three most relevant dimensions of food items are\nsandwichness, dessertness, and liquidness. 
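Here is a minimal sketch of that training setup in Keras. The vocabulary size,
the layer name, and the randomly generated training data are hypothetical
placeholders standing in for the food-recommendation dataset described above.
Looking up an integer food ID in an `Embedding` layer is equivalent to
multiplying a one-hot vector by the layer's weight matrix, so this matches the
one-hot description above.

```python
import numpy as np
import tensorflow as tf

NUM_FOODS = 5000  # Hypothetical number of distinct foods in the vocabulary.
EMBED_DIM = 3     # d: the number of nodes in the embedding layer.

# Each example: four of a user's top five foods as features (integer IDs),
# with the held-out fifth food as the label.
inputs = tf.keras.Input(shape=(4,), dtype="int32")

# The embedding layer maps each food ID to a d-dimensional vector.
embedded = tf.keras.layers.Embedding(
    NUM_FOODS, EMBED_DIM, name="food_embedding")(inputs)

# Average the four food vectors into a single d-dimensional representation.
pooled = tf.keras.layers.GlobalAveragePooling1D()(embedded)

# A softmax over the food vocabulary predicts the held-out fifth food.
outputs = tf.keras.layers.Dense(NUM_FOODS, activation="softmax")(pooled)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder training data standing in for real user favorites.
x = np.random.randint(0, NUM_FOODS, size=(1024, 4))
y = np.random.randint(0, NUM_FOODS, size=(1024,))
model.fit(x, y, epochs=1, verbose=0)

# One learned d-dimensional vector per food, analogous to the
# [2.98, -0.75, 0] vector for "hot dog" in Figure 12.
food_embeddings = model.get_layer("food_embedding").get_weights()[0]
print(food_embeddings.shape)  # (5000, 3)
```

After training, the rows of `food_embeddings` can be used directly as the
embedding vectors for the corresponding foods.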
Contextual embeddings
---------------------

One limitation of `word2vec` static embedding vectors is that words can mean
different things in different contexts. "Yeah" means one thing on its own,
but the opposite in the phrase "Yeah, right." "Post" can mean "mail,"
"to put in the mail," "earring backing," "marker at the end of a horse race,"
"postproduction," "pillar," "to put up a notice," "to station a guard or
soldier," or "after," among other possibilities.

However, with static embeddings, each word is represented by a single point
in vector space, even though it may have a variety of meanings.
In the [last exercise](/machine-learning/crash-course/embeddings/embedding-space#exercise),
you discovered the limitations of static embeddings for the word
*orange*, which can signify either a color or a type of fruit. With only one
static embedding, *orange* will always be closer to other colors than to
*juice* in an embedding trained on the `word2vec` dataset.

**Contextual embeddings** were developed to address this limitation.
Contextual embeddings allow a word to be represented by multiple embeddings
that incorporate information about the surrounding words as well as the
word itself. *Orange* would have a different embedding for every unique
sentence containing the word in the dataset.

Some methods for creating contextual embeddings, like
[ELMo](https://wikipedia.org/wiki/ELMo), take the static
embedding of an example, such as the `word2vec` vector for a word in a sentence,
and transform it by a function that incorporates information about the words
around it. This produces a contextual embedding.

**More details on contextual embeddings**

- For ELMo models specifically, the static embedding is aggregated with
  embeddings taken from other layers, which encode front-to-back and
  back-to-front readings of the sentence.
- [BERT](/machine-learning/glossary#bert-bidirectional-encoder-representations-from-transformers)
  models mask part of the sequence that the model takes as input.
- Transformer models use a
  [self-attention](/machine-learning/glossary#self-attention-also-called-self-attention-layer)
  layer to weight the relevance of the other words in a sequence to each
  individual word. They also add the relevant column from a **positional
  embedding matrix** (see
  [positional encoding](/machine-learning/glossary#positional-encoding)) to
  each previously learned token embedding, element by element, to produce the
  input embedding that is fed into the rest of the model for inference. This
  **input embedding**, unique to each distinct textual sequence, is a
  contextual embedding.
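To make this concrete, here is a minimal sketch of extracting contextual
embeddings from a pretrained BERT-style encoder, assuming the Hugging Face
`transformers` library, PyTorch, and the `bert-base-uncased` checkpoint. The
two example sentences are arbitrary illustrations, not part of the course
material.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint choice; any BERT-style encoder behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I drank a glass of orange juice.",   # orange as a fruit
    "The sunset turned the sky orange.",  # orange as a color
]

def orange_vector(sentence):
    """Returns the contextual embedding of the token 'orange' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # One 768-dimensional vector per token, shaped by the whole sentence.
        hidden_states = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden_states[tokens.index("orange")]

fruit_vec, color_vec = (orange_vector(s) for s in sentences)

# A static embedding would give "orange" the same vector in both sentences;
# here the two vectors differ because each reflects its surrounding context.
print(torch.cosine_similarity(fruit_vec, color_vec, dim=0))
```

Because self-attention mixes information from the surrounding words into each
token's vector, *orange* gets a different vector in each sentence, whereas a
static `word2vec` embedding of *orange* would be identical in both.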
| **NOTE:** See the [LLM module](/machine-learning/ml-crash-course/llm/whats-an-llm#whats_a_transformer) for more details on transformers and encoder-decoder architecture.

While the models described above are language models, contextual embeddings
are also useful in other generative tasks, like images. In a photo of a horse,
an embedding of a pixel's RGB values combined with a positional matrix
representing each pixel and some encoding of the neighboring pixels forms a
contextual embedding, which provides the model with more information than a
static embedding of the RGB values alone.

| **Key terms:**
|
| - [Bag of words](/machine-learning/glossary#bag-of-words)
| - [BERT](/machine-learning/glossary#bert-bidirectional-encoder-representations-from-transformers)
| - [Embedding layer](/machine-learning/glossary#embedding-layer)
| - [Embedding vector](/machine-learning/glossary#embedding-vector)
| - [Neural network](/machine-learning/glossary#neural-network)
| - [Positional encoding](/machine-learning/glossary#positional-encoding)
| - [Softmax](/machine-learning/glossary#softmax)