[null,null,["最后更新时间 (UTC):2024-07-26。"],[[["\u003cp\u003eDeep Neural Networks (DNNs) for recommendation address limitations of matrix factorization by incorporating side features and improving relevance.\u003c/p\u003e\n"],["\u003cp\u003eSoftmax DNN treats recommendation as a multiclass prediction problem, predicting the probability of user interaction with each item.\u003c/p\u003e\n"],["\u003cp\u003eDNNs learn embeddings for both queries and items, using a nonlinear function to map features to embeddings.\u003c/p\u003e\n"],["\u003cp\u003eTwo-tower neural networks further enhance DNN models by using separate networks to learn embeddings for queries and items based on their features, enabling the use of item features for improved recommendations.\u003c/p\u003e\n"]]],[],null,["# Deep neural network models\n\nThe previous section showed you how to use matrix factorization to\nlearn embeddings. Some limitations of matrix factorization include:\n\n- The difficulty of using side features (that is, any features beyond the query ID/item ID). As a result, the model can only be queried with a user or item present in the training set.\n- Relevance of recommendations. Popular items tend to be recommended for everyone, especially when using dot product as a similarity measure. It is better to capture specific user interests.\n\nDeep neural network (DNN) models can address these limitations of matrix\nfactorization. DNNs can easily incorporate query features and item features\n(due to the flexibility of the input layer of the network), which can help\ncapture the specific interests of a user and improve the relevance of\nrecommendations.\n\nSoftmax DNN for recommendation\n------------------------------\n\nOne possible DNN model is [softmax](/machine-learning/glossary#softmax),\nwhich treats the problem as a multiclass prediction problem in which:\n\n- The input is the user query.\n- The output is a probability vector with size equal to the number of items in the corpus, representing the probability to interact with each item; for example, the probability to click on or watch a YouTube video.\n\n### Input\n\nThe input to a DNN can include:\n\n- dense features (for example, watch time and time since last watch)\n- sparse features (for example, watch history and country)\n\nUnlike the matrix factorization approach, you can add side features such as\nage or country. We'll denote the input vector by x.\n**Figure 1. The input layer, x.**\n\n### Model architecture\n\nThe model architecture determines the complexity and expressivity of the model.\nBy adding hidden layers and non-linear activation functions (for example, ReLU),\nthe model can capture more complex relationships in the data. However,\nincreasing the number of parameters also typically makes the model harder to\ntrain and more expensive to serve. We will denote the output of the last hidden\nlayer by \\\\(\\\\psi (x) \\\\in \\\\mathbb R\\^d\\\\).\n**Figure 2. 
### Softmax output: predicted probability distribution

The model maps the output of the last layer, \(\psi(x)\), through a softmax
layer to a probability distribution \(\hat p = h(\psi(x) V^T)\), where:

- \(h : \mathbb R^n \to \mathbb R^n\) is the softmax function, given by \(h(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}\)
- \(V \in \mathbb R^{n \times d}\) is the matrix of weights of the softmax layer.

The softmax layer maps a vector of scores \(y \in \mathbb R^n\)
(sometimes called the
[**logits**](https://developers.google.com/machine-learning/glossary/#logits))
to a probability distribution.

**Figure 3. The predicted probability distribution, \(\hat p = h(\psi(x) V^T)\).**

**Did you know?** The name softmax is a play on words. A "hard" max assigns
probability 1 to the item with the largest score \(y_i\). By contrast, the
softmax assigns a non-zero probability to all items, giving a higher
probability to items that have higher scores. When the scores are scaled, the
softmax \(h(\alpha y)\) converges to a "hard" max in the limit
\(\alpha \to \infty\).

### Loss function

Finally, define a loss function that compares the following:

- \(\hat p\), the output of the softmax layer (a probability distribution)
- \(p\), the ground truth, representing the items the user has interacted with (for example, the YouTube videos the user clicked or watched). This can be represented as a normalized multi-hot distribution (a probability vector).

For example, you can use the cross-entropy loss since you are comparing
two probability distributions.

**Figure 4. The loss function.**
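To make these two formulas concrete, here is a small numeric sketch (not part of the course) that computes \(\hat p = h(\psi(x) V^T)\) and the cross-entropy loss with NumPy, using made-up sizes of \(n = 5\) items and \(d = 3\) hidden units.

```python
import numpy as np

# Made-up sizes: n = 5 items in the corpus, d = 3 hidden units.
rng = np.random.default_rng(seed=0)
psi_x = rng.normal(size=3)       # psi(x): output of the last hidden layer
V = rng.normal(size=(5, 3))      # softmax weights; row j is the item embedding V_j

logits = V @ psi_x               # scores y = psi(x) V^T, one per item
p_hat = np.exp(logits) / np.sum(np.exp(logits))   # softmax h(y)

# Ground truth p: the user interacted with items 1 and 3 (normalized multi-hot).
p = np.array([0.0, 0.5, 0.0, 0.5, 0.0])

cross_entropy = -np.sum(p * np.log(p_hat))
print("p_hat:", p_hat.round(3))
print("cross-entropy loss:", round(float(cross_entropy), 3))
```

In practice, libraries usually compute the cross-entropy directly from the logits rather than from the normalized probabilities, which is numerically more stable.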
### Softmax embeddings

The probability of item \(j\) is given by
\(\hat p_j = \frac{\exp(\langle \psi(x), V_j\rangle)}{Z}\),
where \(Z\) is a normalization constant that does not depend on \(j\).

In other words, \(\log(\hat p_j) = \langle \psi(x), V_j\rangle - \log(Z)\),
so the log probability of an item \(j\) is (up to an additive constant)
the dot product of two \(d\)-dimensional vectors, which can be interpreted
as query and item embeddings:

- \(\psi(x) \in \mathbb R^d\) is the output of the last hidden layer. We call it the embedding of the query \(x\).
- \(V_j \in \mathbb R^d\) is the vector of weights connecting the last hidden layer to output \(j\). We call it the embedding of item \(j\).

**Note:** Since \(\log\) is an increasing function, the items \(j\) with the highest probability \(\hat p_j\) are the items with the highest dot product \(\langle \psi(x), V_j\rangle\). Therefore, the dot product can be interpreted as a similarity measure in this embedding space.

**Figure 5. Embedding of item \(j\), \(V_j \in \mathbb R^d\).**

DNN and matrix factorization
----------------------------

In both the softmax model and the matrix factorization model,
the system learns one embedding vector \(V_j\) per item \(j\). What we called the
*item embedding matrix* \(V \in \mathbb R^{n \times d}\) in matrix
factorization is now the matrix of weights of the softmax layer.

The query embeddings, however, are different. Instead of learning
one embedding \(U_i\) per query \(i\), the system learns a mapping
from the query features \(x\) to an embedding \(\psi(x) \in \mathbb R^d\).
Therefore, you can think of this DNN model as a generalization of matrix
factorization in which you replace the query side by a nonlinear
function \(\psi(\cdot)\).

### Can you use item features?

Can you apply the same idea to the item side? That is, instead of learning
one embedding per item, can the model learn a nonlinear function that maps
item features to an embedding? Yes. To do so, use a two-tower
neural network, which consists of two neural networks:

- One neural network maps query features \(x_{\text{query}}\) to a query embedding \(\psi(x_{\text{query}}) \in \mathbb R^d\).
- One neural network maps item features \(x_{\text{item}}\) to an item embedding \(\phi(x_{\text{item}}) \in \mathbb R^d\).

The output of the model is the dot product
\(\langle \psi(x_{\text{query}}), \phi(x_{\text{item}}) \rangle\).
Note that this is not a softmax model anymore. The new model predicts
one value per pair \((x_{\text{query}}, x_{\text{item}})\)
instead of a probability vector for each query \(x_{\text{query}}\).
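The sketch below shows what such a two-tower model might look like in Keras. The input shapes (10 query features, 20 item features), layer sizes, and names are hypothetical; this is an illustration of the structure described above, not the course's reference implementation.

```python
import tensorflow as tf

d = 32  # shared embedding dimension

# Hypothetical feature vectors for the query and the item.
query_features = tf.keras.Input(shape=(10,), name="query_features")
item_features = tf.keras.Input(shape=(20,), name="item_features")

# Query tower: psi(x_query) in R^d.
q = tf.keras.layers.Dense(64, activation="relu")(query_features)
query_embedding = tf.keras.layers.Dense(d)(q)

# Item tower: phi(x_item) in R^d.
i = tf.keras.layers.Dense(64, activation="relu")(item_features)
item_embedding = tf.keras.layers.Dense(d)(i)

# The model output is the dot product <psi(x_query), phi(x_item)>:
# one score per (query, item) pair rather than a probability vector.
score = tf.keras.layers.Dot(axes=1)([query_embedding, item_embedding])

two_tower = tf.keras.Model(
    inputs=[query_features, item_features], outputs=score)
two_tower.summary()
```

At serving time, the item embeddings \(\phi(x_{\text{item}})\) are often precomputed for the whole corpus, so recommending items for a query reduces to a nearest-neighbor search in the shared embedding space.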