Measuring similarity from embeddings
You now have embeddings for any pair of examples. A supervised similarity measure takes these embeddings and returns a number measuring their similarity. Remember that embeddings are vectors of numbers. To find the similarity between two vectors \(A = [a_1,a_2,...,a_n]\) and \(B = [b_1,b_2,...,b_n]\), choose one of these three similarity measures:
| Measure | Meaning | Formula | As similarity increases, this measure... |
|---|---|---|---|
| Euclidean distance | Distance between ends of vectors | \(\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+...+(a_n-b_n)^2}\) | Decreases |
| Cosine | Cosine of angle \(\theta\) between vectors | \(\frac{a^T b}{\|a\| \cdot \|b\|}\) | Increases |
| Dot product | Cosine multiplied by lengths of both vectors | \(a_1b_1+a_2b_2+...+a_nb_n = \|a\|\|b\|\cos(\theta)\) | Increases. Also increases with the length of the vectors. |
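As a quick illustration (a minimal sketch, not part of the original page), all three measures can be computed directly with NumPy; the example vectors below are hypothetical embeddings:

```python
import numpy as np

def similarity_measures(a: np.ndarray, b: np.ndarray) -> dict:
    """Return all three similarity measures between embedding vectors a and b."""
    dot = np.dot(a, b)                                       # |a||b|cos(theta)
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cos(theta)
    euclidean = np.linalg.norm(a - b)                        # distance between vector ends
    return {"euclidean": euclidean, "cosine": cosine, "dot_product": dot}

# Hypothetical 3-dimensional embeddings for two examples.
a = np.array([0.9, 0.1, 0.4])
b = np.array([0.7, 0.3, 0.2])
print(similarity_measures(a, b))
```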
Choosing a similarity measure
In contrast to the cosine, the dot product is proportional to the vector length. This matters because examples that appear very frequently in the training set (for example, popular YouTube videos) tend to have embedding vectors with large lengths.
If you want to capture popularity, then choose the dot product. However, popular examples may skew the similarity metric. To balance this skew, you can raise the lengths to an exponent \(\alpha < 1\) and calculate the dot product as \(|a|^{\alpha}|b|^{\alpha}\cos(\theta)\).
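A sketch of this dampened dot product follows; the function name and the default \(\alpha = 0.5\) are illustrative assumptions, not from the original text:

```python
import numpy as np

def damped_dot_product(a: np.ndarray, b: np.ndarray, alpha: float = 0.5) -> float:
    """Dot product with each vector length raised to alpha < 1.

    Damping the lengths reduces the skew introduced by popular examples,
    whose embeddings tend to be long. alpha=0.5 is an arbitrary example value.
    """
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    cos_theta = np.dot(a, b) / (norm_a * norm_b)
    return (norm_a ** alpha) * (norm_b ** alpha) * cos_theta
```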
To better understand how vector length changes the similarity measure, normalize the vector lengths to 1 and notice that the three measures become proportional to each other.
Proof: proportionality of the similarity measures

After normalizing a and b such that \(||a||=1\) and \(||b||=1\), the three measures are related as:

- Euclidean distance = \(||a-b|| = \sqrt{||a||^2 + ||b||^2 - 2a^{T}b} = \sqrt{2-2\cos(\theta_{ab})}\).
- Dot product = \(|a||b| \cos(\theta_{ab}) = 1\cdot1\cdot \cos(\theta_{ab}) = \cos(\theta_{ab})\).
- Cosine = \(\cos(\theta_{ab})\).

Thus, all three similarity measures are equivalent because they are proportional to \(\cos(\theta_{ab})\).
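A short numerical check (a sketch using illustrative random vectors) confirms these relationships after normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_theta = np.dot(a, b)           # cosine; also the dot product, since |a| = |b| = 1
euclidean = np.linalg.norm(a - b)  # Euclidean distance

# Check: Euclidean distance equals sqrt(2 - 2*cos(theta_ab)).
assert np.isclose(euclidean, np.sqrt(2 - 2 * cos_theta))
print(f"cos(theta) = {cos_theta:.4f}, Euclidean distance = {euclidean:.4f}")
```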
Review of similarity measures
A similarity measure quantifies the similarity between a pair of examples, relative to other pairs of examples. The two types, manual and supervised, are compared below:
| Type | How to create | Best for | Implications |
|---|---|---|---|
| Manual | Manually combine feature data. | Small datasets with features that are straightforward to combine. | Gives insight into the results of the similarity calculations. If the feature data changes, you must manually update the similarity measure. |
| Supervised | Measure the distance between embeddings generated by a supervised DNN. | Large datasets with hard-to-combine features. | Gives no insight into the results. However, a DNN can automatically adapt to changing feature data. |