Manual similarity measure
As just shown, k-means assigns points to their closest centroid. But what does "closest" mean?
To apply k-means to feature data, you need to define a measure of similarity that combines all the feature data into a single numeric value, called a **manual similarity measure**.
Consider a shoe dataset. If that dataset has shoe size as its only feature, you can define the similarity of two shoes in terms of the difference between their sizes. The smaller the numerical difference between sizes, the greater the similarity between the shoes.
If the shoe dataset has two numeric features, size and price, you can combine them into a single number representing similarity. First scale the data so the two features are comparable (a code sketch follows the list):
- Size (s): Shoe size probably forms a Gaussian distribution. Confirm this, then normalize the data.
- Price (p): The data is probably a Poisson distribution. Confirm this. If you have enough data, convert the data to quantiles and scale to \([0,1]\).
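As a minimal sketch of this preparation, assuming hypothetical size and price arrays (the values, the z-score normalization for size, and the rank-based quantile scaling for price are all illustrative choices, not a prescribed pipeline):

```python
import numpy as np

# Hypothetical raw feature values; a real dataset would be far larger.
sizes = np.array([6.0, 7.5, 8.0, 9.0, 11.0, 12.5])
prices = np.array([40.0, 55.0, 80.0, 120.0, 150.0, 300.0])

# Size: assumed roughly Gaussian, so z-score normalize it.
sizes_norm = (sizes - sizes.mean()) / sizes.std()

# Price: assumed roughly Poisson. With enough data, rank each value and
# scale the ranks to [0, 1] as a simple stand-in for quantile conversion.
ranks = prices.argsort().argsort()          # 0-based rank of each price
prices_scaled = ranks / (len(prices) - 1)   # scaled to [0, 1]
```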
Next, combine the two features by calculating the [root mean squared error](/machine-learning/glossary#RMSE) (RMSE). This rough measure of similarity is given by \(\sqrt{\frac{(s_i - s_j)^2+(p_i - p_j)^2}{2}}\).
For a simple example, calculate the similarity of two shoes with US sizes 8 and 11, and prices 120 and 150. Since we don't have enough data to understand the distributions, we scale the data without normalizing or using quantiles.
| Action | Method |
|---|---|
| Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55. |
| Scale the price. | Divide 120 and 150 by the maximum price 150 to get 0.8 and 1. |
| Find the difference in size. | \(0.55 - 0.4 = 0.15\) |
| Find the difference in price. | \(1 - 0.8 = 0.2\) |
| Calculate the RMSE. | \(\sqrt{\frac{0.2^2+0.15^2}{2}} = 0.17\) |
Intuitively, your similarity measure should increase when feature data is more similar. Instead, your similarity measure (RMSE) actually decreases. Make your similarity measure follow your intuition by subtracting the RMSE from 1.
\[\text{Similarity} = 1 - 0.17 = 0.83\]
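The worked example above can be checked with a few lines of code. This is only a sketch of the simplified scaling used here; the maximum size of 20 and maximum price of 150 are the assumptions from the table:

```python
import math

MAX_SIZE, MAX_PRICE = 20, 150  # assumed maxima from the example

s_i, s_j = 8 / MAX_SIZE, 11 / MAX_SIZE        # 0.4, 0.55
p_i, p_j = 120 / MAX_PRICE, 150 / MAX_PRICE   # 0.8, 1.0

rmse = math.sqrt(((s_i - s_j) ** 2 + (p_i - p_j) ** 2) / 2)
print(rmse)      # 0.1767...; the text rounds this to 0.17
print(1 - rmse)  # 0.8232...; the text reports 1 - 0.17 = 0.83
```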
In general, you can prepare numerical data as described in [Prepare data](/machine-learning/clustering/prepare-data), then combine the data by using Euclidean distance.
What if the dataset included both shoe size and shoe color? Color is [categorical data](/machine-learning/glossary#categorical_data), discussed in Machine Learning Crash Course in [Working with categorical data](/machine-learning/crash-course/categorical-data). Categorical data is harder to combine with the numerical size data. It can be:
- Single-valued (univalent), such as a car's color ("white" or "blue", but never both)
- Multi-valued (multivalent), such as a movie's genre (a movie can be both "action" and "comedy", or only "action")
If univalent data matches, for example in the case of two pairs of blue shoes, the similarity between the examples is 1. Otherwise, the similarity is 0.
Multivalent data, like movie genres, is harder to work with. If there is a fixed set of movie genres, similarity can be calculated as the ratio of values in common, called [**Jaccard similarity**](https://wikipedia.org/wiki/Jaccard_index). Example calculations of Jaccard similarity (a code sketch follows the list):
- ["comedy", "action"] and ["comedy", "action"] = 1
- ["comedy", "action"] and ["action"] = ½
- ["comedy", "action"] and ["action", "drama"] = ⅓
- ["comedy", "action"] and ["non-fiction", "biographical"] = 0
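A minimal sketch of Jaccard similarity over genre sets (the function name is our own; the example sets match the list above):

```python
def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Ratio of shared values to all distinct values across both sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard_similarity({"comedy", "action"}, {"comedy", "action"}))  # 1.0
print(jaccard_similarity({"comedy", "action"}, {"action"}))            # 0.5
print(jaccard_similarity({"comedy", "action"}, {"action", "drama"}))   # 0.333...
print(jaccard_similarity({"comedy", "action"},
                         {"non-fiction", "biographical"}))             # 0.0
```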
Jaccard similarity is not the only possible manual similarity measure for categorical data. Two other examples:
- **Postal codes** can be converted into latitude and longitude before calculating the Euclidean distance between them.
- **Color** can be converted into numeric RGB values, with differences in values combined into Euclidean distance, as sketched below.
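To illustrate the color idea, here is a minimal sketch assuming a hypothetical name-to-RGB lookup table (a real application would use a complete color map):

```python
import math

# Hypothetical color lookup; any consistent name-to-RGB mapping would do.
RGB = {"white": (255, 255, 255), "blue": (0, 0, 255), "navy": (0, 0, 128)}

def color_distance(c1: str, c2: str) -> float:
    """Euclidean distance between two colors in RGB space, scaled to [0, 1]."""
    max_dist = math.dist((0, 0, 0), (255, 255, 255))  # largest possible distance
    return math.dist(RGB[c1], RGB[c2]) / max_dist

print(color_distance("blue", "navy"))   # small value: similar colors
print(color_distance("white", "blue"))  # large value: dissimilar colors
```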
See [Working with categorical data](/machine-learning/crash-course/categorical-data) for more.
In general, a manual similarity measure must correspond directly to actual similarity. If your chosen metric does not, it isn't encoding the information you want it to encode.
Pre-process your data carefully before calculating a similarity measure. The examples on this page are simplified; most real-world datasets are large and complex. As previously mentioned, quantiles are a good default choice for processing numeric data.
As the complexity of the data increases, it becomes harder to create a manual similarity measure. In that situation, switch to a **supervised similarity measure**, in which a supervised machine learning model calculates similarity. This is discussed in more detail later.
Key terms:

- [categorical data](/machine-learning/glossary#categorical_data)
- [Jaccard similarity](https://wikipedia.org/wiki/Jaccard_index)