Manual similarity measure
As just shown, k-means assigns points to their closest centroid. But what does "closest" mean?
To apply k-means to feature data, you need to define a measure of similarity that combines all the feature data into a single numeric value, called a **manual similarity measure**.
Consider a shoe dataset. If that dataset has shoe size as its only feature, you can define the similarity of two shoes in terms of the difference between their sizes: the smaller the numerical difference between sizes, the greater the similarity between the shoes.
If that shoe dataset has two numeric features, size and price, you can combine them into a single number representing similarity. First, scale the data so that the two features are comparable:
- Size (s): Shoe size probably follows a Gaussian distribution. Confirm this, then normalize the data.
- Price (p): The data probably follows a Poisson distribution. Confirm this. If you have enough data, convert the data to quantiles and scale it to \([0,1]\) (see the sketch after this list).
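As a minimal sketch of the quantile step, assuming the data is just a 1-D array of raw prices (the example values below are hypothetical), you can rank each value against the rest and scale the ranks to \([0,1]\):

```python
import numpy as np

def quantile_scale(values):
    """Map each value to its quantile rank, scaled to [0, 1]."""
    values = np.asarray(values, dtype=float)
    # argsort of argsort yields each element's rank (0 .. n-1).
    ranks = np.argsort(np.argsort(values))
    return ranks / (len(values) - 1)

# Hypothetical prices; the smallest maps to 0.0, the largest to 1.0.
prices = [30, 120, 150, 45, 99]
print(quantile_scale(prices))  # [0.   0.75 1.   0.25 0.5 ]
```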
Next, combine the two features by calculating the root mean squared error (RMSE). This rough measure of similarity is given by \(\sqrt{\frac{(s_i - s_j)^2+(p_i - p_j)^2}{2}}\).
For a simple example, calculate the similarity of two shoes with US sizes 8 and 11, priced at $120 and $150. Because we don't have enough data to understand the distributions, we'll scale the data without normalizing it or using quantiles.
| Action | Method |
|--------|--------|
| Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size, 20, to get 0.4 and 0.55. |
| Scale the price. | Divide 120 and 150 by the maximum price, 150, to get 0.8 and 1. |
| Find the difference in size. | \(0.55 - 0.4 = 0.15\) |
| Find the difference in price. | \(1 - 0.8 = 0.2\) |
| Calculate the RMSE. | \(\sqrt{\frac{0.2^2+0.15^2}{2}} = 0.17\) |
Intuitively, your similarity measure should increase when the feature data is more similar. Instead, your similarity measure (the RMSE) actually decreases. Make the similarity measure follow your intuition by subtracting it from 1:
\[\text{Similarity} = 1 - 0.17 = 0.83\]
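The whole worked example fits in a few lines of Python. This is a minimal sketch of the manual measure above; the maximum size (20) and maximum price (150) are the assumptions made in the table, not fixed constants:

```python
import math

MAX_SIZE = 20.0    # assumed maximum possible shoe size
MAX_PRICE = 150.0  # assumed maximum price in the dataset

def shoe_similarity(size_a, price_a, size_b, price_b):
    """Manual similarity measure: 1 - RMSE of the scaled features."""
    # Scale both features to [0, 1] so they are comparable.
    s_a, s_b = size_a / MAX_SIZE, size_b / MAX_SIZE
    p_a, p_b = price_a / MAX_PRICE, price_b / MAX_PRICE
    # Combine the per-feature differences with the RMSE.
    rmse = math.sqrt(((s_a - s_b) ** 2 + (p_a - p_b) ** 2) / 2)
    return 1 - rmse

sim = shoe_similarity(8, 120, 11, 150)
print(f"{sim:.4f}")  # 0.8232, i.e. the table's 1 - 0.17 = 0.83 up to rounding
```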
In general, you can prepare numerical data as described in Prepare data, then combine the data by using Euclidean distance.
What if the dataset included both shoe size and shoe color? Color is categorical data, discussed in Machine Learning Crash Course in Working with categorical data. Categorical data is harder to combine with the numerical size data. It can be:
- Single-valued (univalent), such as a car's color ("white" or "blue", but never both)
- Multi-valued (multivalent), such as a movie's genre (a movie can be both "action" and "comedy", or only "action")
If univalent data matches, for example in the case of two pairs of blue shoes, the similarity between the examples is 1. Otherwise, the similarity is 0.
Multivalent data, like movie genres, is harder to work with. If there is a fixed set of movie genres, similarity can be calculated using the ratio of values in common, called **Jaccard similarity**. Example calculations of Jaccard similarity (a code sketch follows the list):
- ["comedy", "action"] and ["comedy", "action"] = 1
- ["comedy", "action"] and ["action"] = ½
- ["comedy", "action"] and ["action", "drama"] = ⅓
- ["comedy", "action"] and ["non-fiction", "biographical"] = 0
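Both rules are straightforward to implement. The sketch below codes the univalent exact-match rule and Jaccard similarity as intersection over union, and reproduces the calculations above:

```python
def univalent_similarity(a, b):
    """1 if two single-valued features match, else 0."""
    return 1.0 if a == b else 0.0

def jaccard_similarity(a, b):
    """|intersection| / |union| of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Reproduce the example calculations above.
print(jaccard_similarity(["comedy", "action"], ["comedy", "action"]))  # 1.0
print(jaccard_similarity(["comedy", "action"], ["action"]))            # 0.5
print(jaccard_similarity(["comedy", "action"], ["action", "drama"]))   # 0.333...
print(jaccard_similarity(["comedy", "action"],
                         ["non-fiction", "biographical"]))             # 0.0
```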
Jaccard similarity is not the only possible manual similarity measure for categorical data. Two other examples:
- Postal codes can be converted into latitude and longitude before calculating the Euclidean distance between them.
- Colors can be converted into numeric RGB values, with the differences between values combined into a Euclidean distance, as shown in the sketch after this list.
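As a minimal sketch of the color example, assume a small, hypothetical lookup table from color names to RGB values; the combined difference is then an ordinary Euclidean distance over the scaled channels:

```python
import math

# Hypothetical name-to-RGB lookup table; a real dataset would need
# a complete mapping for every color it contains.
RGB = {
    "white": (255, 255, 255),
    "blue":  (0, 0, 255),
    "navy":  (0, 0, 128),
}

def color_distance(name_a, name_b):
    """Euclidean distance between two colors, channels scaled to [0, 1]."""
    a = [c / 255 for c in RGB[name_a]]
    b = [c / 255 for c in RGB[name_b]]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(color_distance("blue", "navy"))   # small distance: similar colors
print(color_distance("blue", "white"))  # large distance: dissimilar colors
```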
See Working with categorical data for more.
In general, a manual similarity measure must correspond directly to actual similarity. If your chosen metric does not, it isn't encoding the information you want it to encode.
Pre-process your data carefully before calculating a similarity measure. The examples on this page are simplified; most real-world datasets are large and complex. As mentioned earlier, quantiles are a good default choice for processing numeric data.
As the complexity of the data increases, it becomes harder to create a manual similarity measure. In that situation, switch to a **supervised similarity measure**, in which a supervised machine learning model calculates similarity. This is discussed in more detail later.
[null,null,["最后更新时间 (UTC):2025-02-25。"],[[["\u003cp\u003eK-means clustering assigns data points to the nearest centroid based on a similarity measure.\u003c/p\u003e\n"],["\u003cp\u003eA manual similarity measure combines all feature data into a single numeric value for comparison.\u003c/p\u003e\n"],["\u003cp\u003eNumerical features can be scaled and combined using RMSE, while categorical features can use Jaccard similarity or other methods.\u003c/p\u003e\n"],["\u003cp\u003eA good similarity measure accurately reflects real-world similarity and often requires careful data preprocessing.\u003c/p\u003e\n"],["\u003cp\u003eFor complex data, a supervised similarity measure using a machine learning model may be more suitable.\u003c/p\u003e\n"]]],[],null,["# Manual similarity measure\n\nAs just shown, k-means assigns points to their closest centroid. But what does\n\"closest\" mean?\n\nTo apply k-means to feature data, you will need to define a measure of\nsimilarity that combines all the feature data into a single numeric value,\ncalled a **manual similarity measure**.\n\nConsider a shoe dataset. If that dataset has shoe size as its only feature,\nyou can define the similarity of two shoes in terms of the difference between\ntheir sizes. The smaller the numerical difference between sizes, the greater the\nsimilarity between shoes.\n\nIf that shoe dataset had two numeric features, size and price, you can combine\nthem into a single number representing similarity. First scale the data so\nboth features are comparable:\n\n- Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.\n- Price (p): The data is probably a Poisson distribution. Confirm this. If you have enough data, convert the data to quantiles and scale to \\\\(\\[0,1\\]\\\\).\n\nNext, combine the two features by calculating the\n[root mean squared error](/machine-learning/glossary#RMSE) (RMSE).\nThis rough measure of similarity is given by\n\\\\(\\\\sqrt{\\\\frac{(s_i - s_j)\\^2+(p_i - p_j)\\^2}{2}}\\\\).\n\nFor a simple example, calculate similarity for two shoes with US sizes\n8 and 11, and prices 120 and 150. Since we don't have enough data to understand\nthe distribution, we'll scale the data without normalizing or using\nquantiles.\n\n| Action | Method |\n|-------------------------------|--------------------------------------------------------------------------------------------------------|\n| Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55. |\n| Scale the price. | Divide 120 and 150 by the maximum price 150 to get 0.8 and 1. |\n| Find the difference in size. | \\\\(0.55 - 0.4 = 0.15\\\\) |\n| Find the difference in price. | \\\\(1 - 0.8 = 0.2\\\\) |\n| Calculate the RMSE. | \\\\(\\\\sqrt{\\\\frac{0.2\\^2+0.15\\^2}{2}} = 0.17\\\\) |\n\nIntuitively, your similarity measure should increase when feature data is more\nsimilar. Instead, your similarity measure (RMSE) actually decreases. Make your\nsimilarity measure follow your intuition by subtracting it from 1.\n\n\\\\\\[\\\\text{Similarity} = 1 - 0.17 = 0.83\\\\\\]\n\nIn general, you can prepare numerical data as described in\n[Prepare data](/machine-learning/clustering/prepare-data), then combine the\ndata by using Euclidean distance.\n\nWhat if that dataset included both shoe size and shoe color? 
Color is\n[categorical data](/machine-learning/glossary#categorical_data),\ndiscussed in Machine Learning Crash Course in\n[Working with categorical data](/machine-learning/crash-course/categorical-data).\nCategorical data is harder to combine with the numerical size data. It can be:\n\n- Single-valued (univalent), such as a car's color (\"white\" or \"blue\" but never both)\n- Multi-valued (multivalent), such as a movie's genre (a movie can be both \"action\" and \"comedy,\" or only \"action\")\n\nIf univalent data matches, for example in the case of two pairs of blue shoes,\nthe similarity between the examples is 1. Otherwise, similarity is 0.\n\nMultivalent data, like movie genres, is harder to work with. If there are a\nfixed set of movie genres, similarity can be calculated using the ratio of\ncommon values, called\n[**Jaccard similarity**](https://wikipedia.org/wiki/Jaccard_index). Example\ncalculations of Jaccard similarity:\n\n- \\[\"comedy\",\"action\"\\] and \\[\"comedy\",\"action\"\\] = 1\n- \\[\"comedy\",\"action\"\\] and \\[\"action\"\\] = ½\n- \\[\"comedy\",\"action\"\\] and \\[\"action\", \"drama\"\\] = ⅓\n- \\[\"comedy\",\"action\"\\] and \\[\"non-fiction\",\"biographical\"\\] = 0\n\nJaccard similarity is not the only possible manual similarity measure for\ncategorical data. Two other examples:\n\n- **Postal codes** can be converted into latitude and longitude before calculating Euclidean distance between them.\n- **Color** can be converted into numeric RGB values, with differences in values combined into Euclidean distance.\n\nSee [Working with categorical data](/machine-learning/crash-course/categorical-data)\nfor more.\n\nIn general, a manual similarity measure must directly correspond\nto actual similarity. If your chosen metric does not, then it isn't encoding the\ninformation you want it to encode.\n\nPre-process your data carefully before calculating a similarity measure. The\nexamples on this page are simplified. Most real-world datasets are large\nand complex. As previously mentioned, quantiles are a good default choice\nfor processing numeric data.\n\nAs the complexity of data increases, it becomes harder to create a manual\nsimilarity measure. In that situation, switch to a\n**supervised similarity measure**, where a supervised machine\nlearning model calculates similarity. This will be discussed in more detail\nlater.\n\n| **Key terms:**\n|\n| \u003cbr /\u003e\n|\n| - [categorical data](/machine-learning/glossary#categorical_data)\n| - [Jaccard similarity](https://wikipedia.org/wiki/Jaccard_index)\n|\n| \u003cbr /\u003e\n|\n\u003cbr /\u003e"]]