# Data preparation

This section reviews the data preparation steps most relevant to clustering
from the [Working with numerical data](/machine-learning/crash-course/numerical-data)
module in Machine Learning Crash Course.

In clustering, you calculate the similarity between two examples by combining
all the feature data for those examples into a numeric value. This requires the
features to have the same scale, which can be accomplished by normalizing,
transforming, or creating quantiles. If you want to transform your data without
inspecting its distribution, you can default to quantiles.

Normalizing data
----------------

You can transform data for multiple features to the same scale by normalizing
the data.

### Z-scores

Whenever you see a dataset roughly shaped like a
[**Gaussian distribution**](https://wikipedia.org/wiki/Normal_distribution),
you should calculate [**z-scores**](/machine-learning/data-prep/transform/normalization)
for the data. Z-scores are the number of standard deviations a value is from
the mean. You can also use z-scores when the dataset isn't large enough for
quantiles.

See [Z-score scaling](/machine-learning/crash-course/numerical-data/normalization#z-score_scaling)
to review the steps.

Here is a visualization of two features of a dataset before and after
z-score scaling:

**Figure 1: A comparison of feature data before and after normalization.**

In the unnormalized dataset on the left, Feature 1 and Feature 2, respectively
graphed on the x and y axes, don't have the same scale. On the left, the red
example appears closer, or more similar, to blue than to yellow. On the right,
after z-score scaling, Feature 1 and Feature 2 have the same scale, and the red
example appears closer to the yellow example. The normalized dataset gives a
more accurate measure of similarity between points.

### Log transforms

When a dataset perfectly conforms to a
[**power law**](https://wikipedia.org/wiki/Power_law) distribution, where data
is heavily clumped at the lowest values, use a log transform. See
[Log scaling](/machine-learning/crash-course/numerical-data/normalization#log_scaling)
to review the steps.

Here is a visualization of a power-law dataset before and after a log transform:

**Figure 2: A power law distribution.**

**Figure 3: A log transform of Figure 2.**

Before log scaling (Figure 2), the red example appears more similar to yellow.
After log scaling (Figure 3), red appears more similar to blue.
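As a quick illustration of the two transforms above, here is a minimal NumPy
sketch. The feature values are synthetic and the variable names are
illustrative assumptions, not part of the course materials.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical features on very different scales (synthetic data for illustration).
feature_1 = rng.normal(loc=50, scale=10, size=1000)   # roughly Gaussian
feature_2 = rng.pareto(a=3.0, size=1000) + 1.0        # heavily clumped at low values

# Z-score scaling: how many standard deviations each value is from the mean.
feature_1_z = (feature_1 - feature_1.mean()) / feature_1.std()

# Log scaling for the power-law-shaped feature.
feature_2_log = np.log(feature_2)

print(f"feature_1 z-scores: mean={feature_1_z.mean():.2f}, std={feature_1_z.std():.2f}")
print(f"feature_2 range before log: {feature_2.min():.2f}..{feature_2.max():.2f}")
print(f"feature_2 range after log:  {feature_2_log.min():.2f}..{feature_2_log.max():.2f}")
```

After z-score scaling, the scaled feature has a mean of roughly 0 and a
standard deviation of roughly 1; the log transform compresses the long tail of
the power-law-shaped feature so both features end up on comparable scales.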
Quantiles
---------

Binning the data into quantiles works well when the dataset does not conform
to a known distribution. Take this dataset, for example:

**Figure 4: An uncategorizable distribution prior to any preprocessing.**

Intuitively, two examples are more similar if only a few examples fall between
them, irrespective of their values, and more dissimilar if many examples fall
between them. The visualization above makes it difficult to see the total
number of examples that fall between red and yellow, or between red and blue.

This understanding of similarity can be brought out by dividing the dataset
into **quantiles**, or intervals that each contain equal numbers of examples,
and assigning the quantile index to each example. See
[Quantile bucketing](/machine-learning/crash-course/numerical-data/binning#quantile_bucketing)
to review the steps.

Here is the previous distribution divided into quantiles, showing that red is
one quantile away from yellow and three quantiles away from blue:

**Figure 5: The distribution in Figure 4 after conversion into 20 quantiles.**

You can choose any number \(n\) of quantiles. However, for the quantiles to
meaningfully represent the underlying data, your dataset should have at least
\(10n\) examples. If you don't have enough data, normalize instead.

### Check your understanding

For the following questions, assume you have enough data to create quantiles.

#### Question one

How should you process the data distribution shown in the preceding graph?

**Create quantiles.**
Correct. Because the distribution does not match a standard data distribution,
you should default to creating quantiles.

**Normalize.**
You typically normalize data if:

- The data distribution is Gaussian.
- You have some insight into what the data represents in the real world that
  suggests the data shouldn't be transformed nonlinearly.

Neither case applies here. The data distribution isn't Gaussian because it
isn't symmetric. And you don't know what these values represent in the real
world.

**Log transform.**
This isn't a perfect power-law distribution, so don't use a log transform.

#### Question two

How would you process this data distribution?

**Normalize.**
Correct. This is a Gaussian distribution.

**Create quantiles.**
Incorrect. Since this is a Gaussian distribution, the preferred transform is
normalization.

**Log transform.**
Incorrect. Only apply a log transform to power-law distributions.

Missing data
------------

If your dataset has examples with missing values for a certain feature, but
those examples occur rarely, you can remove these examples. If those examples
occur frequently, you can either remove that feature altogether, or you can
predict the missing values from other examples using a machine learning model.
For example, you can
[impute missing numerical data](/machine-learning/crash-course/overfitting/data-characteristics#complete_vs_incomplete_examples)
by using a regression model trained on existing feature data.

**Note:** The problem of missing data is not specific to clustering. In
supervised learning, you can impute an "unknown" value to the feature. However,
you cannot impute an "unknown" value when designing a similarity measure,
because it's not possible to quantify the similarity between "unknown" and any
known value.

**Key terms:**

- [normalization](/machine-learning/glossary#normalization)
- [quantile](/machine-learning/glossary#quantile)
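To make the quantile bucketing described in the Quantiles section above
concrete, here is a minimal pandas sketch. The lognormal distribution, the
choice of 20 buckets, and all variable names are illustrative assumptions
rather than anything from the course.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# A synthetic, skewed distribution standing in for data with no known shape.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

n_quantiles = 20  # with 2,000 examples, this satisfies the "at least 10n examples" guideline
quantile_index = pd.qcut(values, q=n_quantiles, labels=False, duplicates="drop")

# Each example is now represented by its quantile index; the similarity of two
# examples can be judged by how many quantiles separate them.
print(values.head(3).round(2).tolist())
print(quantile_index.head(3).tolist())
```

With `labels=False`, `pd.qcut` returns the bucket index for each example, so
the gap between two indexes reflects how many examples fall between the two
values rather than the raw distance between the values themselves.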