Numerical data: Conclusion
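A machine learning (ML) model's health is determined by its data. Feed your model healthy data and it will thrive; feed your model junk and its predictions will be worthless.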
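Best practices for working with numerical data:
- Remember that your ML model interacts with the data in the feature vector, not the data in the dataset.
- Normalize most numerical features.
- If your first normalization strategy doesn't succeed, consider a different way to normalize your data.
- Binning (also called bucketing) is sometimes better than normalizing.
- Consider what your data should look like, and write verification tests to validate those expectations. For example:
  - The absolute value of latitude should never exceed 90. You can write a test that checks whether any latitude in your data is greater than 90 or less than -90.
  - If your data is restricted to the state of Florida, you can write tests to check that the latitudes fall between 24 and 31, inclusive.
- Visualize your data with scatter plots and histograms. Look for anomalies.
- Gather statistics not only on the entire dataset but also on smaller subsets of the dataset. Aggregate statistics sometimes obscure problems in smaller sections of a dataset.
- Document all your data transformations.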
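The latitude checks in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `find_invalid_latitudes` and the sample values are hypothetical and do not come from the course material.

```python
def find_invalid_latitudes(latitudes, low=-90.0, high=90.0):
    """Return (index, value) pairs for latitudes outside [low, high].

    The comparison is written so that NaN values are also reported,
    because any comparison involving NaN evaluates to False.
    """
    return [
        (i, lat)
        for i, lat in enumerate(latitudes)
        if not (low <= lat <= high)
    ]


# Hypothetical sample values, invented for illustration only.
sample = [25.76, 30.33, 91.2, 27.95, -95.0, 24.0, 31.0, 35.1]

# General check: the absolute value of latitude should never exceed 90.
global_violations = find_invalid_latitudes(sample)

# Florida-only check: latitudes should fall between 24 and 31, inclusive.
florida_violations = find_invalid_latitudes(sample, low=24.0, high=31.0)

print("Outside [-90, 90]:", global_violations)
print("Outside [24, 31]:", florida_violations)
```

Returning the offending indices and values, rather than a single pass/fail flag, makes it easier to trace a violation back to the row that caused it.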
Data is your most valuable resource, so treat it with care.
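To illustrate why the practices above recommend statistics on subsets as well as on the whole dataset, here is a minimal sketch using only the Python standard library. The sensor IDs, the readings, and the "constant values" heuristic are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sensor_id, temperature_celsius) readings, invented for illustration.
readings = [
    ("a", 21.0), ("a", 22.0), ("a", 23.0),
    ("b", 32.0), ("b", 33.0), ("b", 34.0),
    ("c", 0.0), ("c", 0.0), ("c", 0.0),  # constant zeros: possibly a broken sensor
]


def summarize(values):
    """Return basic summary statistics for a non-empty list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Group the readings by sensor so each subset can be summarized separately.
by_sensor = defaultdict(list)
for sensor_id, temperature in readings:
    by_sensor[sensor_id].append(temperature)

overall = summarize([temperature for _, temperature in readings])
per_sensor = {sid: summarize(vals) for sid, vals in by_sensor.items()}

# Assumed heuristic: a subset whose min equals its max never varies, which is suspicious.
constant_sensors = [sid for sid, s in per_sensor.items() if s["min"] == s["max"]]

print("Overall:", overall)
for sid, stats in per_sensor.items():
    print(sid, stats)
print("Sensors with constant readings:", constant_sensors)
```

In this made-up data the overall mean looks like a plausible temperature, while the per-sensor summary shows that one sensor reports nothing but zeros.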
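What's next
Congratulations on finishing this module!
We encourage you to explore the various MLCC modules at your own pace and interest. If you'd like to follow a recommended order, we suggest that you move to the following module next: Representing categorical data.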
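Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-10 UTC.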