Datasets: Transforming data
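Machine learning models can only train on floating-point values. However, many dataset features are not naturally floating-point values. Therefore, one important part of machine learning is transforming non-floating-point features to floating-point representations.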
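For example, suppose street names is a feature. Most street names are strings, such as "Broadway" or "Vilakazi". Your model can't train on the string "Broadway", so you must transform "Broadway" to a floating-point representation. The Categorical Data module explains how to do this.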
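One common way to turn strings into floats is one-hot encoding. The following is a minimal sketch under that assumption, not necessarily the method the Categorical Data module prescribes; the sample data and the names `vocabulary`, `index_of`, and `one_hot` are hypothetical.

```python
# Hypothetical sample of a "street names" feature.
street_names = ["Broadway", "Vilakazi", "Broadway", "Abbey Road"]

# Sort the distinct names so each string gets a stable index.
vocabulary = sorted(set(street_names))
index_of = {name: i for i, name in enumerate(vocabulary)}


def one_hot(name):
    """Return a list of floats: 1.0 at the index of `name`, 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[index_of[name]] = 1.0
    return vector


# Every example is now a vector of floating-point values.
encoded = [one_hot(name) for name in street_names]
```

Each street name becomes a vector with one element per distinct name, so this sketch only suits features with a modest number of distinct values.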
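Additionally, you should transform even most floating-point features. This transformation process, called normalization, converts floating-point numbers to a constrained range that improves model training. The Numerical Data module explains how to do this.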
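As an illustration of mapping values to a constrained range, the sketch below shows two widely used normalization techniques: min-max scaling (to the range 0 to 1) and Z-score scaling. The text above does not say which technique to use, so both choices, the function names, and the sample values are assumptions.

```python
import statistics

# Hypothetical raw feature values.
values = [2.0, 4.0, 6.0, 8.0]


def min_max_scale(xs):
    """Linearly rescale values into the range [0.0, 1.0]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def z_score_scale(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    if std == 0:
        return [0.0 for _ in xs]
    return [(x - mean) / std for x in xs]


scaled = min_max_scale(values)
standardized = z_score_scale(values)
```

Min-max scaling guarantees a fixed range but is sensitive to outliers; Z-score scaling does not bound the output but centers it around zero.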
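Sample data when you have too much of it
Some organizations are blessed with an abundance of data.
When the dataset contains too many examples, you must select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions.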
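The selection step can be sketched as follows: keep only the examples that a relevance test accepts, then draw a random sample if too many remain. This is a minimal sketch; the function `select_subset`, its parameters, and the `year` field are hypothetical, and what counts as "relevant" depends entirely on the problem.

```python
import random


def select_subset(examples, max_examples, is_relevant=None, seed=0):
    """Keep relevant examples, then randomly sample down to max_examples."""
    candidates = [
        e for e in examples if is_relevant is None or is_relevant(e)
    ]
    if len(candidates) <= max_examples:
        return candidates
    # A fixed seed makes the sample reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(candidates, max_examples)


# Hypothetical dataset: 1,000 examples spread over ten years.
examples = [{"id": i, "year": 2015 + i % 10} for i in range(1000)]

# Assume recent examples are the most relevant to the predictions.
subset = select_subset(
    examples, max_examples=100, is_relevant=lambda e: e["year"] >= 2020
)
```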
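Filter examples containing PII
Good datasets omit examples containing Personally Identifiable Information (PII). This policy helps safeguard privacy but can influence the model.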
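As a rough illustration of filtering, the sketch below drops any example whose string fields appear to contain an email address. The regular expression is a simple heuristic and the names `EMAIL_PATTERN`, `contains_pii`, and `drop_pii_examples` are hypothetical; real PII detection covers many more kinds of data (names, phone numbers, addresses, and so on) and is considerably harder than this.

```python
import re

# Heuristic pattern for email-like strings; an assumption, not a full
# PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_pii(example):
    """Return True if any string field looks like it holds an email address."""
    return any(
        isinstance(value, str) and EMAIL_PATTERN.search(value) is not None
        for value in example.values()
    )


def drop_pii_examples(examples):
    """Return only the examples in which no PII was detected."""
    return [e for e in examples if not contains_pii(e)]


# Hypothetical examples.
examples = [
    {"comment": "Great service on Broadway", "rating": 5.0},
    {"comment": "Contact me at jane@example.com", "rating": 4.0},
]
filtered = drop_pii_examples(examples)
```

Comparing `len(examples)` with `len(filtered)` shows how much data the filter removed, which is a first step toward judging how the omission might influence the model.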
See the Safety and Privacy module later in the course for more on these topics.
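Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-07-27 UTC.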