Datasets: Transforming data
Machine learning models can only train on floating-point values. However, many dataset features are not naturally floating-point values. Therefore, an important part of machine learning is transforming non-floating-point features into floating-point representations.
For example, suppose `street names` is a feature. Most street names are strings, such as "Broadway" or "Vilakazi". Your model can't train on "Broadway", so you must transform "Broadway" into a floating-point number. The Categorical Data module explains how to do this.
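One common way to do this is one-hot encoding. The following is a minimal sketch, not the module's own code; the vocabulary list and helper function are hypothetical, chosen only to illustrate the idea:

```python
# Minimal sketch: converting a categorical string feature into floating-point
# values via one-hot encoding. The vocabulary below is hypothetical.
vocabulary = ["Broadway", "Vilakazi", "Main Street"]

def one_hot_encode(street_name: str) -> list[float]:
    """Returns a one-hot floating-point vector for a street name."""
    return [1.0 if street_name == name else 0.0 for name in vocabulary]

print(one_hot_encode("Broadway"))  # [1.0, 0.0, 0.0]
```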
Additionally, you should transform most floating-point features as well. This transformation process, called normalization, converts floating-point numbers to a constrained range, which improves model training. The Numerical Data module explains how to do this.
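As a rough illustration of what "a constrained range" means, here is a minimal sketch of min-max scaling, one common normalization technique (the feature values are made up for the example):

```python
# Minimal sketch: min-max normalization rescales raw floating-point values
# into the range [0, 1]. The raw values below are illustrative only.
values = [120.0, 175.0, 310.0, 95.0]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # every value now lies in [0, 1]
```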
Sample data when you have too much of it
Some organizations have an abundance of data.
When a dataset contains too many examples, you must select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions.
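In the simplest case, when no single slice of the data is obviously more relevant, a uniform random sample is a reasonable starting point. This is a minimal sketch under that assumption; the dataset here is just a placeholder list:

```python
# Minimal sketch: drawing a random subset of examples when the full dataset
# is too large to train on. The dataset is a stand-in list of example IDs.
import random

dataset = list(range(1_000_000))   # placeholder for a large set of examples
sample_size = 10_000

random.seed(42)                    # fixed seed for reproducibility
training_subset = random.sample(dataset, sample_size)

print(len(training_subset))        # 10000
```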
Filter out examples containing PII
Good datasets omit examples that contain personally identifiable information (PII). This policy helps safeguard privacy but can influence the model.
To learn more about these topics, see the Safety and Privacy module later in the course.
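As a minimal sketch of the filtering step, assuming each example carries a hypothetical `contains_pii` flag (how PII is actually detected is outside this section's scope):

```python
# Minimal sketch: dropping examples flagged as containing PII before training.
# The records and the "contains_pii" field are hypothetical.
examples = [
    {"feature": 3.2, "contains_pii": False},
    {"feature": 1.8, "contains_pii": True},   # e.g. free text with an email address
    {"feature": 4.5, "contains_pii": False},
]

clean_examples = [ex for ex in examples if not ex["contains_pii"]]

print(len(clean_examples))  # 2
```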
[null,null,["上次更新時間:2025-07-27 (世界標準時間)。"],[[["\u003cp\u003eMachine learning models require all data, including features like street names, to be transformed into numerical (floating-point) representations for training.\u003c/p\u003e\n"],["\u003cp\u003eNormalization is crucial for optimizing model training by converting existing floating-point features to a specific range.\u003c/p\u003e\n"],["\u003cp\u003eWhen dealing with large datasets, selecting a relevant subset of data for training is essential for model performance.\u003c/p\u003e\n"],["\u003cp\u003eProtecting user privacy by excluding Personally Identifiable Information (PII) from datasets is a critical consideration.\u003c/p\u003e\n"]]],[],null,["# Datasets: Transforming data\n\nMachine learning models can only train on floating-point values.\nHowever, many dataset features are *not* naturally floating-point values.\nTherefore, one important part of machine learning is transforming\nnon-floating-point features to floating-point representations.\n\nFor example, suppose `street names` is a feature. Most street names\nare strings, such as \"Broadway\" or \"Vilakazi\".\nYour model can't train on \"Broadway\", so you must transform \"Broadway\"\nto a floating-point number. The [Categorical Data\nmodule](/machine-learning/crash-course/categorical-data)\nexplains how to do this.\n\nAdditionally, you should even transform most floating-point features.\nThis transformation process, called\n[**normalization**](/machine-learning/glossary#normalization), converts\nfloating-point numbers to a constrained range that improves model training.\nThe [Numerical Data\nmodule](/machine-learning/crash-course/numerical-data)\nexplains how to do this.\n\nSample data when you have too much of it\n----------------------------------------\n\nSome organizations are blessed with an abundance of data.\n\nWhen the dataset contains too many examples, you must select a *subset*\nof examples for training. When possible, select the subset that is most\nrelevant to your model's predictions.\n\nFilter examples containing PII\n------------------------------\n\nGood datasets omit examples containing Personally Identifiable Information\n(PII). This policy helps safeguard privacy but can influence the model.\n\nSee the Safety and Privacy module later in the course for more on these topics.\n| **Key terms:**\n|\n- [Normalization](/machine-learning/glossary#normalization) \n[Help Center](https://support.google.com/machinelearningeducation)"]]