Numerical data: Conclusion
A machine learning (ML) model's health is determined by its data. Feed your
model healthy data and it will thrive; feed your model junk and its
predictions will be worthless.
Best practices for working with numerical data:
- Remember that your ML model interacts with the data in the feature vector,
not the data in the dataset.
- Normalize most numerical features.
- If your first normalization strategy doesn't succeed, consider a different
way to normalize your data.
- Binning, also referred to as bucketing, is sometimes better than normalizing.
- Decide what your data should look like, then write verification tests to
validate those expectations. For example:
  - The absolute value of latitude should never exceed 90. You can write a
    test to check whether any latitude value greater than 90 or less than -90
    appears in your data.
  - If your data is restricted to the state of Florida, you can write tests
    to check that the latitudes fall between 24 and 31, inclusive.
- Visualize your data with scatter plots and histograms. Look for
anomalies.
- Gather statistics not only on the entire dataset but also on smaller
subsets of the dataset. That's because aggregate statistics sometimes
obscure problems in smaller sections of a dataset.
- Document all your data transformations.
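The latitude checks described in the list above could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the helper name `find_invalid_latitudes` and the sample values are hypothetical, and the Florida bounds of 24 and 31 are taken from the example above.

```python
def find_invalid_latitudes(latitudes, min_lat=-90.0, max_lat=90.0):
    """Return (index, value) pairs for latitudes outside [min_lat, max_lat].

    NaN values are also reported, because every comparison with NaN is False.
    """
    return [
        (i, lat)
        for i, lat in enumerate(latitudes)
        if not (min_lat <= lat <= max_lat)
    ]


# Hypothetical sample data: 91.2 and -120.0 are impossible latitudes,
# and 35.0 is a valid latitude that lies outside Florida.
sample = [25.8, 30.4, 91.2, -120.0, 35.0]

# General check: the absolute value of latitude should never exceed 90.
global_errors = find_invalid_latitudes(sample)

# Florida-only check: latitudes should fall between 24 and 31, inclusive.
florida_errors = find_invalid_latitudes(sample, min_lat=24.0, max_lat=31.0)

print("Out of range globally:", global_errors)
print("Out of range for Florida:", florida_errors)
```

Returning the offending rows, rather than only a pass/fail flag, makes it easier to trace a bad value back to its source.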
Data is your most valuable resource, so treat it with care.
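As one illustration of why statistics on subsets matter, the sketch below compares a mean over a whole dataset with the means of its subsets. The sensor readings, the field names `sensor` and `temp_c`, and the helper `mean_by_group` are all made up for this example; it assumes one sensor mistakenly reported Fahrenheit values.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Compute the mean of value_key separately for each value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical temperature readings, nominally in degrees Celsius.
# Sensor "c" is assumed to have reported Fahrenheit by mistake.
rows = [
    {"sensor": "a", "temp_c": 20.0},
    {"sensor": "a", "temp_c": 22.0},
    {"sensor": "a", "temp_c": 21.0},
    {"sensor": "a", "temp_c": 21.0},
    {"sensor": "b", "temp_c": 19.0},
    {"sensor": "b", "temp_c": 21.0},
    {"sensor": "b", "temp_c": 20.0},
    {"sensor": "b", "temp_c": 20.0},
    {"sensor": "c", "temp_c": 68.0},
    {"sensor": "c", "temp_c": 70.0},
]

# The aggregate mean is a plausible warm-day temperature on its own.
overall_mean = mean(row["temp_c"] for row in rows)

# The per-sensor means show that one subset is far from the others.
per_sensor = mean_by_group(rows, "sensor", "temp_c")

print("Overall mean:", overall_mean)
print("Mean per sensor:", per_sensor)
```

In this made-up data, the overall mean alone would not obviously signal a problem, while the per-sensor breakdown isolates the subset that needs attention.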
What's next
Congratulations on finishing this module!
We encourage you to explore the various MLCC modules at your own pace and
according to your interests. If you'd like to follow a recommended order,
we suggest that you move to the following module next:
Representing categorical data.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies.
Last updated 2025-08-25 UTC.