Datasets: Transforming data

AI-generated Key Takeaways

Machine learning models require all data, including features like street names, to be transformed into numerical (floating-point) representations for training.
Normalization is crucial for optimizing model training by converting existing floating-point features to a specific range.
When dealing with large datasets, selecting a relevant subset of data for training is essential for model performance.
Protecting user privacy by excluding Personally Identifiable Information (PII) from datasets is a critical consideration.

Machine learning models can only train on floating-point values. However, many dataset features are not naturally floating-point values. Therefore, one important part of machine learning is transforming non-floating-point features to floating-point representations.

For example, suppose street name is a feature. Most street names are strings, such as "Broadway" or "Vilakazi". Your model can't train on "Broadway", so you must transform "Broadway" to a floating-point representation. The Categorical Data module explains how to do this.
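As a rough illustration of what such a transformation can look like, the sketch below uses one-hot encoding, one common technique for categorical features. The course may cover other approaches. The helper names (build_vocabulary, one_hot) and the extra street name "Main" are hypothetical and used only for illustration.

```python
def build_vocabulary(values):
    """Map each distinct string to an integer index, in sorted order."""
    return {name: i for i, name in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a list of floats: 1.0 at the value's index, 0.0 elsewhere."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector


# "Main" is a made-up street name added so the vocabulary has three entries.
street_names = ["Broadway", "Vilakazi", "Broadway", "Main"]
vocabulary = build_vocabulary(street_names)
encoded = [one_hot(name, vocabulary) for name in street_names]

print(vocabulary)
print(encoded[0])
```

Each street name becomes a vector of floating-point values that a model can consume. This sketch assumes every value it encodes already appears in the vocabulary.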

Additionally, you should transform even most floating-point features. This transformation process, called normalization, converts floating-point numbers to a constrained range that improves model training. The Numerical Data module explains how to do this.
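One simple way to constrain a feature's range is min-max scaling, which maps values linearly onto the interval from 0 to 1. The sketch below is a minimal example under that assumption. It is only one of several normalization techniques, and the function name min_max_scale and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up raw feature values on an arbitrary scale.
raw = [50.0, 100.0, 200.0]
scaled = min_max_scale(raw)

print(scaled)
```

In practice, the minimum and maximum would typically be computed from the training data and then reused when transforming new examples.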

Sample data when you have too much of it

Some organizations are blessed with an abundance of data. When the dataset contains too many examples, you must select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions.
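The sketch below shows one possible way to select such a subset: optionally keep only examples that pass a relevance check, then draw a uniform random sample of a fixed size. The function sample_examples, the is_relevant predicate, and the integer "examples" are all hypothetical. What counts as relevant depends entirely on the problem.

```python
import random


def sample_examples(examples, k, is_relevant=None, seed=0):
    """Return up to k examples, optionally restricted to relevant ones."""
    # Keep every example when no relevance predicate is supplied.
    pool = [e for e in examples if is_relevant is None or is_relevant(e)]
    if k >= len(pool):
        return pool
    # A seeded generator makes the sample reproducible across runs.
    return random.Random(seed).sample(pool, k)


# Stand-in dataset: integers play the role of examples.
examples = list(range(1000))
subset = sample_examples(examples, 100, is_relevant=lambda e: e % 2 == 0)

print(len(subset))
```

Fixing the seed keeps the subset stable between runs, which makes experiments easier to compare.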

Filter examples containing PII

Good datasets omit examples containing Personally Identifiable Information (PII). This policy helps safeguard privacy but can influence the model, because removing examples changes the data that the model learns from.
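As a deliberately simplified illustration, the sketch below drops text examples that appear to contain an email address. Real PII detection covers many more categories (names, phone numbers, addresses, and so on) and usually needs dedicated tooling. The pattern, the function names, and the sample strings here are assumptions made for this example.

```python
import re

# A rough pattern for email-like strings. It is not a complete PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_pii(text):
    """Return True if the text appears to contain an email address."""
    return EMAIL_PATTERN.search(text) is not None


def filter_pii(examples):
    """Keep only the examples in which no email-like string was found."""
    return [e for e in examples if not contains_pii(e)]


# Made-up text examples.
examples = [
    "The park opens at 9am.",
    "Contact jane.doe@example.com for details.",
    "Traffic was heavy on Broadway.",
]
clean = filter_pii(examples)

print(clean)
```

Comparing the dataset before and after filtering is one way to check how much the filter changes the distribution of examples.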

See the Safety and Privacy module later in the course for more on these topics.
