Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data that gives the model a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

From Raw Data to Features
The idea is to map each part of the raw data into one or more fields of the feature vector.
- A dictionary maps each street name to an int in {0, ..., V-1}
- Now represent the one-hot vector sparsely as just ⟨i⟩, the index of its single nonzero entry
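A minimal sketch of this encoding in Python (the street names and vocabulary are hypothetical):

```python
# Build a dictionary mapping each street name to an int in {0, ..., V-1},
# then represent the one-hot vector sparsely as just the index i.

streets = ["main_st", "elm_st", "oak_ave"]           # vocabulary of size V = 3
vocab = {name: i for i, name in enumerate(streets)}  # name -> int in {0..V-1}

def one_hot(name, vocab):
    """Dense one-hot vector: 1.0 at the street's index, 0.0 elsewhere."""
    vec = [0.0] * len(vocab)
    vec[vocab[name]] = 1.0
    return vec

def sparse_index(name, vocab):
    """Sparse representation: just the index of the single nonzero entry."""
    return vocab[name]

print(one_hot("elm_st", vocab))       # [0.0, 1.0, 0.0]
print(sparse_index("elm_st", vocab))  # 1
```

The sparse form matters in practice: with thousands of street names, storing one int per example is far cheaper than storing a mostly-zero vector of length V.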
Properties of a Good Feature
Feature values should appear with a non-zero value more than a small handful of times in the dataset.
Bad: my_device_id:8SK982ZZ1242Z (rarely repeated)
Good: device_model:galaxy_s6 (shared by many examples)
Properties of a Good Feature
Features should have a clear, obvious meaning.
Good: user_age:23 (an age in years)
Bad: user_age:123456789 (meaning unclear)
Properties of a Good Feature
Features shouldn't take on "magic" values.
(Use an additional boolean feature like watch_time_is_defined instead!)
Bad: watch_time: -1.0 (magic value standing in for "not recorded")
Good: watch_time: 1.023
      watch_time_is_defined: 1.0
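A sketch of the watch_time fix above, assuming -1.0 is the magic "not recorded" value:

```python
# Split one magic-valued field into (watch_time, watch_time_is_defined)
# so the model never has to learn a special meaning for -1.0.

def encode_watch_time(raw):
    """Return a clean watch_time plus a boolean indicator feature."""
    if raw == -1.0:   # magic value: watch time was never recorded
        return {"watch_time": 0.0, "watch_time_is_defined": 0.0}
    return {"watch_time": raw, "watch_time_is_defined": 1.0}

print(encode_watch_time(1.023))  # {'watch_time': 1.023, 'watch_time_is_defined': 1.0}
print(encode_watch_time(-1.0))   # {'watch_time': 0.0, 'watch_time_is_defined': 0.0}
```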
Properties of a Good Feature
The definition of a feature shouldn't change over time.
(Beware of depending on other ML systems!)
Good: city_id:"br/sao_paulo" (stable)
Bad: inferred_city_cluster_id:219 (may shift whenever the upstream clustering model is retrained)
Properties of a Good Feature
The distribution should not have extreme outliers.
Ideally, all features are transformed to a similar range, like (-1, 1) or (0, 5).
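One common way to get both properties is to clip outliers and then linearly rescale; a sketch with made-up bounds:

```python
# Clip a feature into [lo, hi], then rescale to [0, 1].
# lo and hi are hypothetical dataset statistics (e.g. 1st/99th percentiles).

def clip_and_scale(x, lo, hi):
    """Clip x into [lo, hi], then linearly map that interval onto [0, 1]."""
    x = max(lo, min(hi, x))
    return (x - lo) / (hi - lo)

print(clip_and_scale(2.5, 0.0, 10.0))    # 0.25
print(clip_and_scale(500.0, 0.0, 10.0))  # extreme outlier clipped to 1.0
```

Clipping before scaling matters: without it, a single outlier like 500.0 would squash every normal value into a tiny sliver of the range.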
The Binning Trick
- Create several boolean bins, each mapping to a new unique feature
- Allows the model to fit a different value for each bin
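The binning trick can be sketched as follows (the boundaries are hypothetical, e.g. for a latitude feature):

```python
# Turn one numeric feature into several boolean bin features, so the
# model can fit a separate weight per bin instead of one linear slope.

BOUNDARIES = [32.0, 34.0, 36.0, 38.0, 40.0, 42.0]  # 6 boundaries -> 7 bins

def bin_feature(x, boundaries):
    """Return a boolean vector with a 1.0 in the bin containing x."""
    bins = [0.0] * (len(boundaries) + 1)
    i = sum(1 for b in boundaries if x >= b)  # index of x's bin
    bins[i] = 1.0
    return bins

print(bin_feature(37.7, BOUNDARIES))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```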
Good Habits
KNOW YOUR DATA
- Visualize: Plot histograms, rank most to least common.
- Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
- Monitor: Feature quantiles, number of examples over time?
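A quick sketch of these habits on a single feature column, using only the standard library (the values are made up):

```python
# "Know your data": rank values from most to least common, and compute
# feature quantiles worth monitoring over time.

from collections import Counter
import statistics

watch_times = [0.5, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 900.0]  # note the outlier

# Rank most to least common (a text stand-in for a histogram).
print(Counter(watch_times).most_common())

# Quartiles to monitor over releases; a drifting tail flags outliers
# or a silent change in how the feature is logged.
print(statistics.quantiles(watch_times, n=4))
```

Even this crude check surfaces the 900.0 outlier immediately: the quartiles sit near 1–3 while the maximum is two orders of magnitude larger.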