This unit has explored ways to map raw data into suitable feature vectors. Good numerical features share the qualities described in this section.
Clearly named
Each feature should have a clear, sensible, and obvious meaning to any human on the project. For example, the meaning of the following feature value is confusing:
Not recommended
house_age: 851472000
In contrast, the following feature name and value are far clearer:
Recommended
house_age_years: 27
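As a minimal sketch of the renaming above, the conversion might look like the following. It assumes the raw value 851472000 is an age expressed in seconds and that a year is counted as 365 days; the source states neither, and the helper name seconds_to_years is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day years

def seconds_to_years(age_seconds):
    """Convert an age in seconds to whole years (hypothetical helper)."""
    return age_seconds // SECONDS_PER_YEAR

raw = {"house_age": 851472000}  # assumed to be seconds
clean = {"house_age_years": seconds_to_years(raw["house_age"])}
print(clean)  # {'house_age_years': 27}
```

Putting the unit in the feature name means nobody has to guess how the number was measured.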
Checked or tested before training
Although this module has devoted a lot of time to outliers, the topic is important enough to warrant one final mention. In some cases, bad data (rather than bad engineering choices) causes unclear values. For example, the following user_age_in_years came from a source that didn't check for appropriate values:
Not recommended
user_age_in_years: 224
But people can be 24 years old:
Recommended
user_age_in_years: 24
Check your data!
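One simple way to check data before training is a range test. The sketch below is illustrative only: the bounds 0 and 120 and the function name find_out_of_range are assumptions, so pick limits that fit your own data.

```python
def find_out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]

# Hypothetical bounds for a human age in years.
MIN_AGE, MAX_AGE = 0, 120

user_age_in_years = [24, 224, 31, 57]
bad = find_out_of_range(user_age_in_years, MIN_AGE, MAX_AGE)
print(bad)  # [(1, 224)]
```

Flagged rows can then be corrected at the source, dropped, or treated as missing, depending on the project.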
Sensible
A "magic value" is a purposeful discontinuity in an otherwise continuous
feature. For example, suppose a continuous feature named watch_time_in_seconds
can hold any floating-point value between 0 and 30 but represents the absence
of a measurement with the magic value -1:
Not recommended
watch_time_in_seconds: -1
A watch_time_in_seconds of -1 would force the model to try to figure out what it means to watch a movie backwards in time. The resulting model would probably not make good predictions.
A better technique is to create a separate Boolean feature that indicates whether or not a watch_time_in_seconds value is supplied. For example:
Recommended
watch_time_in_seconds: 4.82
is_watch_time_in_seconds_defined=True
watch_time_in_seconds: 0
is_watch_time_in_seconds_defined=False
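The split into a value and a Boolean indicator can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name split_magic_value is hypothetical, and using 0 as the placeholder for a missing value follows the example above.

```python
MAGIC_MISSING = -1  # the magic value that marks "no measurement"

def split_magic_value(watch_time_in_seconds):
    """Return (value, is_defined), replacing the magic value with 0."""
    if watch_time_in_seconds == MAGIC_MISSING:
        return 0.0, False
    return float(watch_time_in_seconds), True

for raw in (4.82, -1):
    value, is_defined = split_magic_value(raw)
    print(f"watch_time_in_seconds: {value}")
    print(f"is_watch_time_in_seconds_defined={is_defined}")
```

The model then receives two features per example, so a placeholder 0 can be told apart from a genuine watch time of 0 seconds.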
Now consider a discrete numerical feature whose values must belong to a finite set. In this case, signify a missing value by adding a new value to that set. With a discrete feature, the model learns a different weight for each value, including a separate weight for the value that stands for missing data.
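For a discrete feature, the extra "missing" value can be sketched with a one-hot encoding. Everything specific here is an assumption made for illustration: the feature (a bedroom count limited to 1 through 4), the "missing" label, and the helper one_hot.

```python
# Hypothetical discrete feature: number of bedrooms, limited to 1-4.
KNOWN_VALUES = [1, 2, 3, 4]
MISSING = "missing"  # the new value added to the finite set
VOCABULARY = KNOWN_VALUES + [MISSING]

def one_hot(value):
    """One-hot encode a value; None is encoded as the MISSING entry."""
    key = MISSING if value is None else value
    return [1 if key == v else 0 for v in VOCABULARY]

print(one_hot(3))     # [0, 0, 1, 0, 0]
print(one_hot(None))  # [0, 0, 0, 0, 1]
```

Because the missing entry has its own position in the vector, a model trained on this encoding can learn a weight for it just as it does for the other values.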