Numerical data: Qualities of good numerical features

This unit has explored ways to map raw data into suitable feature vectors. Good numerical features share the qualities described in this section.

Clearly named

Each feature should have a clear, sensible, and obvious meaning to any human on the project. For example, the meaning of the following feature value is confusing:

Not recommended

house_age: 851472000

In contrast, the following feature name and value are far clearer:

Recommended

house_age_years: 27
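
As a rough illustration, the sketch below derives the clearer feature from the confusing one. It assumes the raw `house_age` value is an age in seconds and that a year has 365 days; the text states neither, and the helper name `seconds_to_years` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: leap years are ignored


def seconds_to_years(age_seconds: int) -> float:
    """Convert an age expressed in seconds to an age expressed in years."""
    return age_seconds / (SECONDS_PER_DAY * DAYS_PER_YEAR)


# Assumption: the raw value is an age in seconds.
raw_feature = {"house_age": 851472000}

# Put the unit in the feature name so the value needs no explanation.
clear_feature = {"house_age_years": seconds_to_years(raw_feature["house_age"])}
print(clear_feature)  # {'house_age_years': 27.0}
```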

Checked or tested before training

Although this module has devoted a lot of time to outliers, the topic is important enough to warrant one final mention. In some cases, bad data (rather than bad engineering choices) causes unclear values. For example, the following user_age_in_years came from a source that didn't check for appropriate values:

Not recommended

user_age_in_years: 224

A person cannot be 224 years old, but a person can be 24 years old:

Recommended

user_age_in_years: 24

Check your data!
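
One simple way to run such a check before training is a range test, sketched below. The bounds of 0 and 120 years and the helper name `find_invalid_ages` are illustrative assumptions, not values from this course.

```python
MIN_AGE_YEARS = 0    # hypothetical lower bound
MAX_AGE_YEARS = 120  # hypothetical upper bound


def find_invalid_ages(ages):
    """Return (index, value) pairs for ages outside the plausible range."""
    return [
        (i, age)
        for i, age in enumerate(ages)
        if not (MIN_AGE_YEARS <= age <= MAX_AGE_YEARS)
    ]


user_age_in_years = [24, 224, 31, -3]
invalid = find_invalid_ages(user_age_in_years)
print(invalid)  # [(1, 224), (3, -3)]
```

What to do with the flagged rows (drop, correct, or investigate the source) depends on the dataset; the sketch only locates them.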

Sensible

A "magic value" is a purposeful discontinuity in an otherwise continuous feature. For example, suppose a continuous feature named watch_time_in_seconds can hold any floating-point value between 0 and 30 but represents the absence of a measurement with the magic value -1:

Not recommended

watch_time_in_seconds: -1

A watch_time_in_seconds of -1 would force the model to try to figure out what it means to watch a movie backwards in time. The resulting model would probably not make good predictions.

A better technique is to create a separate Boolean feature that indicates whether or not a watch_time_in_seconds value is supplied. For example:

Recommended

watch_time_in_seconds: 4.82
is_watch_time_in_seconds_defined: True

watch_time_in_seconds: 0
is_watch_time_in_seconds_defined: False
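
The split into a value and a Boolean indicator might look like the following sketch. The helper name `split_magic_value` is hypothetical, and filling missing values with 0 simply mirrors the example above.

```python
MAGIC_MISSING = -1  # sentinel the raw data uses for "no measurement"


def split_magic_value(raw_watch_time):
    """Return (watch_time_in_seconds, is_watch_time_in_seconds_defined)."""
    if raw_watch_time == MAGIC_MISSING:
        # Replace the magic value with 0 and record that nothing was measured.
        return 0.0, False
    return float(raw_watch_time), True


for raw in [4.82, -1, 0]:
    value, is_defined = split_magic_value(raw)
    print(raw, "->", value, is_defined)
```

A genuinely measured watch time of 0 yields `(0.0, True)`, so the model can tell it apart from a missing measurement, which yields `(0.0, False)`.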

Now consider a discrete numerical feature whose values must belong to a finite set of values. In this case, when a value is missing, signify that missing value using a new value in the finite set. With a discrete feature, the model learns a different weight for each value, including a weight of its own for the value that signifies a missing measurement.
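
As a minimal sketch of this idea, the code below one-hot encodes a discrete feature and reserves an extra slot for missing values. One-hot encoding is only one common way to give each value its own weight. The `rating` feature, its value set, and the helper name `one_hot_rating` are hypothetical.

```python
RATING_VALUES = [1, 2, 3, 4, 5]  # hypothetical finite set of valid values
MISSING = "missing"              # the new value added to the finite set
VOCABULARY = RATING_VALUES + [MISSING]


def one_hot_rating(rating):
    """One-hot encode a rating; None maps to the dedicated 'missing' slot."""
    key = MISSING if rating is None else rating
    if key not in VOCABULARY:
        raise ValueError(f"unexpected rating: {rating!r}")
    return [1 if key == value else 0 for value in VOCABULARY]


print(one_hot_rating(3))     # [0, 0, 1, 0, 0, 0]
print(one_hot_rating(None))  # [0, 0, 0, 0, 0, 1]
```

Because the missing case occupies its own position in the vector, a linear model would learn a separate weight for it rather than treating it as one of the real values.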
