The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:
Feature name | # of categories | Sample categories |
---|---|---|
snowed_today | 2 | True, False |
skill_level | 3 | Beginner, Practitioner, Expert |
season | 4 | Winter, Spring, Summer, Autumn |
day_of_week | 7 | Monday, Tuesday, Wednesday |
planet | 8 | Mercury, Venus, Earth |
When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. With a vocabulary encoding, the model treats each possible categorical value as a separate feature. During training, the model learns different weights for each category.
For example, suppose you are creating a model to predict a car's price based, in part, on a categorical feature named car_color. Perhaps red cars are worth more than green cars. Since manufacturers offer a limited number of exterior colors, car_color is a low-dimensional categorical feature. The following illustration suggests a vocabulary (possible values) for car_color:
"Red"
is not a floating-point number. You
must convert strings like "Red"
to floating-point numbers.
Index numbers
Machine learning models can only manipulate floating-point numbers. Therefore, you must convert each string to a unique index number, as in the following illustration:
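Here is a minimal sketch of such a string-to-index mapping in plain Python; the vocabulary ordering shown is only for illustration:

```python
# Map each category string in the vocabulary to a unique index number.
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"
]
color_to_index = {color: index for index, color in enumerate(car_color_vocabulary)}

print(color_to_index["Red"])    # 0
print(color_to_index["Black"])  # 5
```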
However, index numbers create a problem of their own: the model shouldn't treat "Black" (index number 5) as 5 times more meaningful than "Orange" (index number 1). Index numbers are arbitrary identifiers, not magnitudes. One-hot encoding, described next, avoids this problem.
One-hot encoding
The next step in building a vocabulary is to convert each index number to its one-hot encoding. In a one-hot encoding:
- Each category is represented by a vector (array) of N elements, where N is the number of categories. For example, if car_color has eight possible categories, then the one-hot vector representing car_color will have eight elements.
- Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.

For example, the following table shows the one-hot encoding for each category in car_color:
Feature | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |
---|---|---|---|---|---|---|---|---|
"Red" | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
"Orange" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
"Blue" | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
"Yellow" | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
"Green" | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
"Black" | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
"Purple" | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
"Brown" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
It is the one-hot vector, not the string or the index number, that becomes part of the feature vector. The model learns a separate weight for each element of the feature vector.
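Here is a minimal sketch of building such a one-hot vector in plain Python (not a library implementation; real pipelines typically rely on a framework utility):

```python
def one_hot(index, num_categories):
    """Return a one-hot vector with 1.0 at `index` and 0.0 elsewhere."""
    vector = [0.0] * num_categories
    vector[index] = 1.0
    return vector

# "Blue" has index 2 in the eight-color vocabulary.
print(one_hot(2, 8))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```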
The following illustration suggests the various transformations in the vocabulary representation:
Sparse representation
A feature whose values are predominantly zero (or empty) is termed a sparse feature. Many categorical features, such as car_color, tend to be sparse features. Sparse representation means storing the position of the 1.0 in a sparse vector. For example, the one-hot vector for "Blue" is:

[0, 0, 1, 0, 0, 0, 0, 0]

Since the 1 is in position 2 (when starting the count at 0), the sparse representation for the preceding one-hot vector is:

2
Notice that the sparse representation consumes far less memory than the eight-element one-hot vector. Importantly, the model must train on the one-hot vector, not the sparse representation.
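As a sketch, converting between the one-hot and sparse representations looks like this (plain Python lists; the function names are illustrative):

```python
def to_sparse(one_hot_vector):
    """Return the position of the 1.0 in a one-hot vector."""
    return one_hot_vector.index(1.0)

def to_one_hot(sparse_index, num_categories):
    """Expand a sparse index back into the full one-hot vector."""
    vector = [0.0] * num_categories
    vector[sparse_index] = 1.0
    return vector

blue_one_hot = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(to_sparse(blue_one_hot))  # 2
print(to_one_hot(2, 8))         # back to the eight-element one-hot vector
```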
Outliers in categorical data
Like numerical data, categorical data also contains outliers. Suppose car_color contains not only the popular colors, but also some rarely used outlier colors, such as "Mauve" or "Avocado".
Rather than giving each of these outlier colors a separate category, you
can lump them into a single "catch-all" category called out-of-vocabulary
(OOV). In other words, all the outlier colors are binned into a single
outlier bucket. The system learns a single weight for that outlier bucket.
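One common way to implement this is to reserve a single extra index as the OOV bucket. The following sketch assumes that scheme; the vocabulary and helper function are illustrative, not part of any particular library:

```python
car_color_vocabulary = [
    "Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"
]
OOV_INDEX = len(car_color_vocabulary)  # index 8 is the catch-all bucket

def color_to_index(color):
    """Map a color to its vocabulary index, or to the single OOV bucket."""
    try:
        return car_color_vocabulary.index(color)
    except ValueError:
        return OOV_INDEX

print(color_to_index("Blue"))     # 2
print(color_to_index("Mauve"))    # 8 (OOV)
print(color_to_index("Avocado"))  # 8 (OOV)
```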
Encoding high-dimensional categorical features
Some categorical features have a high number of dimensions, such as those in the following table:
Feature name | # of categories | Sample categories |
---|---|---|
words_in_english | ~500,000 | "happy", "walking" |
US_postal_codes | ~42,000 | "02114", "90301" |
last_names_in_Germany | ~850,000 | "Schmidt", "Schneider" |
When the number of categories is high, one-hot encoding is usually a bad choice. Embeddings, detailed in a separate Embeddings module, are usually a much better choice. Embeddings substantially reduce the number of dimensions, which benefits models in two important ways:
- The model typically trains faster.
- The trained model typically infers predictions more quickly. That is, the model has lower latency.
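As a rough sketch of the idea (embeddings themselves are covered in the Embeddings module), an embedding replaces an enormous one-hot vector with a short, learned vector of weights. The vocabulary size and embedding width below are illustrative assumptions:

```python
import numpy as np

vocabulary_size = 500_000   # e.g., words_in_english
embedding_width = 64        # a much smaller, learned representation

# In a real model these weights are learned during training;
# here they are randomly initialized purely for illustration.
embedding_table = np.random.normal(size=(vocabulary_size, embedding_width))

word_index = 12345                             # index of some word in the vocabulary
word_embedding = embedding_table[word_index]   # 64 numbers instead of 500,000
print(word_embedding.shape)                    # (64,)
```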
Hashing (also called the hashing trick) is a less common way to reduce the number of dimensions.
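For illustration, one way to hash category strings into a fixed number of buckets looks like the following sketch; the bucket count and the choice of hashlib are assumptions, not a prescribed recipe:

```python
import hashlib

NUM_BUCKETS = 1_000  # far fewer buckets than ~850,000 distinct last names

def hash_bucket(category, num_buckets=NUM_BUCKETS):
    """Deterministically hash a category string into one of a fixed number of buckets."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hash_bucket("Schmidt"))
print(hash_bucket("Schneider"))
# Different strings can collide in the same bucket; that's the tradeoff of hashing.
```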