For each of the following questions, choose your answer, then check it against the feedback given for each option.
You’re preprocessing data for a regression model. What
transformations are mandatory? Check all that apply.
Converting all non-numeric features into numeric features.
Correct. This is a mandatory transformation. You must convert strings
to some numeric representation because you can’t do matrix multiplication
on a string.
Normalizing numeric data.
Normalizing numeric data could help, but it’s an optional
quality transformation.
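As a concrete illustration of the mandatory transformation, here is a minimal sketch (with a hypothetical vocabulary and feature values) of converting a string feature into a numeric one-hot vector that a model can multiply:

```python
# Minimal sketch: map string categories to numeric features via
# one-hot encoding, since a model can't do matrix multiplication
# on a string. The vocabulary and values below are hypothetical.
def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` over a fixed vocabulary."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

vocabulary = ["brick", "wood", "stucco"]  # hypothetical string feature values
print(one_hot("wood", vocabulary))        # [0.0, 1.0, 0.0]
```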
Consider the chart below. Which data transformation technique would
likely be the most productive to start with and why? Assume your goal is to
find a linear relationship between roomsPerPerson and house price.
Z-score
Z-score is a good choice if the outliers aren’t extreme.
However, the outliers are extreme here.
Clipping
Clipping is a good choice here because the data set contains extreme
outliers. You should fix extreme outliers before applying other
normalizations.
Log Scaling
Log scaling is a good choice if your data conforms to a power law
distribution. However, this data conforms to a normal distribution
rather than a power law distribution.
Bucketing (binning) with quantile boundaries
Quantile bucketing can be a good approach for skewed data, but in this
case, this skew is due in part to a few extreme outliers. Also, you want
the model to learn a linear relationship. Therefore, you should keep
roomsPerPerson numeric rather than transform it to categories, which is
what bucketing does. Instead, try a normalization technique.
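The recommended order of operations above (clip extreme outliers first, then normalize) can be sketched as follows. The sample values and the clipping bounds are invented for illustration:

```python
# Sketch with hypothetical data: clip extreme outliers first, then
# apply z-score normalization to the clipped values.
def clip(values, lo, hi):
    """Cap every value to the range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

rooms_per_person = [0.5, 1.0, 1.2, 1.5, 2.0, 50.0]  # 50.0 is an extreme outlier
clipped = clip(rooms_per_person, 0.0, 4.0)           # outlier capped at 4.0
normalized = z_score(clipped)                        # now safe to z-score
```

Clipping first matters because an extreme outlier inflates the standard deviation, which would squash all the non-outlier z-scores toward zero.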
Consider the chart below. Which data transformation technique would
likely be the most productive to start with and why?
Z-score
Z-score is a good choice if the outliers aren’t so extreme that you
need clipping. That isn’t the issue here, though; the way the data is
skewed should be a hint that a different technique applies.
Clipping
Clipping is a good choice when there are extreme outliers. This chart,
however, is showing a power law distribution, and there’s another
normalization technique that’s better for addressing that.
Log Scaling
Log scaling is a good choice here because the data conforms to the
power law distribution.
Bucketing (binning) with quantile boundaries
Quantile bucketing can be a good approach for skewed data. However,
you want the model to learn a linear relationship, so you should keep
your data numeric and avoid putting it in buckets. Try a normalization
technique instead.
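Log scaling, the recommended technique for this power law distribution, can be sketched with invented values as follows:

```python
import math

# Sketch with hypothetical values: log scaling compresses a long
# power-law tail so a linear model isn't dominated by the few
# very large values.
def log_scale(values):
    """Replace each value with its natural log (values must be > 0)."""
    return [math.log(v) for v in values]

page_views = [1, 10, 100, 1000, 10000]  # hypothetical power-law-ish feature
scaled = log_scale(page_views)          # evenly spaced after scaling
```

After scaling, each order-of-magnitude step in the raw data becomes the same additive step (log 10), which is exactly the even spread a linear model can work with.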
Consider the chart below. Would a linear model make a good prediction
about the relationship between compression-ratio and city-mpg? If not, how
might you transform the data to better train the model?
Yes, the model would probably find a linear relationship and make pretty
accurate predictions.
While the model would find a linear relationship, it wouldn’t
make very accurate predictions. You can try training on this data set
in the Data Modeling exercise to better understand why.
No. The model would probably be more accurate after scaling.
You could apply linear scaling, but the slope of the relationship
between compression-ratio and city-mpg would look the same. What
would help you more is to see two separate slopes: one for the cluster
of points at lower compression-ratios and another for the cluster at
higher compression-ratios.
No. There seem to be two different behaviors happening. Setting a
threshold in the middle and using a bucketized feature might help you
better understand what's happening in those two areas.
Correct. It’s important to be clear about why and how you are setting
the boundaries. In the Data Modeling exercise, you’ll learn more about
exactly how this approach can help you create a better model.
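The thresholding idea in the correct answer can be sketched as follows. The boundary value is hypothetical, chosen only to illustrate splitting the two clusters:

```python
# Sketch with a hypothetical boundary: bucketize compression-ratio at a
# threshold so the model can learn a separate weight for each cluster.
def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into."""
    for i, boundary in enumerate(boundaries):
        if value < boundary:
            return i
    return len(boundaries)

THRESHOLD = 15.0  # hypothetical boundary between the two clusters
print(bucketize(9.0, [THRESHOLD]))   # 0 -> lower compression-ratio bucket
print(bucketize(21.5, [THRESHOLD]))  # 1 -> higher compression-ratio bucket
```

With a single boundary, the feature becomes a two-valued categorical, letting the model fit each cluster's behavior independently.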
A peer team is telling you about the progress they’ve made on their ML
project. They computed a vocabulary and trained a model offline. They want
to avoid staleness issues, however, so they’re now about to train a
different model online. What might happen next?
The model will stay up to date as new data arrives. The other team will
need to continually monitor the input data.
Although avoiding model staleness is the main benefit of dynamic
training, using a vocabulary with a model trained offline will lead to
problems.
They may find that the indices they’re using don’t correspond to the
vocabulary.
Correct. Warn your colleagues about the perils of training/serving
skew, and then recommend that they take Google’s course on Data
Preparation and Feature Engineering for ML to learn more.
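The failure mode behind this question can be made concrete with a small sketch. The vocabularies and words below are invented; the point is only that an index learned against one vocabulary silently refers to the wrong entry in another:

```python
# Sketch of training/serving skew: a model trained offline against one
# vocabulary is served against a vocabulary rebuilt online, so the same
# word maps to different indices. All values here are hypothetical.
offline_vocab = ["cat", "dog", "fish"]          # used at training time
online_vocab = ["dog", "fish", "cat", "bird"]   # rebuilt later, online

word = "dog"
train_index = offline_vocab.index(word)  # weight learned for index 1
serve_index = online_vocab.index(word)   # but serving looks up index 0
# The weight the model learned for "dog" no longer lines up with "dog".
```

Pinning one shared vocabulary artifact for both training and serving avoids this mismatch.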