Imagine that you have a dataset with a 1:1000
positive-negative ratio. Unfortunately, your model is always
predicting the majority class. What technique would best help
you deal with this problem? Note that you want the model to
report a calibrated probability.
Just downsample the negative examples.
That's a good start, but you'll alter the model's base rate,
so it is no longer calibrated.
Downsample the negative examples (the majority class). Then upweight
the downsampled class by the same factor.
This is an effective way to deal with imbalanced data while still
recovering the real distribution of labels. Note that this only
matters if you need the model to report a calibrated probability. If
it doesn't need to be calibrated, you don't need to worry about
changing the base rate.
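As a rough illustration of downsampling and upweighting, here is a minimal sketch using only the Python standard library. The function name `downsample_and_upweight`, the toy label list, and the factor of 100 are assumptions made for this example, not part of the original exercise. It keeps every positive, keeps about 1/factor of the negatives, and gives each kept negative a weight equal to the factor.

```python
import random

def downsample_and_upweight(labels, factor, seed=0):
    """Keep all positives and about 1/factor of the negatives.

    Returns (indices, weights). Each kept negative gets weight `factor`
    so the weighted label distribution matches the original one.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Downsample the majority (negative) class without replacement.
    kept_neg = rng.sample(neg_idx, len(neg_idx) // factor)
    indices = pos_idx + kept_neg
    # Upweight the downsampled class by the same factor.
    weights = [1.0] * len(pos_idx) + [float(factor)] * len(kept_neg)
    return indices, weights

# Toy data with a 1:1000 positive-negative ratio (hypothetical).
labels = [1] * 10 + [0] * 10_000
indices, weights = downsample_and_upweight(labels, factor=100)
kept = [labels[i] for i in indices]

original_rate = sum(labels) / len(labels)   # base rate in the full data
raw_rate = sum(kept) / len(kept)            # base rate after downsampling only
weighted_rate = sum(w * y for w, y in zip(weights, kept)) / sum(weights)
```

In this sketch `raw_rate` is much higher than `original_rate`, which is why downsampling alone breaks calibration, while `weighted_rate` matches `original_rate`. In practice the weights would typically be passed to the training loss as per-example weights.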
Which techniques lose data from the tail of a dataset? Check all that apply.
PII filtering
Filtering PII from your data can remove information in the tail, skewing your distribution.
Weighting
Example weighting changes the importance of different examples, but it doesn't lose information. In fact, adding weight to the tail examples can help your model learn how the tail behaves.
Downsampling
Downsampling discards examples, so information in the tail of the
feature distributions is lost. However, since we typically downsample
the majority class, this loss isn't usually a big problem.
Normalization
Normalization operates on individual examples, so it doesn't cause sampling bias.
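To make the contrast between downsampling and weighting concrete, the following small sketch counts distinct feature values before and after each technique. The data is hypothetical: one common value plus 200 rare "tail" values that each appear once. The weight of 5.0 is an arbitrary choice for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical majority-class feature: 1,800 examples share the value 0,
# and 200 tail values (1..200) each appear exactly once.
feature = [0] * 1800 + list(range(1, 201))

# Downsampling by a factor of 10 keeps only 200 of the 2,000 examples,
# so some of the 201 distinct values must disappear.
kept = rng.sample(feature, len(feature) // 10)

distinct_before = len(set(feature))
distinct_after = len(set(kept))

# Weighting keeps every example and only changes its importance.
weights = [1.0 if v == 0 else 5.0 for v in feature]
distinct_weighted = len({v for v, w in zip(feature, weights) if w > 0})
```

Here `distinct_after` is smaller than `distinct_before` because tail values were dropped, while `distinct_weighted` equals `distinct_before` because weighting removes nothing.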
You are working on a classification problem, and you randomly split the
data into training, evaluation, and testing sets. Your classifier looks
like it’s working perfectly! But in production, the classifier is a
total failure. You later discover that the problem was caused by the
random split. What kinds of data are susceptible to this problem?
Time series data
Random splitting divides each cluster across the test/train split,
providing a “sneak preview” to the model that won’t be available in
production.
Data that doesn't change much over time
If your data doesn't change very much over time, you'll have
better chances with a random split. For example, you might want
to identify the breed of dog in photos, or predict which patients
are at risk of a heart defect from past biometric data. In both
cases, the data generally doesn't change over time, so random
splitting shouldn't cause a problem.
Groupings of data
The test set will always be too similar to the training set because
clusters of similar data are in both sets. The model will appear to
have better predictive power than it does.
Data with burstiness (data arriving in intermittent bursts as
opposed to a continuous stream)
Clusters of similar data (the bursts) will show up in both
training and testing. The model will make better predictions in
testing than it will on new data.
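One common way to avoid these leaks is to split by group or by time instead of at random. The sketch below shows both ideas with the standard library only. The helper names `group_split` and `time_split`, the group labels, and the timestamps are all hypothetical, and the 20% test fraction is an arbitrary choice.

```python
import random

def group_split(groups, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so no group straddles the split."""
    rng = random.Random(seed)
    unique = sorted(set(groups))
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

def time_split(timestamps, test_fraction=0.2):
    """Train on the earliest examples and test on the latest ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = len(order) - max(1, round(len(order) * test_fraction))
    return order[:cut], order[cut:]

# Hypothetical group labels (for example, one label per burst or cluster).
groups = ["g1", "g1", "g2", "g3", "g3", "g4", "g5", "g5", "g5", "g2"]
train_g, test_g = group_split(groups)

# Hypothetical arrival times, one per example.
timestamps = [3, 1, 7, 2, 9, 4, 8, 6, 5, 10]
train_t, test_t = time_split(timestamps)
```

With `group_split`, every example from a given group lands on one side of the split, so the test set contains no near-duplicates of training data. With `time_split`, every test example is later than every training example, which more closely mirrors how the model will see data in production.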