Which of the following models' predictions have been affected
by selection bias?
A German handwriting-recognition smartphone app uses a model that frequently
misclassifies ß (Eszett) characters as B characters, because it was trained
on a corpus of American handwriting samples, mostly written in English.
This model was affected by a type of selection bias called
coverage bias: the training data (American English handwriting) was not
representative of the type of data provided by the model's target
audience (German handwriting).
Engineers built a model to predict the likelihood of a person
developing diabetes based on their daily food intake. The model
was trained on 10,000 "food diaries" collected from a randomly chosen
group of people worldwide representing a variety of different age groups, ethnic
backgrounds, and genders. However, when the model was deployed, it had very poor
accuracy. Engineers subsequently discovered that food diary participants
were reluctant to admit the true volume of unhealthy foods they ate,
and were more likely to document consumption of nutritious food
than less healthy snacks.
There is no selection bias in this model; participants who provided
training data were a representative sampling of users and were chosen randomly.
Instead, this model was affected by reporting bias. Ingestion
of unhealthy foods was reported at a much lower frequency than true real-world
occurrence.
Engineers at a company developed a model to predict staff turnover rates
(the percentage of employees quitting their jobs each year) based on data
collected from a survey sent to all employees. After several years
of use, engineers determined that the model underestimated turnover by more
than 20%. When conducting exit interviews with employees leaving the company,
they learned that more than 80% of people who were dissatisfied with their jobs
chose not to complete the survey, compared to a company-wide opt-out rate of 15%.
This model was affected by a type of selection bias called non-response
bias. People who were dissatisfied with their jobs were underrepresented
in the training data set because they opted out of the company-wide survey
at much higher rates than the entire employee population.
Engineers developing a movie-recommendation system hypothesized that
people who like horror movies will also like science-fiction movies. When
they trained a model on 50,000 users' watchlists, however, it showed no
such correlation between preferences for horror and for sci-fi;
instead it showed a strong correlation between preferences for horror
and for documentaries. This seemed odd to them, so they retrained the
model five more times using different hyperparameters. Their final
trained model showed a 70% correlation between preferences for horror
and for sci-fi, so they confidently released it into production.
There is no evidence of selection bias, but this model may have
instead been affected by experimenter's bias, as the engineers kept
iterating on their model until it confirmed their preexisting hypothesis.
Evaluating for Bias
A sarcasm-detection model
was trained on 80,000 text messages: 40,000 messages sent by adults (18 years
and older) and 40,000 messages sent by minors (less than 18 years old). The
model was then evaluated on a test set of 20,000 messages: 10,000 from adults
and 10,000 from minors. The following confusion matrices show the results for
each group (a positive prediction signifies a classification of "sarcastic";
a negative prediction signifies a classification of "not sarcastic"):
Adults
True Positives (TPs): 512
False Positives (FPs): 51
False Negatives (FNs): 36
True Negatives (TNs): 9401
$$\text{Precision} = \frac{TP}{TP+FP} = 0.909$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.934$$
Minors
True Positives (TPs): 2147
False Positives (FPs): 96
False Negatives (FNs): 2177
True Negatives (TNs): 5580
$$\text{Precision} = \frac{TP}{TP+FP} = 0.957$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.497$$
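The precision and recall figures above can be reproduced directly from the raw counts. The following is a minimal sketch using only the standard definitions; the function names are illustrative:

```python
# Precision and recall from raw confusion-matrix counts.
def precision(tp, fp):
    # Of all positive predictions, what fraction were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, what fraction were identified?
    return tp / (tp + fn)

# Adults: TP=512, FP=51, FN=36
print(round(precision(512, 51), 3))    # 0.909
print(round(recall(512, 36), 3))       # 0.934

# Minors: TP=2147, FP=96, FN=2177
print(round(precision(2147, 96), 3))   # 0.957
print(round(recall(2147, 2177), 3))    # 0.497
```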
Explore the options below.
Which of the following statements about the model's test-set
performance are true?
Overall, the model performs better on examples from adults
than on examples from minors.
The model achieves both precision and recall rates over
90% when detecting sarcasm in text messages from adults.
While the model achieves a slightly higher precision rate for
minors than adults, the recall rate is substantially lower
for minors, resulting in less reliable predictions for this
group.
The model fails to classify approximately 50% of
minors' sarcastic messages as "sarcastic."
The recall rate of 0.497 for minors indicates that
the model predicts "not sarcastic" for approximately
50% of minors' sarcastic texts.
Approximately 50% of messages sent by minors
are classified as "sarcastic" incorrectly.
The precision rate of 0.957 indicates that
over 95% of minors' messages classified as "sarcastic"
are actually sarcastic.
If we compare the number of messages from adults that are actually
sarcastic (TP+FN = 548)
with the number of messages that are actually
not sarcastic (TN + FP = 9452), we
see that "not sarcastic" labels outnumber
"sarcastic" labels by a ratio of approximately
17:1.
If we compare the number of messages from minors that are actually
sarcastic (TP+FN = 4324)
with the number of messages that are actually
not sarcastic (TN + FP = 5676), we
see that there is a 1.3:1 ratio of
"not sarcastic" labels to "sarcastic" labels. Given that
the distribution of labels between the two classes is quite
close to 50/50, this is not a class-imbalanced dataset.
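The label ratios above follow from the same counts: actual positives are TP + FN, and actual negatives are TN + FP. A quick sketch of the arithmetic (helper name is illustrative):

```python
# Ratio of actual "not sarcastic" labels to actual "sarcastic" labels.
def label_ratio(tp, fp, fn, tn):
    sarcastic = tp + fn        # actual positives
    not_sarcastic = tn + fp    # actual negatives
    return not_sarcastic / sarcastic

# Adults: 9452 not-sarcastic vs. 548 sarcastic -> roughly 17:1
print(round(label_ratio(512, 51, 36, 9401), 1))     # 17.2
# Minors: 5676 not-sarcastic vs. 4324 sarcastic -> roughly 1.3:1
print(round(label_ratio(2147, 96, 2177, 5580), 1))  # 1.3
```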
Explore the options below.
Engineers are working on retraining this model to address
inconsistencies in sarcasm-detection accuracy across age
demographics, but the model has already been released into
production. Which of the following stopgap strategies
will help mitigate errors in the model's predictions?
Restrict the model's usage to text messages sent by adults.
The model performs well on text messages from adults
(with precision and recall rates both above 90%), so restricting
its use to this group will sidestep the systematic errors in
classifying minors' text messages.
When the model predicts "not sarcastic"
for text messages sent by minors, adjust the output
so the model returns a value of "unsure" instead.
The precision rate for text messages sent by minors is high,
which means that when the model predicts "sarcastic" for
this group, it is nearly always correct.
The problem is that recall is very low for minors: the model
fails to identify sarcasm in approximately 50% of examples. Given
that the model's negative predictions for minors are no better
than random guesses, we can avoid these errors by not
providing a prediction in these cases.
Restrict the model's usage to text messages sent by minors.
The systematic errors in this model are specific to text
messages sent by minors. Restricting the model's use to the
group more susceptible to error would not help.
Adjust the model output so that it returns "sarcastic" for
all text messages sent by minors, regardless of
what the model originally predicted.
Always predicting "sarcastic" for minors' text messages would increase
the recall rate from 0.497 to 1.0, as the model would no longer
miss any sarcastic messages. However, this increase in
recall would come at the expense of precision: all the true negatives
would become false positives:
True Positives (TPs): 4324
False Positives (FPs): 5676
False Negatives (FNs): 0
True Negatives (TNs): 0
which would decrease the precision rate from 0.957 to
0.432. So, adding this calibration would change the
type of error but would not mitigate the magnitude of the error.
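The precision drop described above can be checked with the same arithmetic: forcing every prediction to "sarcastic" turns each actual positive into a true positive and each actual negative into a false positive. A sketch under those assumptions:

```python
# Effect of always predicting "sarcastic" for minors' messages.
tp, fp, fn, tn = 2147, 96, 2177, 5580  # original minors confusion matrix

new_tp = tp + fn   # 4324: every actual positive is now a true positive
new_fp = fp + tn   # 5676: every actual negative is now a false positive
new_fn = 0         # no message is ever labeled "not sarcastic"

precision_after = new_tp / (new_tp + new_fp)
recall_after = new_tp / (new_tp + new_fn)

print(round(precision_after, 3))  # 0.432 (down from 0.957)
print(recall_after)               # 1.0   (up from 0.497)
```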