Fairness: Test Your Knowledge

Question 1
True or false: Historical bias occurs when a model is trained on old data.

- True
- False

Question 2
Engineers are training a regression model to predict the calorie content of meals based on feature data they've scraped from recipe websites around the world, including serving size, ingredients, and preparation techniques. Which of the following data issues are potential sources of bias that should be investigated further? Choose as many answers as you see fit.

- Approximately 4,000 of the 40,000 training examples were missing a value for the feature "serving size".
- Approximately 5,000 of the training examples had measurements in imperial units (ounces, pounds, etc.), whereas the other 35,000 examples had measurements in metric units (grams, liters, etc.).
- Approximately 100 of the 40,000 training examples had ingredient values that seemed highly likely to be incorrect (e.g., 100 sticks of butter).
- Some popular meals were underrepresented in the training data relative to other popular meals (e.g., there were 200 training examples for dosa, but only 10 for pizza).

Question 3
A sarcasm-detection model was trained on 80,000 text messages: 40,000 sent by adults (18 years and older) and 40,000 sent by minors (under 18 years old). The model was then evaluated on a test set of 20,000 messages: 10,000 from adults and 10,000 from minors. The following confusion matrices show the results for each group (a positive prediction signifies a classification of "sarcastic"; a negative prediction signifies a classification of "not sarcastic"):

Adults
- True positives (TPs): 512
- False positives (FPs): 51
- False negatives (FNs): 36
- True negatives (TNs): 9,401
- Precision = TP / (TP + FP) = 0.909
- Recall = TP / (TP + FN) = 0.934

Minors
- True positives (TPs): 2,147
- False positives (FPs): 96
- False negatives (FNs): 2,177
- True negatives (TNs): 5,580
- Precision = TP / (TP + FP) = 0.957
- Recall = TP / (TP + FN) = 0.497

(These metrics are recomputed in the code sketch following Question 4.)

Which of the following statements about the model's test-set performance are true? Choose as many answers as you see fit.

- The model performs better on examples from adults than on examples from minors.
- The 10,000 messages sent by adults are a class-imbalanced dataset.
- The 10,000 messages sent by minors are a class-imbalanced dataset.
- Approximately 50% of messages sent by minors are incorrectly classified as "sarcastic."
- The model fails to classify approximately 50% of minors' sarcastic messages as "sarcastic."

Question 4
Which of the following hypotheses could explain the discrepancies in subgroup performance on the test set for the sarcasm-detection model above? Choose as many answers as you see fit.

- The model errs too much on the side of predicting "sarcastic." As a result, it makes more errors when classifying minors' text messages, because there are more sarcastic messages from minors in the test set.
- The model was evaluated on more negative (not-sarcastic) examples from minors than from adults, resulting in more errors for minors.
- Sarcasm in minors' text messages was more subtle, and thus less likely to be flagged by the model.
- There are far fewer actual sarcastic messages from adults than from minors. If the model were evaluated on a more class-balanced set of adult messages, its recall might drop for that subgroup.
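To make the arithmetic behind Question 3 concrete, here is a minimal Python sketch that recomputes precision, recall, and each subgroup's positive-class share from the confusion-matrix counts above (the function and variable names are illustrative, not part of the quiz):

```python
def precision(tp, fp):
    """Fraction of positive ("sarcastic") predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives the model finds: TP / (TP + FN)."""
    return tp / (tp + fn)

# Confusion-matrix counts from Question 3.
groups = {
    "adults": {"tp": 512, "fp": 51, "fn": 36, "tn": 9401},
    "minors": {"tp": 2147, "fp": 96, "fn": 2177, "tn": 5580},
}

for name, m in groups.items():
    total = sum(m.values())  # 10,000 messages per subgroup
    positive_share = (m["tp"] + m["fn"]) / total  # share of truly sarcastic messages
    print(f"{name}: precision={precision(m['tp'], m['fp']):.3f}, "
          f"recall={recall(m['tp'], m['fn']):.3f}, "
          f"positive share={positive_share:.3f}")
```

Running this reproduces the published figures (adults: precision 0.909, recall 0.934; minors: precision 0.957, recall 0.497) and shows that only about 5% of adults' test messages are actually sarcastic versus about 43% of minors', which is the class-balance consideration several of the answer options turn on.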
Question 5
Engineers are working on retraining the sarcasm-detection model above to address inconsistencies in sarcasm-detection accuracy across age demographics, but the model has already been released into production. Which of the following stopgap strategies will help mitigate errors in the model's predictions?

- Restrict the model's usage to text messages sent by minors.
- Adjust the model output so that it returns "sarcastic" for all text messages sent by minors, regardless of what the model originally predicted.
- When the model predicts "not sarcastic" for a text message sent by a minor, adjust the output so the model returns "unsure" instead (see the sketch below).
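The third option in Question 5 is a post-processing rule applied on top of the deployed model's output. Here is a minimal sketch of that idea, assuming string labels and a known sender age group (the function name and signature are our own, not from any production system):

```python
def adjust_prediction(label: str, sender_is_minor: bool) -> str:
    """Stopgap post-processing while the model is retrained.

    Recall for minors is only ~0.497, so about half of minors' sarcastic
    messages are misclassified as "not sarcastic". Rather than surface that
    low-confidence negative, report "unsure" for this subgroup.
    """
    if sender_is_minor and label == "not sarcastic":
        return "unsure"
    return label

# Example usage:
print(adjust_prediction("not sarcastic", sender_is_minor=True))   # -> unsure
print(adjust_prediction("not sarcastic", sender_is_minor=False))  # -> not sarcastic
print(adjust_prediction("sarcastic", sender_is_minor=True))       # -> sarcastic
```

Returning "unsure" preserves the model's high-precision positive predictions for both subgroups while flagging exactly the region where its errors concentrate.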