Identifying Bias
In Exercise #1: Explore the Model, you confirmed that the model was
disproportionately classifying comments with identity terms as toxic.
Which metrics help explain the cause of this bias?
Explore the options below.
Accuracy
Accuracy measures the percent of total predictions that are correct—the percent of
predictions that are true positives or true negatives. Comparing accuracy for
different subgroups (such as different gender demographics) lets us evaluate the model's
relative performance for each group and can serve as an indicator of the effect of bias
on a model.
However, because accuracy considers correct and incorrect predictions in aggregate,
it doesn't distinguish between the two types of correct predictions and the two
types of incorrect predictions. Looking at accuracy alone, we can't determine
the underlying breakdowns of true positives, true negatives, false positives,
and false negatives, which would provide more insight into the source of the bias.
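As a rough illustration of this limitation, the sketch below uses made-up labels for two hypothetical subgroups (1 = toxic, 0 = nontoxic). The function names and data are assumptions for illustration, not part of the exercises. Both groups get the same accuracy even though one group's errors are all false positives and the other's are all false negatives.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = toxic, 0 = nontoxic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that are true positives or true negatives."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical toy data for two subgroups (not taken from the exercises).
group_a_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
group_a_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # two nontoxic comments flagged as toxic

group_b_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
group_b_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # two toxic comments missed

print(accuracy(group_a_true, group_a_pred))          # 0.8
print(accuracy(group_b_true, group_b_pred))          # 0.8
print(confusion_counts(group_a_true, group_a_pred))  # (2, 6, 2, 0)
print(confusion_counts(group_b_true, group_b_pred))  # (0, 8, 0, 2)
```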
False positive rate
False positive rate (FPR) is the percentage of actual-negative examples (nontoxic
comments) that were incorrectly classified as positives (toxic comments). FPR is an
indicator of the effect of bias on the model. When we compare the FPRs for
different subgroups (such as different gender demographics), we learn that text comments
that contain identity terms related to gender are more likely to be incorrectly
classified as toxic (false positives) than comments that don't contain these terms.
However, we're not looking to measure the effect of the bias; we want to find its cause.
To do so, we need to take a closer look into the inputs to the FPR formula.
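For reference, FPR is the number of false positives divided by the number of actual negatives (false positives plus true negatives). The minimal sketch below, with a hypothetical helper name and made-up counts, shows why the count of actual negatives matters: the same number of false positives produces a much higher FPR when there are few actual negatives to begin with.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN); the denominator is the count of actual negatives."""
    actual_negatives = fp + tn
    if actual_negatives == 0:
        raise ValueError("FPR is undefined when there are no actual negatives")
    return fp / actual_negatives


# Made-up counts: two false positives in each case, but very different
# numbers of actual negatives.
print(false_positive_rate(fp=2, tn=6))   # 0.25
print(false_positive_rate(fp=2, tn=98))  # 0.02
```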
Actual negatives and actual positives
In this model's training and test datasets, actual positives are all the examples
of comments that are toxic, and actual negatives are all the examples that are
nontoxic. Given that identity
terms themselves are neutral, we'd expect a balanced number of actual-negative and
actual-positive comments containing a given identity term. If we see a disproportionally
low number of actual negatives, that tells us that the model didn't see very many examples
of identity terms used in positive or neutral contexts. In that case, the model might
learn a correlation between identity terms and toxicity.
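One way to check this balance is to count the labels of the examples that contain a given term. The sketch below is only an illustration: the example comments, the placeholder word "identityterm", and the whitespace tokenization are all assumptions, and a real dataset would need more careful text matching.

```python
from collections import Counter

# Hypothetical labeled examples: (comment text, label), 1 = toxic, 0 = nontoxic.
# "identityterm" is a placeholder standing in for a real identity term.
examples = [
    ("identityterm people are awful", 1),
    ("i can't stand identityterm folks", 1),
    ("identityterm is the worst", 1),
    ("my friend is identityterm", 0),
    ("this movie is awful", 1),
    ("what a lovely day", 0),
    ("great article thanks", 0),
]


def label_balance(examples, term):
    """Count actual positives and negatives among comments containing `term`."""
    counts = Counter(
        label for text, label in examples if term in text.lower().split()
    )
    return {"actual_positives": counts[1], "actual_negatives": counts[0]}


print(label_balance(examples, "identityterm"))
# {'actual_positives': 3, 'actual_negatives': 1}
```

In this toy data, three of the four comments containing the term are toxic, which is the kind of imbalance that could lead a model to associate the term itself with toxicity.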
Recall
Recall is the percentage of actual positives that were correctly classified
as positives. It tells us the percentage of toxic comments the model successfully caught.
Here, we're concerned with bias related to false positives (nontoxic comments that were
classified as toxic), and recall doesn't provide any insight into this problem.
Which of the following actions might be effective methods of remediating bias in the
training data used in Exercise #1 and Exercise #2? Explore the options below.
Add more negative (nontoxic) examples containing identity terms to the training set.
Adding more negative examples (comments that are actually nontoxic) that
contain identity terms will help balance the training set. The model
will then see a better balance of identity terms used in toxic and nontoxic
contexts, so that it can learn that the terms themselves are neutral.
Add more positive (toxic) examples containing identity terms to the training set.
Toxic examples are already overrepresented in the subset of examples containing
identity terms. If we add even more of these examples to the training set,
we'll actually be exacerbating the existing bias rather than remediating it.
Add more negative (nontoxic) examples without identity terms to the training set.
Identity terms are already underrepresented in negative examples. Adding more negative
examples without identity terms would increase this imbalance and would not
help remediate the bias.
Add more positive (toxic) examples without identity terms to the training set.
It's possible that adding more positive examples without identity terms
might help break the association between identity terms and toxicity that
the model had previously learned.
Evaluating for Bias
You've trained your own text-toxicity classifier from scratch, which your engineering team
plans to use to automatically suppress display of comments classified as toxic. You're
concerned that any bias toward toxicity for gender-related comments might result in
suppression of nontoxic discourse about gender, and want to assess gender-related bias in
the classifier's predictions. Which of the following metrics should you use to evaluate
the model? Explore the options below.
False positive rate (FPR)
In production, the model will be used to automatically suppress positive (toxic)
predictions. Your goal is to ensure the model is not suppressing false positives (nontoxic
comments that the model misclassified as toxic) for gender-related comments at a higher
rate than for comments overall. Comparing
FPRs for gender subgroups to overall FPR is a great way to evaluate bias remediation
for your use case.
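The comparison described above could be sketched as follows. The evaluation rows, field order, and function name are hypothetical; the sketch assumes each row records whether the comment is actually toxic, whether the model predicted toxic, and whether the comment mentions gender.

```python
def false_positive_rate(pairs):
    """FPR over (is_toxic, predicted_toxic) pairs: FP / (FP + TN)."""
    fp = sum(1 for is_toxic, predicted in pairs if not is_toxic and predicted)
    tn = sum(1 for is_toxic, predicted in pairs if not is_toxic and not predicted)
    return fp / (fp + tn)


# Hypothetical rows: (is_toxic, predicted_toxic, mentions_gender).
rows = (
    [(False, True, True)] * 2      # gender-related, nontoxic, wrongly flagged
    + [(False, False, True)] * 2   # gender-related, nontoxic, correctly passed
    + [(True, True, True)]         # gender-related, toxic, correctly flagged
    + [(False, True, False)]       # other, nontoxic, wrongly flagged
    + [(False, False, False)] * 5  # other, nontoxic, correctly passed
    + [(True, True, False)]        # other, toxic, correctly flagged
)

overall_pairs = [(t, p) for t, p, _ in rows]
gender_pairs = [(t, p) for t, p, g in rows if g]

overall_fpr = false_positive_rate(overall_pairs)
gender_fpr = false_positive_rate(gender_pairs)

print(overall_fpr)  # 0.3
print(gender_fpr)   # 0.5
```

A subgroup FPR that sits well above the overall FPR, as in this toy data, suggests that nontoxic gender-related comments would be suppressed more often than nontoxic comments in general.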
False negative rate (FNR)
FNR measures the rate at which the model misclassifies the positive class (here, "toxic")
as the negative class ("nontoxic"). For this use case, it tells you the rate at which
actually toxic comments will slip through the filter and be displayed to users.
Here, your primary concern is how bias manifests in terms of suppression of nontoxic
discourse. FNR doesn't give you any insight into this dimension of the model's
performance.
Accuracy
Accuracy measures the percentage of model predictions that were correct, and inversely,
the percentage of predictions that were wrong. For this use case, accuracy tells you
how likely it is that the filter suppressed nontoxic discourse or displayed
toxic discourse. Your primary concern is the former issue, not the latter. Since accuracy
conflates the two issues, it's not the ideal evaluation metric to use here.
AUC
AUC summarizes how well a model separates the two classes across all classification thresholds. It's a good
metric for assessing overall performance. However, here you're specifically concerned
with comment suppression rates, and AUC doesn't give you direct insight into this
issue.
A content moderator has been added to your team, and the product manager has decided to change
how your classifier will be deployed. Instead of automatically suppressing the comments
classified as toxic, the filtering software will flag these comments for the content moderator
to review. Since a human will be reviewing comments labeled as toxic, bias will no longer
manifest in the form of content suppression. Which of the following metrics might you want
to use to measure bias—and the effect of bias remediation—now? Explore the options below.
False positive rate (FPR)
False positive rate will tell you the percentage of nontoxic comments that were
misclassified as toxic. Since a human moderator will now be auditing all comments the model
labels "toxic," and should catch most false positives, FPR is no longer a primary concern.
False negative rate (FNR)
While a human moderator will be auditing all comments labeled "toxic" and ensuring that
false positives are not suppressed, they will not be reviewing comments labeled "nontoxic."
This leaves open the possibility of bias related to false negatives. You can use FNR (the
percentage of actual positives that were classified as negatives) to systematically evaluate
whether toxic comments for gender subgroups are more likely to be labeled as nontoxic than
toxic comments overall.
Precision
Precision tells you the percentage of positive predictions that are actually
positive—in this case, the percentage of "toxic" predictions that are correct. Since a
human moderator will be auditing all the "toxic" predictions, you don't need to make
precision one of your primary evaluation metrics.
Recall
Recall tells you the percentage of actual positives that were classified correctly. From
this value, you can derive the percentage of actual positives that were misclassified
(1 – recall), which is a useful metric for gauging whether gender-related toxic
comments are disproportionally misclassified as "nontoxic" compared to comments
overall.
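The relationship between recall and FNR can be checked with a small sketch. The counts and function names below are made up for illustration; the subgroup counts are assumed to be a subset of the overall counts.

```python
def recall(tp, fn):
    """Share of actual positives (toxic comments) classified as positive."""
    return tp / (tp + fn)


def false_negative_rate(tp, fn):
    """Share of actual positives (toxic comments) classified as negative."""
    return fn / (tp + fn)


# Hypothetical counts of toxic comments caught (tp) and missed (fn).
gender_tp, gender_fn = 6, 2
overall_tp, overall_fn = 28, 4

print(recall(gender_tp, gender_fn))                 # 0.75
print(false_negative_rate(gender_tp, gender_fn))    # 0.25
print(recall(overall_tp, overall_fn))               # 0.875
print(false_negative_rate(overall_tp, overall_fn))  # 0.125
```

In this toy data, the gender subgroup has lower recall and a higher FNR than comments overall, so toxic gender-related comments are more likely to be missed.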