Pinpointing the Bias

When the Jigsaw team initially evaluated the Perspective API toxicity model, they found that it performed well on the test set. But they were concerned that bias could still manifest in the model's predictions if there were systematic errors in the training data. To check the quality of the training data, they took the additional step of auditing the labels provided by human raters and confirming they were accurate.

Yet despite these proactive steps to eliminate bias from the model's training data, users still uncovered a false-positive problem for comments containing identity terms. How did this happen?

A second audit of the training set revealed that the majority of comments containing identity terms for race, religion, and gender were labeled toxic. These labels were correct; most online comments containing these identity terms were indeed toxic. But as a result of this skew, the model learned a correlation between the presence of these identity terms and toxicity, a correlation that did not reflect the neutral connotations of the terms themselves.
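This kind of skew is straightforward to surface with a label audit. The sketch below is one illustrative way to quantify the imbalance, not the Jigsaw team's actual tooling: it assumes a hypothetical pandas DataFrame with comment_text and toxic columns and an invented identity-term list, and compares the toxic-label rate of identity comments against the overall rate.

```python
import pandas as pd

# Toy stand-in for a labeled training set: 1 = rated toxic, 0 = nontoxic.
# Column names and comments are invented for illustration.
df = pd.DataFrame({
    "comment_text": [
        "I am a proud gay man",
        "gay people should not be allowed here",
        "what a lovely day",
        "you are an idiot",
    ],
    "toxic": [0, 1, 0, 1],
})

# Illustrative (not exhaustive) list of identity terms to audit.
IDENTITY_TERMS = ["gay", "muslim", "jewish", "black", "white", "woman"]

# Flag comments that mention any identity term.
pattern = "|".join(IDENTITY_TERMS)
df["has_identity_term"] = df["comment_text"].str.lower().str.contains(pattern)

# Compare the toxic-label rate for identity comments with the overall rate.
overall_rate = df["toxic"].mean()
identity_rate = df.loc[df["has_identity_term"], "toxic"].mean()
print(f"Overall toxic-label rate:            {overall_rate:.2f}")
print(f"Toxic-label rate w/ identity terms:  {identity_rate:.2f}")
# A large gap between these two rates is exactly the skew that lets a model
# conflate the presence of an identity term with toxicity.
```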

The team had uncovered a critical gap in the model's training data: an area in which there was not enough data to represent a key aspect of reality. The training set did not contain enough examples of nontoxic comments containing identity terms for the model to learn that the terms themselves were neutral and that what mattered was the context in which they were used.
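To see how such a gap translates into false positives, consider the toy sketch below. A bag-of-words logistic regression stands in for the real model (which it is not), and the training comments are invented so that the identity term appears only in toxic examples. The classifier then leans toxic on a perfectly neutral sentence containing the term.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training comments: the identity term "gay" appears only in
# toxic examples, mirroring the skew described above.
train_texts = [
    "gay people are disgusting",        # toxic
    "all gay men should be banned",     # toxic
    "you gay idiots ruin everything",   # toxic
    "have a wonderful day everyone",    # nontoxic
    "thanks for the helpful comment",   # nontoxic
    "great article, well written",      # nontoxic
]
train_labels = [1, 1, 1, 0, 0, 0]       # 1 = toxic, 0 = nontoxic

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# The learned weight for the identity term is positive, meaning the term
# itself now pushes predictions toward "toxic".
vocab = model.named_steps["countvectorizer"].vocabulary_
weight = model.named_steps["logisticregression"].coef_[0][vocab["gay"]]
print(f"weight for 'gay': {weight:.2f}")

# A neutral comment containing the term is scored as more likely toxic than
# not, because no nontoxic training example ever showed the term in use.
neutral = "I am a proud gay man"
print(f"P(toxic) = {model.predict_proba([neutral])[0][1]:.2f}")
```

The fix suggested by this toy example is the same one implied by the audit: the training set needs enough nontoxic comments containing identity terms for the model to learn that context, not the terms themselves, signals toxicity.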