As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.
Where might bias lurk? Here are three red flags to look out for in your data set.
Missing Feature Values
If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
For example, the table below shows a summary of key stats for a subset of features in
the California Housing
dataset,
stored in a pandas DataFrame
and generated via DataFrame.describe
. Note that all features have a count
of 17000, indicating there
are no missing values:
longitude | latitude | total_rooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|
count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
mean | -119.6 | 35.6 | 2643.7 | 1429.6 | 501.2 | 3.9 | 207.3 |
std | 2.0 | 2.1 | 2179.9 | 1147.9 | 384.5 | 1.9 | 116.0 |
min | -124.3 | 32.5 | 2.0 | 3.0 | 1.0 | 0.5 | 15.0 |
25% | -121.8 | 33.9 | 1462.0 | 790.0 | 282.0 | 2.6 | 119.4 |
50% | -118.5 | 34.2 | 2127.0 | 1167.0 | 409.0 | 3.5 | 180.4 |
75% | -118.0 | 37.7 | 3151.2 | 1721.0 | 605.2 | 4.8 | 265.0 |
max | -114.3 | 42.0 | 37937.0 | 35682.0 | 6082.0 | 15.0 | 500.0 |
Suppose instead that three features (population
, households
, and median_income
)
only had a count of 3000
—in other words, that there were 14,000 missing values for
each feature:
longitude | latitude | total_rooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|
count | 17000.0 | 17000.0 | 17000.0 | 3000.0 | 3000.0 | 3000.0 | 17000.0 |
mean | -119.6 | 35.6 | 2643.7 | 1429.6 | 501.2 | 3.9 | 207.3 |
std | 2.0 | 2.1 | 2179.9 | 1147.9 | 384.5 | 1.9 | 116.0 |
min | -124.3 | 32.5 | 2.0 | 3.0 | 1.0 | 0.5 | 15.0 |
25% | -121.8 | 33.9 | 1462.0 | 790.0 | 282.0 | 2.6 | 119.4 |
50% | -118.5 | 34.2 | 2127.0 | 1167.0 | 409.0 | 3.5 | 180.4 |
75% | -118.0 | 37.7 | 3151.2 | 1721.0 | 605.2 | 4.8 | 265.0 |
max | -114.3 | 42.0 | 37937.0 | 35682.0 | 6082.0 | 15.0 | 500.0 |
These 14,000 missing values would make it much more difficult to accurately correlate median income of households with median house prices. Before training a model on this data, it would be prudent to investigate the cause of these missing values to ensure that there are no latent biases responsible for missing income and population data.
Unexpected Feature Values
When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.
For example, take a look at the following excerpted examples from the California housing data set:
longitude | latitude | total_rooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|
1 | -121.7 | 38.0 | 7105.0 | 3523.0 | 1088.0 | 5.0 | 0.2 |
2 | -122.4 | 37.8 | 2479.0 | 1816.0 | 496.0 | 3.1 | 0.3 |
3 | -122.0 | 37.0 | 2813.0 | 1337.0 | 477.0 | 3.7 | 0.3 |
4 | -103.5 | 43.8 | 2212.0 | 803.0 | 144.0 | 5.3 | 0.2 |
5 | -117.1 | 32.8 | 2963.0 | 1162.0 | 556.0 | 3.6 | 0.2 |
6 | -118.0 | 33.7 | 3396.0 | 1542.0 | 472.0 | 7.4 | 0.4 |
Can you pinpoint any unexpected feature values?
Data Skew
Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.
If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.
Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.
If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.