Datasets: Imbalanced datasets

Consider a dataset containing a categorical label whose value is either Positive or Negative. In a balanced dataset, the number of Positive and Negative labels is about equal. However, if one label is more common than the other label, then the dataset is imbalanced. The predominant label in an imbalanced dataset is called the majority class; the less common label is called the minority class.

The following table provides generally accepted names and ranges for different degrees of imbalance:

Percentage of data belonging to minority class    Degree of imbalance
20-40% of the dataset                             Mild
1-20% of the dataset                              Moderate
<1% of the dataset                                Extreme

For example, consider a virus detection dataset in which the minority class represents 0.5% of the dataset and the majority class represents 99.5%. Extremely imbalanced datasets like this one are common in medicine since most subjects won't have the virus.

Figure 5. An extremely imbalanced dataset: a bar graph in which one bar shows about 200 negative examples and the other shows 1 positive example.

Imbalanced datasets sometimes don't contain enough minority class examples to train a model properly. That is, with so few positive labels, the model trains almost exclusively on negative labels and can't learn enough about positive labels. For example, if only 0.5% of examples are positive and the batch size is 50, many batches won't contain any positive labels.
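To see why, here's a rough calculation (a sketch, assuming batches are drawn independently from a dataset that is 0.5% positive, as in the example above):

```python
# Sketch: why small batches are a problem on an extremely imbalanced dataset.
# The 0.5% positive rate and batch size of 50 come from the examples above;
# independence of draws is an assumption.
p_positive = 0.005   # minority (positive) class is 0.5% of the dataset
batch_size = 50

expected_positives = batch_size * p_positive       # 0.25 positives per batch
p_no_positives = (1 - p_positive) ** batch_size    # ~0.78

print(f"Expected positives per batch: {expected_positives:.2f}")
print(f"Chance a batch contains no positives: {p_no_positives:.0%}")
```

In other words, roughly three out of four batches would contain no positive examples at all.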

Often, especially for mildly imbalanced and some moderately imbalanced datasets, imbalance isn't a problem. So, you should first try training on the original dataset. If the model works well, you're done. If not, at least the suboptimal model provides a good baseline for future experiments. Afterwards, you can try the following techniques to overcome problems caused by imbalanced datasets.

Downsampling and upweighting

One way to handle an imbalanced dataset is to downsample and upweight the majority class. Here are the definitions of those two new terms:

  • Downsampling (in this context) means training on a disproportionately low subset of the majority class examples.
  • Upweighting means adding an example weight to the downsampled class equal to the factor by which you downsampled.

Step 1: Downsample the majority class. Consider the virus dataset described earlier, which has a ratio of 1 positive label for every 200 negative labels. Downsampling the majority class by a factor of 20 improves the balance to 1 positive for every 10 negatives (about 10%). Although the resulting training set is still moderately imbalanced, the proportion of positives to negatives is much better than the original extremely imbalanced proportion (0.5%).
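For concreteness, here is a minimal NumPy sketch of Step 1. The function name and the convention that label 1 marks the minority (positive) class are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def downsample_majority(features, labels, factor=20, seed=42):
    """Keep every minority example (label 1) and a random 1/factor subset of
    the majority examples (label 0)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(labels == 1)
    majority_idx = np.flatnonzero(labels == 0)

    # Randomly keep roughly 1 out of every `factor` majority examples.
    keep_count = len(majority_idx) // factor
    kept_majority_idx = rng.choice(majority_idx, size=keep_count, replace=False)

    kept_idx = np.concatenate([minority_idx, kept_majority_idx])
    rng.shuffle(kept_idx)
    return features[kept_idx], labels[kept_idx]
```

Applied to the virus dataset with factor=20, this keeps every positive example and about 1 in 20 negatives, yielding roughly the 1:10 training set illustrated in Figure 6.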

Figure 6. Downsampling: a bar graph in which one bar shows 20 negative examples and the other shows 1 positive example.

Step 2: Upweight the downsampled class. Add example weights to the downsampled class. After downsampling by a factor of 20, each example from the downsampled class should get an example weight of 20. (Yes, this might seem counterintuitive, but we'll explain why later on.)

Figure 7. Upweighting: a two-step diagram in which downsampling extracts random examples from the majority class (Step 1) and upweighting adds weight to those downsampled examples (Step 2).

The term weight doesn't refer to model parameters (like w1 or w2). Here, weight refers to example weights, which increase the importance of an individual example during training. An example weight of 10 means the model treats the example as 10 times as important (when computing loss) as it would an example of weight 1.

The weight should be equal to the factor you used to downsample:

\[\text{example weight} = \text{original example weight} \times \text{downsampling factor}\]
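As a sketch of Step 2, these weights can be materialized as a per-example array and then passed to whatever example-weighting mechanism your training framework provides (many frameworks accept a sample_weight argument during training). The function below is illustrative:

```python
import numpy as np

def upweight_downsampled_class(labels, downsampling_factor=20):
    """Return example weights: downsampled majority examples (label 0) get a
    weight equal to the downsampling factor; minority examples keep weight 1."""
    weights = np.ones(labels.shape, dtype=float)
    weights[labels == 0] *= downsampling_factor
    return weights

# For instance, with a framework that accepts per-example weights:
# model.fit(features, labels, sample_weight=upweight_downsampled_class(labels))
```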

It may seem odd to add example weights after downsampling. After all, you are trying to make the model improve on the minority class, so why upweight the majority class? In fact, upweighting the majority class tends to reduce prediction bias. That is, upweighting after downsampling tends to reduce the delta between the average of your model's predictions and the average of your dataset's labels.
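To make that concrete, you can measure prediction bias as the gap between the average prediction and the average label. A minimal sketch, assuming predictions and labels are NumPy arrays and the weights are whatever example weights you applied:

```python
import numpy as np

def prediction_bias(predictions, labels, example_weights=None):
    """Difference between the (weighted) mean prediction and the (weighted)
    mean label; values near zero indicate little prediction bias."""
    return (np.average(predictions, weights=example_weights)
            - np.average(labels, weights=example_weights))
```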

Rebalance ratios

How much should you downsample and upweight to rebalance your dataset? To determine the answer, you should experiment with the rebalancing ratio, just as you would experiment with other hyperparameters. That said, the answer ultimately depends on the following factors:

  • The batch size
  • The imbalance ratio
  • The number of examples in the training set

Ideally, each batch should contain multiple minority class examples. Batches that don't contain enough minority class examples will train very poorly. The batch size should be several times greater than the imbalance ratio. For example, if the imbalance ratio is 100:1, then the batch size should be at least 500.
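A quick way to sanity-check a candidate batch size is to estimate how many minority class examples a batch would contain on average, assuming the data is well shuffled. The helper below is illustrative:

```python
def expected_minority_per_batch(batch_size, imbalance_ratio):
    """Expected minority examples per batch for a majority:minority ratio of
    imbalance_ratio:1, assuming uniformly shuffled examples."""
    minority_fraction = 1 / (imbalance_ratio + 1)
    return batch_size * minority_fraction

print(expected_minority_per_batch(500, 100))   # ~5 minority examples per batch
print(expected_minority_per_batch(128, 100))   # ~1.3: probably too few
```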

Exercise: Check your understanding

Consider the following situation:

  • The batch size is 128.
  • The imbalance ratio is 100:1.
  • The training set contains one billion examples.

Which of the following statements are true?

  • Increasing the batch size to 1,024 will improve the resulting model.
    True. With a batch size of 1,024, each batch will average about 10 minority class examples, which should be sufficient for training. Without downsampling, the training set continues to contain one billion examples.
  • Downsampling (and upweighting) to 20:1 while keeping the batch size at 128 will improve the resulting model.
    True. Each batch will average about 6 minority class examples, which should be sufficient for training. The downsampling effectively reduces the number of examples in the training set from one billion to roughly 200 million.
  • The current situation is fine.
    False. Most batches won't contain enough minority class examples to train a useful model.