Machine learning would be a breeze if every loss curve declined smoothly
to a low, stable value the first time we trained our model.
But in reality, loss curves can be quite challenging to interpret. Use your
understanding of loss curves to answer the following questions.
1. My Model Won't Train!
You and your friend Mel continue working on a unicorn appearance predictor.
Here's your first loss curve.
Describe the problem and how Mel could fix it:
Your model is not converging. Try these debugging steps:

- Check whether your features can predict the labels by following the
  steps in Model Debugging.
- Check your data against a data schema to detect bad examples.
- If training looks unstable, as in this plot, reduce your learning rate
  to prevent the model from bouncing around in parameter space.
- Simplify your dataset to 10 examples that you know your model can
  predict. Obtain a very low loss on the reduced dataset, then continue
  debugging your model on the full dataset.
- Simplify your model and ensure it outperforms your baseline. Then
  incrementally add complexity back to the model.
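The learning-rate step above can be illustrated with a minimal sketch: plain gradient descent on a toy quadratic loss. The loss function, learning rates, and step count are all illustrative assumptions, not part of the exercise.

```python
def gradient_descent(lr, steps=50):
    """Minimize the toy loss f(w) = w^2 with plain gradient descent."""
    w = 5.0
    losses = []
    for _ in range(steps):
        grad = 2 * w  # df/dw
        w -= lr * grad
        losses.append(w * w)
    return losses

# Too-high learning rate: each step overshoots the minimum, so the
# loss bounces around in parameter space and grows.
unstable = gradient_descent(lr=1.1)
# Smaller learning rate: smooth convergence toward zero loss.
stable = gradient_descent(lr=0.1)

print(unstable[-1] > unstable[0])  # True: loss diverged
print(stable[-1] < 1e-6)           # True: loss converged
```

The same instability shows up in real training loops: if halving the learning rate turns a jagged loss curve into a smooth one, the rate was too high.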
2. My Loss Exploded!
Mel shows you another curve. What’s going wrong here and how can she fix it?
Write your answer below.
A large increase in loss is typically caused by anomalous values in the
input data. Possible causes are:

- NaNs in the input data.
- Exploding gradients due to anomalous data.
- Division by zero.
- Logarithm of zero or negative numbers.
To fix an exploding loss, check for anomalous data in your batches and in
your engineered features. If the anomaly appears problematic, investigate
the cause. If the anomaly instead looks like outlying data, ensure the
outliers are evenly distributed between batches by shuffling your data.
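As a sketch of that first check, here is one way to flag rows of a batch containing NaN or infinite values; the batch contents are made up for illustration.

```python
import numpy as np

def find_anomalies(batch):
    """Return indices of rows containing NaN or infinite values."""
    bad = ~np.isfinite(batch).all(axis=1)
    return np.flatnonzero(bad)

batch = np.array([
    [0.5, 1.2],
    [np.nan, 0.3],   # NaN feature
    [2.0, np.inf],   # exploding value
    [0.1, 0.4],
])

print(find_anomalies(batch))  # rows 1 and 2 are anomalous
```

Running a check like this on each batch just before the loss spikes usually narrows the problem down to a handful of bad examples.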
3. My Metrics are Contradictory!
Mel wants your take on another curve. What’s going wrong and
how can she fix it? Write your answer below.
Describe the problem and how Mel could fix it:
Recall is stuck at 0 because your examples' classification probability
is never higher than the threshold
for positive classification. This situation often occurs
in problems with a large
class imbalance. Remember that
ML libraries, such as TF Keras, typically use a default threshold of 0.5 to
calculate classification metrics.
Try these steps:

- Lower your classification threshold.
- Check threshold-invariant metrics, such as AUC.
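To see why recall can be stuck at 0, here is a small sketch computing recall at two thresholds; the probabilities and labels are invented toy values for an imbalanced problem.

```python
import numpy as np

def recall_at_threshold(probs, labels, threshold):
    """Recall = true positives / actual positives at a given threshold."""
    preds = probs >= threshold
    tp = np.sum(preds & (labels == 1))
    return tp / np.sum(labels == 1)

# Imbalanced toy data: the positives never score above 0.5.
probs = np.array([0.45, 0.40, 0.30, 0.10, 0.05])
labels = np.array([1, 1, 0, 0, 0])

print(recall_at_threshold(probs, labels, 0.5))   # 0.0: recall stuck at zero
print(recall_at_threshold(probs, labels, 0.35))  # 1.0 after lowering the threshold
```

Because the default 0.5 threshold sits above every positive example's score, no prediction is ever positive; lowering the threshold (or evaluating AUC, which considers all thresholds) reveals that the model actually ranks the classes correctly.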
4. Testing Loss is Too Damn High!
Mel shows you the loss curves for training and testing datasets and asks,
"What's wrong?" Write your answer below.
Describe the problem and how Mel could fix it:
Your model is overfitting to the training data. Try these steps:

- Reduce model capacity.
- Add regularization.
- Check that the training and test splits are statistically equivalent.
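As a sketch of what regularization does, the following compares ordinary least squares to its L2-regularized (ridge) version on synthetic data. The true weights, noise level, and penalty strength are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 20)

def fit(l2):
    """Least squares with an L2 (ridge) penalty: w = (X'X + l2*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)

w_plain = fit(l2=0.0)
w_reg = fit(l2=10.0)

# The penalty shrinks the weights, constraining effective model capacity.
print(np.linalg.norm(w_reg) < np.linalg.norm(w_plain))  # True
```

Shrinking the weights limits how sharply the model can fit noise in the training set, which is why regularization typically narrows the gap between training and testing loss.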
5. My Model Gets Stuck
You're patient when Mel returns a few days later with yet another curve. What's
going wrong here and how can Mel fix it?
Describe the problem and how Mel could fix it:
Your loss is showing repetitive, step-like behavior, which suggests that
the input data seen by your model is itself repetitive. Ensure that your
data is shuffled well enough to remove this repetition.
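A small sketch of how shuffling removes repetitive structure from the input order; the class-sorted labels here are a toy stand-in for your training data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Labels sorted by class: the model would see long runs of identical
# labels, which can produce repetitive, step-like loss behavior.
labels = np.array([0] * 6 + [1] * 6)

def longest_run(a):
    """Length of the longest run of identical consecutive values."""
    best = run = 1
    for prev, cur in zip(a, a[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

print(longest_run(labels))  # 6: highly repetitive order

# Shuffling keeps the same examples but breaks up the long runs,
# so each batch sees a representative mix of classes.
shuffled = rng.permutation(labels)
print(longest_run(shuffled))
```

In practice the same idea applies at the pipeline level: shuffle once across the whole dataset (or with a large shuffle buffer), not just within each batch.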
It's Working!
"It's working perfectly now!" Mel exclaims. She leans back into her chair
triumphantly and heaves a big sigh. The curve looks great and you beam with
accomplishment. You and Mel take a moment to discuss the following
additional checks for validating your model.