Loss is a numerical metric that describes how wrong a model's predictions are. Loss measures the distance between the model's predictions and the actual labels. The goal of training a model is to minimize the loss.
In the following image, you can visualize loss as arrows drawn from the data points to the model. The arrows show how far the model's predictions are from the actual values.
Figure 9. Loss is measured from the actual value to the predicted value.
Distance of loss
In statistics and machine learning, loss measures the difference between the predicted and actual values. Loss focuses on the distance between the values, not the direction. For example, if a model predicts 2, but the actual value is 5, we don't care that the difference is negative ($ 2-5=-3 $). Instead, we care that the distance between the values is $ 3 $. Thus, all methods for calculating loss remove the sign.
The two most common methods to remove the sign are the following:
- Take the absolute value of the difference between the actual value and the prediction.
- Square the difference between the actual value and the prediction.
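The two sign-removal methods above can be sketched in plain Python (no libraries needed), using the example of a prediction of 2 against an actual value of 5:

```python
# Two ways to remove the sign of a prediction error.
actual = 5
predicted = 2

absolute_difference = abs(actual - predicted)    # |5 - 2| = 3
squared_difference = (actual - predicted) ** 2   # (5 - 2)^2 = 9

print(absolute_difference)  # 3
print(squared_difference)   # 9
```

Both values are non-negative, but note that squaring already treats the same error differently than the absolute value does, which is the basis for the loss types below.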
Types of loss
In linear regression, there are four main types of loss, which are outlined in the following table.
Loss type | Definition | Equation |
---|---|---|
L_{1} loss | The sum of the absolute values of the difference between the predicted values and the actual values. | $ \sum \lvert actual\ value - predicted\ value \rvert $ |
Mean absolute error (MAE) | The average of L_{1} losses across a set of examples. | $ \frac{1}{N} \sum \lvert actual\ value - predicted\ value \rvert $ |
L_{2} loss | The sum of the squared differences between the predicted values and the actual values. | $ \sum (actual\ value - predicted\ value)^2 $ |
Mean squared error (MSE) | The average of L_{2} losses across a set of examples. | $ \frac{1}{N} \sum (actual\ value - predicted\ value)^2 $ |
The functional difference between L_{1} loss and L_{2} loss (or between MAE and MSE) is squaring. When the difference between the prediction and label is large, squaring makes the loss even larger. When the difference is small (less than 1), squaring makes the loss even smaller.
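The four definitions in the table can be written as short Python functions (the function names here are illustrative, not from any library), and the last two lines demonstrate the squaring effect just described:

```python
# Hypothetical helper functions implementing the four loss types.
def l1_loss(actuals, predictions):
    return sum(abs(a - p) for a, p in zip(actuals, predictions))

def mean_absolute_error(actuals, predictions):
    return l1_loss(actuals, predictions) / len(actuals)

def l2_loss(actuals, predictions):
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))

def mean_squared_error(actuals, predictions):
    return l2_loss(actuals, predictions) / len(actuals)

# Squaring amplifies large errors and shrinks small ones:
print(0.5 ** 2)  # 0.25 -- smaller than the absolute error of 0.5
print(3 ** 2)    # 9    -- much larger than the absolute error of 3
```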
When processing multiple examples at once, we recommend averaging the losses across all the examples, whether using MAE or MSE.
Calculating loss example
Using the previous best fit line, we'll calculate L_{2} loss for a single example. From the best fit line, we had the following values for weight and bias:
- $ \small{Weight: -3.6} $
- $ \small{Bias: 30} $
If the model predicts that a 2,370-pound car (a feature value of 2.37, with weight expressed in thousands of pounds) gets 21.5 miles per gallon, but it actually gets 24 miles per gallon, we would calculate the L_{2} loss as follows:
Value | Equation | Result |
---|---|---|
Prediction | $\small{bias + (weight * feature\ value)}$ $\small{30 + (-3.6 * 2.37)}$ | $\small{21.5}$ |
Actual value | $\small{label}$ | $\small{24}$ |
L_{2} loss | $\small{(prediction - actual\ value)^2}$ $\small{(21.5 - 24)^2}$ | $\small{6.25}$ |
In this example, the L_{2} loss for that single data point is 6.25.
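The same calculation can be checked in a few lines of Python (the prediction is rounded to one decimal place, matching the 21.5 shown above):

```python
# Recomputing the worked L2 loss example.
weight, bias = -3.6, 30
feature_value = 2.37  # car weight in thousands of pounds
label = 24            # actual miles per gallon

prediction = bias + weight * feature_value  # 30 + (-3.6 * 2.37) = 21.468, ~21.5
l2 = (round(prediction, 1) - label) ** 2
print(l2)  # 6.25
```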
Choosing a loss
Deciding whether to use MAE or MSE can depend on the dataset and the way you want to handle certain predictions. Most feature values in a dataset typically fall within a distinct range. For example, cars normally weigh between 2,000 and 5,000 pounds and get between 8 and 50 miles per gallon. An 8,000-pound car, or a car that gets 100 miles per gallon, is outside the typical range and would be considered an outlier.
An outlier can also refer to how far off a model's predictions are from the real values. For instance, a 3,000-pound car or a car that gets 40 miles per gallon is within the typical ranges. However, a 3,000-pound car that gets 40 miles per gallon would be an outlier in terms of the model's prediction, because the model would predict that a 3,000-pound car gets between 18 and 20 miles per gallon.
When choosing the best loss function, consider how you want the model to treat outliers. For instance, MSE moves the model more toward the outliers, while MAE doesn't. L_{2} loss incurs a much higher penalty for an outlier than L_{1} loss. For example, the following images show a model trained using MAE and a model trained using MSE. The red line represents a fully trained model that will be used to make predictions. The outliers are closer to the model trained with MSE than to the model trained with MAE.
Figure 10. A model trained with MSE moves the model closer to the outliers.
Figure 11. A model trained with MAE is farther from the outliers.
Note the relationship between the model and the data:
- **MSE.** The model is closer to the outliers but farther away from most of the other data points.
- **MAE.** The model is farther away from the outliers but closer to most of the other data points.
Check Your Understanding
Consider the following two plots: