Linear regression: Gradient descent

Gradient descent is a mathematical technique that iteratively finds the weights and bias that produce the model with the lowest loss. It does so by repeating the following process for a user-defined number of iterations.

The model begins training with randomized weights and biases near zero, and then repeats the following steps:

1. Calculate the loss with the current weight and bias.
2. Determine the direction to move the weights and bias that reduces loss.
3. Move the weight and bias values a small amount in the direction that reduces loss.
4. Return to step one and repeat the process until the model can't reduce the loss any further.
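The four steps above can be sketched as a short loop. This is a minimal illustration rather than a definitive implementation: the function name `gradient_descent`, the callback `loss_and_slopes`, and the toy loss are all assumptions made for the example, and the loop runs for a fixed number of iterations instead of testing for convergence.

```python
def gradient_descent(loss_and_slopes, weight=0.0, bias=0.0,
                     step_size=0.01, iterations=100):
    """Repeat the four steps for a fixed, user-defined number of iterations.

    `loss_and_slopes(weight, bias)` is a hypothetical callback that returns
    (loss, weight_slope, bias_slope) for the current parameters.
    """
    history = []
    for _ in range(iterations):
        # Step 1: calculate the loss with the current weight and bias.
        # Step 2: the slopes indicate which direction reduces the loss.
        loss, weight_slope, bias_slope = loss_and_slopes(weight, bias)
        history.append(loss)
        # Step 3: move a small amount against the slope (downhill).
        weight -= step_size * weight_slope
        bias -= step_size * bias_slope
        # Step 4: loop back to step 1.
    return weight, bias, history


# Toy loss chosen only for illustration: (w - 3)^2 + (b + 1)^2,
# whose minimum is at weight = 3, bias = -1.
def toy_loss(w, b):
    return (w - 3) ** 2 + (b + 1) ** 2, 2 * (w - 3), 2 * (b + 1)


w, b, history = gradient_descent(toy_loss, step_size=0.1, iterations=200)
print(round(w, 3), round(b, 3))  # 3.0 -1.0
```

Passing the loss in as a callback keeps the loop independent of any particular model or loss function.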

The diagram below outlines the iterative steps gradient descent performs to find the weights and bias that produce the model with the lowest loss.

Figure 12. Gradient descent is an iterative process that finds the weights and bias that produce the model with the lowest loss.

To make this concrete, we can walk through the gradient descent steps using a small dataset of seven examples, each giving a car's heaviness in thousands of pounds and its miles per gallon rating:

Pounds in 1000s (feature)	Miles per gallon (label)
3.5	18
3.69	15
3.44	18
3.43	16
4.34	15
4.42	14
2.37	24

The model starts training by setting the weight and bias to zero:

$$ \small{Weight:\ 0} $$ $$ \small{Bias:\ 0} $$ $$ \small{y = 0 + 0(x_1)} $$

Calculate MSE loss with the current model parameters:

$$ \small{Loss = \frac{(18-0)^2 + (15-0)^2 + (18-0)^2 + (16-0)^2 + (15-0)^2 + (14-0)^2 + (24-0)^2}{7}} $$ $$ \small{Loss= 303.71} $$
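The same loss calculation can be checked in a few lines of Python. This is a sketch using only the seven examples from the table; the names `pounds`, `mpg`, `predict`, and `mse` are illustrative choices.

```python
# Feature (pounds in 1000s) and label (miles per gallon) from the table.
pounds = [3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37]
mpg = [18, 15, 18, 16, 15, 14, 24]


def predict(x, weight, bias):
    """Linear model: y = bias + weight * x."""
    return weight * x + bias


def mse(xs, ys, weight, bias):
    """Mean of the squared differences between predictions and labels."""
    squared_errors = [(predict(x, weight, bias) - y) ** 2
                      for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(ys)


loss = mse(pounds, mpg, weight=0.0, bias=0.0)
print(round(loss, 2))  # 303.71
```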

Calculate the slope of the tangent to the loss function at each weight and the bias:

$$ \small{Weight\ slope: -119.7} $$ $$ \small{Bias\ slope: -34.3} $$

To get the slope of the lines tangent to the loss function at the current weight and bias, we take the derivative of the loss function with respect to the weight and the bias, and then evaluate the resulting expressions.

We'll write the equation for making a prediction as:
$ f_{w,b}(x) = (w*x)+b $.

We'll write the actual value as: $ y $.

We'll calculate MSE using:
$ \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 $
where $i$ indexes the training examples ($x_{(i)}$ and $y_{(i)}$ are the feature and label of the $i$-th example) and $M$ represents the number of examples.

Weight derivative

The derivative of the loss function with respect to the weight is written as:
$ \frac{\partial }{\partial w} \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 $

and evaluates to:
$ \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)}) * 2x_{(i)} $

For each example, we take the predicted value minus the actual value and multiply it by two times the feature value. Then we sum these products and divide the sum by the number of examples. The result is the slope of the line tangent to the loss function at the current value of the weight.

If we evaluate this expression with a weight and bias equal to zero, we get -119.7 for the line's slope.
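The step from the derivative to its evaluated form is an application of the chain rule. Spelled out for a single example, using the same symbols as above:

$$ \frac{\partial}{\partial w} (f_{w,b}(x_{(i)}) - y_{(i)})^2 = 2\,(f_{w,b}(x_{(i)}) - y_{(i)}) \cdot \frac{\partial}{\partial w}\big((w * x_{(i)}) + b - y_{(i)}\big) = (f_{w,b}(x_{(i)}) - y_{(i)}) * 2x_{(i)} $$

With the weight and bias at zero, every prediction is zero, so for the seven examples in the table:

$$ \frac{1}{7} \sum_{i=1}^{7} (0 - y_{(i)}) * 2x_{(i)} = -\frac{2}{7} \sum_{i=1}^{7} x_{(i)} y_{(i)} = -\frac{2}{7}(419.01) \approx -119.7 $$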

Bias derivative

The derivative of the loss function with respect to the bias is written as:
$ \frac{\partial }{\partial b} \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 $

and evaluates to:
$ \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)}) * 2 $

For each example, we take the predicted value minus the actual value and multiply it by two. Then we sum these products and divide the sum by the number of examples. The result is the slope of the line tangent to the loss function at the current value of the bias.

If we evaluate this expression with a weight and bias equal to zero, we get -34.3 for the line's slope.
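Both derivatives translate directly into code. The following sketch evaluates them on the seven examples with the weight and bias at zero; the function name `mse_slopes` is an assumption made for the example.

```python
pounds = [3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37]
mpg = [18, 15, 18, 16, 15, 14, 24]


def mse_slopes(xs, ys, weight, bias):
    """Return (weight_slope, bias_slope) of the MSE loss at (weight, bias)."""
    m = len(ys)
    # Predicted value minus actual value, one entry per example.
    errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
    # Weight derivative: mean of error * 2 * feature value.
    weight_slope = sum(e * 2 * x for e, x in zip(errors, xs)) / m
    # Bias derivative: mean of error * 2.
    bias_slope = sum(e * 2 for e in errors) / m
    return weight_slope, bias_slope


weight_slope, bias_slope = mse_slopes(pounds, mpg, weight=0.0, bias=0.0)
print(round(weight_slope, 1), round(bias_slope, 1))  # -119.7 -34.3
```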

Move a small amount in the direction of the negative slope to get the next weight and bias. For now, we'll arbitrarily define the "small amount" as 0.01:

$$ \small{New\ weight = old\ weight - (small\ amount * weight\ slope)} $$ $$ \small{New\ bias = old\ bias - (small\ amount * bias\ slope)} $$ $$ \small{New\ weight = 0 - (0.01)*(-119.7)} $$ $$ \small{New\ bias = 0 - (0.01)*(-34.3)} $$ $$ \small{New\ weight = 1.2} $$ $$ \small{New\ bias = 0.34} $$
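The update rule is a one-liner per parameter. A minimal sketch, with `update` and `small_amount` as illustrative names and the slopes taken from the rounded values above:

```python
def update(weight, bias, weight_slope, bias_slope, small_amount=0.01):
    """Move each parameter a small amount against its slope."""
    new_weight = weight - small_amount * weight_slope
    new_bias = bias - small_amount * bias_slope
    return new_weight, new_bias


new_weight, new_bias = update(0.0, 0.0, weight_slope=-119.7, bias_slope=-34.3)
print(round(new_weight, 2), round(new_bias, 2))  # 1.2 0.34
```

Because both slopes are negative, subtracting them increases the weight and the bias, which is the direction that reduces the loss.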

Use the new weight and bias to calculate the loss and repeat. Completing the process for six iterations, we'd get the following weights, biases, and losses:

Iteration	Weight	Bias	Loss (MSE)
1	0	0	303.71
2	1.2	0.34	170.67
3	2.75	0.59	67.3
4	3.17	0.72	50.63
5	3.47	0.82	42.1
6	3.68	0.9	37.74

You can see that the loss gets lower with each updated weight and bias. In this example, we stopped after six iterations. In practice, a model trains until it converges. When a model converges, additional iterations no longer reduce the loss meaningfully because gradient descent has found the weights and bias that nearly minimize the loss.

If the model continues to train past convergence, loss begins to fluctuate in small amounts as the model continually updates the parameters around their lowest values. This can make it hard to verify that the model has actually converged. To confirm the model has converged, you'll want to continue training until the loss has stabilized.
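One way to act on "continue training until the loss has stabilized" is to stop only after the loss has barely changed for several iterations in a row. The sketch below assumes the MSE loss and the seven-example dataset from the walkthrough; the names `train_until_stable`, `tolerance`, and `patience`, and their default values, are illustrative choices rather than anything the text prescribes.

```python
pounds = [3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37]
mpg = [18, 15, 18, 16, 15, 14, 24]


def loss_and_slopes(weight, bias):
    """Return (MSE loss, weight slope, bias slope) for the dataset."""
    m = len(mpg)
    errors = [(weight * x + bias) - y for x, y in zip(pounds, mpg)]
    loss = sum(e ** 2 for e in errors) / m
    weight_slope = sum(2 * e * x for e, x in zip(errors, pounds)) / m
    bias_slope = sum(2 * e for e in errors) / m
    return loss, weight_slope, bias_slope


def train_until_stable(step_size=0.01, tolerance=1e-6, patience=10,
                       max_iterations=100_000):
    """Train until the loss changes by less than `tolerance` for
    `patience` consecutive iterations, or until `max_iterations`."""
    weight, bias = 0.0, 0.0
    previous_loss, _, _ = loss_and_slopes(weight, bias)
    stable_steps = 0
    for iteration in range(1, max_iterations + 1):
        _, weight_slope, bias_slope = loss_and_slopes(weight, bias)
        weight -= step_size * weight_slope
        bias -= step_size * bias_slope
        loss, _, _ = loss_and_slopes(weight, bias)
        # Count consecutive iterations where the loss barely changed.
        if abs(previous_loss - loss) < tolerance:
            stable_steps += 1
        else:
            stable_steps = 0
        previous_loss = loss
        if stable_steps >= patience:
            return weight, bias, loss, iteration
    return weight, bias, previous_loss, max_iterations


weight, bias, final_loss, iterations_used = train_until_stable()
print(iterations_used, round(final_loss, 2))
```

Requiring several stable iterations in a row, rather than one, guards against stopping on a single small fluctuation.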

Model convergence and loss curves

When training a model, you'll often look at a loss curve to determine if the model has converged. The loss curve shows how the loss changes as the model trains. The following is what a typical loss curve looks like. Loss is on the y-axis and iterations are on the x-axis:

Figure 13. Loss curve showing a steep decline and then a gentle decline, with the model converging around the 1,000th-iteration mark.

You can see that loss decreases dramatically during the first few iterations, then decreases gradually before flattening out around the 1,000th-iteration mark. After 1,000 iterations, we can be fairly confident that the model has converged.
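The data behind a loss curve is just the loss recorded at every iteration. As a rough sketch, the code below records it for the seven-example dataset from the walkthrough (not the data behind the figure) and compares the drop over the first ten iterations with the drop over the last ten; `loss_history` and the step size of 0.01 are assumptions made for the example.

```python
pounds = [3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37]
mpg = [18, 15, 18, 16, 15, 14, 24]


def loss_history(iterations, step_size=0.01):
    """Run gradient descent on MSE and record the loss at each iteration."""
    weight, bias = 0.0, 0.0
    m = len(mpg)
    history = []
    for _ in range(iterations):
        errors = [(weight * x + bias) - y for x, y in zip(pounds, mpg)]
        history.append(sum(e ** 2 for e in errors) / m)
        weight -= step_size * sum(2 * e * x for e, x in zip(errors, pounds)) / m
        bias -= step_size * sum(2 * e for e in errors) / m
    return history


history = loss_history(1000)
early_drop = history[0] - history[10]    # steep part of the curve
late_drop = history[-11] - history[-1]   # gentle part of the curve
print(early_drop > late_drop)  # True
```

The `history` list can be passed to any plotting tool, with the list index on the x-axis and the loss on the y-axis.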

In the following figures, we draw the model at three points during the training process: the beginning, the middle, and the end. Visualizing the model's state at snapshots during the training process solidifies the link between updating the weights and bias, reducing loss, and model convergence.

In the figures, we use the derived weights and bias at a particular iteration to represent the model. In the graph with the data points and the model snapshot, blue loss lines from the model to the data points show the amount of loss. The longer the lines, the more loss there is.
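The loss lines correspond to the per-example differences between the label and the model's prediction. As a small sketch, the code below computes them for the seven-example dataset from the walkthrough, using the weight and bias listed for iteration 2 in the table; the function name `loss_lines` is an illustrative choice.

```python
pounds = [3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37]
mpg = [18, 15, 18, 16, 15, 14, 24]


def loss_lines(xs, ys, weight, bias):
    """Vertical distance from each data point to the model line."""
    return [y - (weight * x + bias) for x, y in zip(xs, ys)]


# Model snapshot early in training (iteration 2 in the table).
lines = loss_lines(pounds, mpg, weight=1.2, bias=0.34)

# MSE is the mean of the squared line lengths.
snapshot_mse = sum(length ** 2 for length in lines) / len(lines)
print(round(snapshot_mse, 2))  # 170.67
```

Every line is long at this point in training, which is why the loss for the snapshot is high.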

In the following figure, we can see that around the second iteration the model would not be good at making predictions because of the high amount of loss.

Figure 14. Loss curve and snapshot of the model at the beginning of the training process. The model line tilts away from the data points.

At around the 400th iteration, we can see that gradient descent has found the weight and bias that produce a better model.

Figure 15. Loss curve and snapshot of the model about midway through training. The model line cuts through the data points but not at the optimal angle.

And at around the 1,000th iteration, we can see that the model has converged, producing a model with the lowest possible loss.

Figure 16. Loss curve and snapshot of the model near the end of the training process. The model line fits the data well.

Exercise: Check your understanding

What's the role of gradient descent in linear regression?

(a) Gradient descent is an iterative process that finds the best weights and bias that minimize the loss. (Correct answer.)

(b) Gradient descent helps to determine what type of loss to use when training a model, for example, L₁ or L₂. (Incorrect: gradient descent is not involved in the selection of a loss function for model training.)

(c) Gradient descent removes outliers from the dataset to help the model make better predictions. (Incorrect: gradient descent doesn't change the dataset.)

Convergence and convex functions

The loss functions used for linear regression, such as MSE, always produce a convex surface. Because of this property, when a linear regression model converges, we know the model has found the weights and bias that produce the lowest loss.
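One way to see the convexity for the single-feature MSE loss defined earlier is through its matrix of second derivatives (the Hessian). This is a standard calculus check added here for illustration, not part of the original walkthrough. Differentiating the weight and bias derivatives once more gives:

$$ H = \frac{2}{M} \begin{pmatrix} \sum_{i=1}^{M} x_{(i)}^2 & \sum_{i=1}^{M} x_{(i)} \\ \sum_{i=1}^{M} x_{(i)} & M \end{pmatrix} $$

For any direction $ (a, c) $ in the weight-bias plane, the curvature along that direction is:

$$ \begin{pmatrix} a & c \end{pmatrix} H \begin{pmatrix} a \\ c \end{pmatrix} = \frac{2}{M} \sum_{i=1}^{M} (a * x_{(i)} + c)^2 \geq 0 $$

The curvature is never negative, regardless of the current weight and bias, so the surface curves upward (or is flat) in every direction, which is what makes it convex.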

If we graph the loss surface for a model with one feature, we can see its convex shape. The following is the loss surface of the miles per gallon dataset used in the previous examples. Weight is on the x-axis, bias is on the y-axis, and loss is on the z-axis:

Figure 17. 3-D graph of the loss surface, showing its convex shape.

In this example, a weight of -5.44 and bias of 35.94 produce the lowest loss at 5.54:

Figure 18. 3-D graph of the loss surface showing the weight and bias values that produce the lowest loss, with (-5.44, 35.94, 5.54) at the bottom.

A linear model converges when it's found the minimum loss. Therefore, additional iterations only cause gradient descent to move the weight and bias values in very small amounts around the minimum. If we graphed the weights and bias points during gradient descent, the points would look like a ball rolling down a hill, finally stopping at the point where there's no more downward slope.

Figure 19. Convex 3-D loss surface showing gradient descent points moving to, and stopping at, the lowest point on the graph.

Notice that the black loss points create the exact shape of the loss curve: a steep decline before gradually sloping down until they've reached the lowest point on the loss surface.

It's important to note that the model almost never finds the exact minimum for each weight and bias, but instead finds a value very close to it. It's also important to note that the minimum doesn't correspond to zero loss; it's only the combination of weight and bias values that produces the lowest loss the model can achieve.

Using the weight and bias values that produce the lowest loss—in this case a weight of -5.44 and a bias of 35.94—we can graph the model to see how well it fits the data:

Figure 20. Graph of pounds in 1000s vs. miles per gallon, with the model graphed using the weight and bias values that produce the lowest loss.

This would be the best model for this dataset because no other weight and bias values produce a model with lower loss.
