In regression problems, it makes sense to define the signed error as the difference between the prediction and the label. However, in other types of problems this strategy often leads to poor results. A better strategy used in gradient boosting is to:
- Define a loss function similar to the loss functions used in neural networks. For example, the entropy (also known as log loss) for a classification problem.
- Train the weak model to predict the gradient of the loss with respect to the strong model output.
Formally, given a loss function $L(y,p)$ where $y$ is a label and $p$ a prediction, the pseudo response $z_i$ used to train the weak model at step $i$ is:

$$z_i = \frac{\partial L(y, F_i)}{\partial F_i}$$

where:
- $F_i$ is the prediction of the strong model.
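To make this concrete, here is a minimal sketch (not YDF's implementation) of a few gradient boosting steps for a binary classification problem with log loss, assuming scikit-learn's `DecisionTreeRegressor` as the weak model; the toy dataset, shrinkage value, and tree depth are illustrative choices:

```python
# A minimal sketch (not YDF's implementation) of a few gradient boosting steps
# for binary classification with log loss, assuming scikit-learn's
# DecisionTreeRegressor as the weak model. The dataset, shrinkage, and tree
# depth are illustrative choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dataset: one numerical feature, binary labels.
features = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
labels = (features[:, 0] > 0.3).astype(float)

strong_prediction = np.zeros(len(labels))  # F_0: the strong model starts at 0 (logits).
shrinkage = 0.1

for step in range(3):
    # Pseudo response: gradient of the log loss with respect to the strong
    # model output. For log loss on a logit F, dL/dF = sigmoid(F) - y.
    pseudo_response = sigmoid(strong_prediction) - labels

    # Train the weak model to predict the gradient.
    weak_model = DecisionTreeRegressor(max_depth=3)
    weak_model.fit(features, pseudo_response)

    # Move the strong model against the gradient, damped by the shrinkage.
    strong_prediction -= shrinkage * weak_model.predict(features)

final_loss = -np.mean(labels * np.log(sigmoid(strong_prediction))
                      + (1 - labels) * np.log(1 - sigmoid(strong_prediction)))
print("log loss after 3 steps:", final_loss)
```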
The preceding example was a regression problem: the objective is to predict a numerical value. In the case of regression, the squared error is a common loss function:

$$L(y,p) = (y - p)^2$$

In this case, the gradient is:

$$\frac{\partial L(y, F_i)}{\partial F_i} = \frac{\partial (y - F_i)^2}{\partial F_i} = -2 (y - F_i) = 2 (F_i - y)$$

In other words, the gradient is the signed error from our example multiplied by a factor of 2. Note that constant factors do not matter because of the shrinkage. Note that this equivalence is only true for regression problems with squared error loss. For other supervised learning problems (for example, classification, ranking, regression with percentile loss), there is no equivalence between the gradient and a signed error.
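As a quick sanity check, the following snippet (illustrative values, not from the original text) compares a finite-difference estimate of the squared error gradient with twice the signed error:

```python
# A small numerical check (illustrative values) that the gradient of the
# squared error with respect to the prediction equals twice the signed error.
label, prediction = 3.0, 4.5

def squared_error(y, p):
    return (y - p) ** 2

# Finite-difference estimate of dL/dp at the current prediction.
eps = 1e-6
numerical_gradient = (squared_error(label, prediction + eps)
                      - squared_error(label, prediction - eps)) / (2 * eps)

signed_error = prediction - label  # Prediction minus label, as defined above.

print(numerical_gradient)   # ~3.0
print(2 * signed_error)     # 3.0
```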
Leaf and structure optimization with a Newton's method step
Newton's method is an optimization method like gradient descent. However, unlike gradient descent, which only uses the gradient of the function being optimized, Newton's method uses both the gradient (first derivative) and the second derivative of the function.
A step of gradient descent is as follows:

$$x_{i+1} = x_i - \frac{df}{dx}(x_i) = x_i - f'(x_i)$$

and a step of Newton's method is as follows:

$$x_{i+1} = x_i - \frac{\frac{df}{dx}(x_i)}{\frac{d^2 f}{dx^2}(x_i)} = x_i - \frac{f'(x_i)}{f''(x_i)}$$
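The following sketch contrasts one step of each method on a simple one-dimensional quadratic; the function and starting point are illustrative assumptions:

```python
# A minimal sketch contrasting one gradient descent step (without a learning
# rate, as in the formulas above) with one Newton step on a simple
# one-dimensional function. The function and starting point are illustrative.
def f(x):
    return (x - 3.0) ** 2 + 1.0   # Minimum at x = 3.

def f_prime(x):
    return 2.0 * (x - 3.0)        # First derivative.

def f_second(x):
    return 2.0                    # Second derivative (constant for a quadratic).

x = 0.0

# Gradient descent uses only the first derivative.
gradient_descent_step = x - f_prime(x)

# Newton's method divides the first derivative by the second derivative.
newton_step = x - f_prime(x) / f_second(x)

print("gradient descent step:", gradient_descent_step)  # 6.0: overshoots the minimum.
print("newton step:", newton_step)                      # 3.0: lands exactly on the minimum.
```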
Optionally, Newton's method can be integrated into the training of gradient boosted trees in two ways:
- Once a tree is trained, a Newton step is applied on each leaf and overrides its value. The tree structure is untouched; only the leaf values change (see the sketch after this list).
- During the growth of a tree, conditions are selected according to a score that includes a component of the Newton formula. The structure of the tree is impacted.
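As an illustration of option 1, here is a minimal sketch (not YDF's actual implementation) of a Newton step for a single leaf: the leaf value is set to the negative sum of the gradients divided by the sum of the second derivatives (hessians) of the examples reaching that leaf; the per-example values below are illustrative:

```python
# A minimal sketch (not YDF's implementation) of option 1: after a tree is
# grown, each leaf value is replaced by a Newton step computed from the
# gradients and second derivatives (hessians) of the examples in that leaf.
import numpy as np

def newton_leaf_value(gradients, hessians):
    # One Newton step: move against the summed gradient, scaled by the
    # inverse of the summed second derivative.
    return -np.sum(gradients) / np.sum(hessians)

# Illustrative per-example gradients and hessians for one leaf (e.g. for log
# loss: gradient = sigmoid(F) - y, hessian = sigmoid(F) * (1 - sigmoid(F))).
gradients = np.array([0.2, -0.7, -0.6, 0.1])
hessians = np.array([0.16, 0.21, 0.24, 0.09])

print(newton_leaf_value(gradients, hessians))  # The optimized leaf value.
```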
- YDF always applies a Newton step on the leaf (option 1).
- You can enable option 2 with `use_hessian_gain=True`.
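For example, a hedged usage sketch assuming the Python `ydf` package (check the YDF documentation for the exact API):

```python
# A hedged usage sketch, assuming the Python `ydf` package and its
# GradientBoostedTreesLearner; the toy dataset is illustrative.
import numpy as np
import pandas as pd
import ydf

rng = np.random.default_rng(0)
feature = rng.normal(size=200)
dataset = pd.DataFrame({
    "feature": feature,
    "label": (feature > 0.0).astype(int),
})

# Option 2: also use the hessian (Newton) term when scoring split conditions.
learner = ydf.GradientBoostedTreesLearner(label="label", use_hessian_gain=True)
model = learner.train(dataset)
```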