[null,null,["最后更新时间 (UTC):2025-07-27。"],[[["\u003cp\u003eGradient boosting creates a strong predictive model by iteratively combining multiple weak models, typically decision trees.\u003c/p\u003e\n"],["\u003cp\u003eIn each iteration, a new weak model is trained to predict the errors of the current strong model, and then added to the strong model to improve its accuracy.\u003c/p\u003e\n"],["\u003cp\u003eShrinkage, similar to learning rate in neural networks, is used to control the learning speed and prevent overfitting by scaling the contribution of each weak model.\u003c/p\u003e\n"],["\u003cp\u003eGradient boosted trees are a specific implementation of gradient boosting that utilizes decision trees as the weak learners.\u003c/p\u003e\n"],["\u003cp\u003eTensorFlow Decision Forests provides a practical implementation through \u003ccode\u003etfdf.keras.GradientBoostedTreeModel\u003c/code\u003e, streamlining the model building process.\u003c/p\u003e\n"]]],[],null,["# Gradient Boosted Decision Trees\n\nLike bagging and boosting, gradient boosting is a methodology applied on top of\nanother machine learning algorithm. Informally, **gradient boosting** involves\ntwo types of models:\n\n- a \"weak\" machine learning model, which is typically a decision tree.\n- a \"strong\" machine learning model, which is composed of multiple weak models.\n\nIn gradient boosting, at each step, a new weak model is trained to predict the\n\"error\" of the current strong model (which is called the **pseudo response**).\nWe will detail \"error\" later. For now, assume \"error\" is the difference between\nthe prediction and a regressive label. The weak model (that is, the \"error\") is\nthen added to the strong model with a negative sign to reduce the error of the\nstrong model.\n\nGradient boosting is iterative. Each iteration invokes the following formula:\n\n\\\\\\[\nF_{i+1} = F_i - f_i\n\\\\\\]\n\nwhere:\n\n- $F_i$ is the strong model at step $i$.\n- $f_i$ is the weak model at step $i$.\n\nThis operation repeats until a stopping criterion is met, such as a maximum\nnumber of iterations or if the (strong) model begins to overfit as measured on a\nseparate validation dataset.\n\nLet's illustrate gradient boosting on a simple regression dataset where:\n\n- The objective is to predict $y$ from $x$.\n- The strong model is initialized to be a zero constant: $F_0(x) = 0$.\n\n**Note:** The following code is for educational aid only. In practice, you will simply call `tfdf.keras.GradientBoostedTreeModel`. \n\n # Simplified example of regressive gradient boosting.\n\n y = ... # the labels\n x = ... # the features\n\n strong_model = []\n strong_predictions = np.zeros_like(y) # Initially, the strong model is empty.\n\n for i in range(num_iters):\n\n # Error of the strong model\n error = strong_predictions - y\n\n # The weak model is a decision tree (see CART chapter)\n # without pruning and a maximum depth of 3.\n weak_model = tfdf.keras.CartModel(\n task=tfdf.keras.Task.REGRESSION,\n validation_ratio=0.0,\n max_depth=3)\n weak_model.fit(x=x, y=error)\n\n strong_model.append(weak_model)\n\n weak_predictions = weak_model.predict(x)[:,0]\n\n strong_predictions -= weak_predictions\n\nLet's apply this code on the following dataset:\n\n**Figure 25. A synthetic regressive dataset with one numerical feature.**\n\nHere are three plots after the first iteration of the gradient boosting\nalgorithm:\n\n**Figure 26. 
Let's apply the simplified training loop to the following dataset:

**Figure 25. A synthetic regression dataset with one numerical feature.**

Here are three plots after the first iteration of the gradient boosting
algorithm:

**Figure 26. Three plots after the first iteration.**

Note the following about the plots in Figure 26:

- The first plot shows the predictions of the strong model, which is currently always 0.
- The second plot shows the error, which is the label of the weak model.
- The third plot shows the weak model.

The first weak model learns a coarse representation of the label and mostly
focuses on the left part of the feature space (the part with the most variation,
and therefore the most error for the strong model, which is still a constant 0).

Following are the same plots for another iteration of the algorithm:

**Figure 27. Three plots after the second iteration.**

Note the following about the plots in Figure 27:

- The strong model now contains the prediction of the weak model of the previous iteration.
- The new error of the strong model is a bit smaller.
- The new prediction of the weak model now focuses on the right part of the feature space.

We run the algorithm for 8 more iterations:

**Figure 28. Three plots after the third iteration and the tenth iteration.**

In Figure 28, note that the prediction of the strong model starts to resemble
[the plot of the dataset](#OriginalPlot).

These figures illustrate the gradient boosting algorithm using decision trees as
weak learners. This combination is called **gradient boosted (decision) trees**.

The preceding plots suggest the essence of gradient boosting. However, this
example lacks the following two real-world operations:

- Shrinkage
- The optimization of leaf values with one step of Newton's method

**Note:** In practice, there are multiple variants of the gradient boosting
algorithm with other operations.

Shrinkage
---------

The weak model $f_i$ is multiplied by a small value $\nu$ (for example,
$\nu = 0.1$) before being added to the strong model $F_i$. This small value is
called the **shrinkage**. In other words, instead of using the following formula
at each iteration:

$$
F_{i+1} = F_i - f_i
$$

each iteration uses the following formula:

$$
F_{i+1} = F_i - \nu f_i
$$

Shrinkage in gradient boosting is analogous to the learning rate in neural
networks. Shrinkage controls how fast the strong model learns, which helps limit
overfitting. That is, a shrinkage value closer to 0.0 reduces overfitting more
than a shrinkage value closer to 1.0.

In the code above, shrinkage would be implemented as follows:

    shrinkage = 0.1   # 0.1 is a common shrinkage value.
    strong_predictions -= shrinkage * weak_predictions
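Continuing the simplified example (not the TF-DF API), here is a minimal sketch
of how the trained list of weak models could be used to predict on new examples,
with the shrinkage applied to each weak model's contribution. The helper
`predict_strong_model` is a hypothetical name introduced for illustration.

    # Sketch: inference with the simplified strong model built above.
    import numpy as np

    def predict_strong_model(weak_models, x_new, shrinkage=0.1):
      """Sums the negated, shrunk predictions of all weak models."""
      # The strong model starts at the constant 0 (that is, F_0), and each
      # weak model's prediction is subtracted from it, scaled by the shrinkage.
      predictions = np.zeros(len(x_new), dtype=np.float32)
      for weak_model in weak_models:
        predictions -= shrinkage * weak_model.predict(x_new)[:, 0]
      return predictions

    new_predictions = predict_strong_model(strong_model, x)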