Other topics
This unit examines the following topics:
- interpreting random forests
- training random forests
- pros and cons of random forests
Interpreting random forests
Random forests are more complex to interpret than decision trees. A random forest contains decision trees trained with random noise, so it is harder to make judgments about the structure of any single tree. However, we can interpret random forest models in a couple of ways.
One approach to interpreting a random forest is simply to train and interpret a decision tree with the CART algorithm. Because both the random forest and CART are trained with the same core algorithm, they "share the same global view" of the dataset. This option works well for simple datasets and for understanding the overall interpretation of the model.
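For example, a single CART tree can be trained and visualized on the same training data. Below is a minimal sketch, assuming the tf_train_dataset used in the usage example later in this unit and a notebook environment for plotting:

import tensorflow_decision_forests as tfdf

# Train a single CART decision tree on the same data as the random forest.
cart_model = tfdf.keras.CartModel()
cart_model.fit(tf_train_dataset)

# Render the tree structure in a Colab/notebook cell to read off the splits.
tfdf.model_plotter.plot_model_in_colab(cart_model, tree_idx=0, max_depth=4)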
Variable importances are another good interpretability approach. For example, the following table ranks the variable importance of different features for a random forest model trained on the Census dataset (also known as Adult).
Table 8. Variable importance of 14 different features.

| Feature        | Sum score | Mean decrease in accuracy | Mean decrease in AUC | Mean min depth | Num nodes | Mean decrease in PR-AUC | Num as root |
|----------------|-----------|---------------------------|----------------------|----------------|-----------|-------------------------|-------------|
| relationship   | 4203592.6 | 0.0045                    | 0.0172               | 4.970          | 57040     | 0.0093                  | 1095        |
| capital_gain   | 3363045.1 | 0.0199                    | 0.0194               | 2.852          | 56468     | 0.0655                  | 457         |
| marital_status | 3128996.3 | 0.0018                    | 0.0230               | 6.633          | 52391     | 0.0107                  | 750         |
| age            | 2520658.8 | 0.0065                    | 0.0074               | 4.969          | 356784    | 0.0033                  | 200         |
| education      | 2015905.4 | 0.0018                    | -0.0080              | 5.266          | 115751    | -0.0129                 | 205         |
| occupation     | 1939409.3 | 0.0063                    | -0.0040              | 5.017          | 221935    | -0.0060                 | 62          |
| education_num  | 1673648.4 | 0.0023                    | -0.0066              | 6.009          | 58303     | -0.0080                 | 197         |
| fnlwgt         | 1564189.0 | -0.0002                   | -0.0038              | 9.969          | 431987    | -0.0049                 | 0           |
| hours_per_week | 1333976.3 | 0.0030                    | 0.0007               | 6.393          | 206526    | -0.0031                 | 20          |
| capital_loss   | 866863.8  | 0.0060                    | 0.0020               | 8.076          | 58531     | 0.0118                  | 1           |
| workclass      | 644208.4  | 0.0025                    | -0.0019              | 9.898          | 132196    | -0.0023                 | 0           |
| native_country | 538841.2  | 0.0001                    | -0.0016              | 9.434          | 67211     | -0.0058                 | 0           |
| sex            | 226049.3  | 0.0002                    | 0.0002               | 10.911         | 37754     | -0.0011                 | 13          |
| race           | 168180.9  | -0.0006                   | -0.0004              | 11.571         | 42262     | -0.0031                 | 0           |
As you can see, different definitions of variable importance have different scales, and they can lead to differences in the ranking of the features.
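In TF-DF, the variable importances of a trained model can be listed through the model inspector. Here is a minimal sketch, assuming model is a fitted tfdf.keras.RandomForestModel; the exact importance keys available depend on the library version:

inspector = model.make_inspector()

# Names of the available variable-importance measures,
# e.g. "NUM_AS_ROOT", "NUM_NODES", "SUM_SCORE", "INV_MEAN_MIN_DEPTH".
print(inspector.variable_importances().keys())

# Features ranked by how often they are used as the root of a tree.
print(inspector.variable_importances()["NUM_AS_ROOT"])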
Variable importances that come from the model structure (for example, the sum score, mean min depth, num nodes, and num as root columns in the table above) are computed in a similar way for decision trees (see the "CART | Variable importance" section) and for random forests.
Permutation variable importances (for example, the mean decrease in {accuracy, AUC, PR-AUC} in the table above) are model-agnostic measures that can be computed for any machine learning model with a validation dataset. With random forests, however, you can compute permutation variable importances with out-of-bag evaluation instead of a validation dataset.
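In TF-DF, out-of-bag permutation importances can be requested at training time. The sketch below assumes a compute_oob_variable_importances training flag; check the library documentation for your version:

# Request out-of-bag permutation variable importances during training.
model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
model.fit(tf_train_dataset)

# The permutation importances (e.g. mean decrease in accuracy) then appear
# in the inspector alongside the structural importances.
print(model.make_inspector().variable_importances())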
SHAP (SHapley Additive exPlanations) is a model-agnostic method for explaining individual predictions or for model-wise interpretation. (See Interpretable Machine Learning by Molnar for an introduction to model-agnostic interpretation.) Computing SHAP values is ordinarily expensive, but it can be sped up significantly for decision forests, so it is a good way to interpret decision forests.
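The fast TreeSHAP algorithm is implemented in the open-source shap package for tree libraries such as scikit-learn and XGBoost; it does not read TF-DF models directly, so the sketch below uses a scikit-learn random forest purely to illustrate the API (the dataset and model here are assumptions, not part of this course's code):

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Train a small scikit-learn forest as a stand-in model.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeSHAP computes exact SHAP values in polynomial time for tree ensembles.
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X.iloc[:100])

# Summarize per-feature contributions across the explained examples.
shap.summary_plot(shap_values, X.iloc[:100])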
Usage example
In the previous lesson, we trained a CART decision tree on a small dataset by calling tfdf.keras.CartModel. To train a random forest model, simply replace tfdf.keras.CartModel with tfdf.keras.RandomForestModel:
model = tfdf.keras.RandomForestModel()  # instead of tfdf.keras.CartModel()
model.fit(tf_train_dataset)             # same training dataset as before
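As a follow-up, the trained forest can be inspected and evaluated. A sketch, assuming a tf_test_dataset prepared the same way as tf_train_dataset:

# Print per-tree statistics and variable importances of the trained forest.
model.summary()

# Evaluate on a held-out split.
model.compile(metrics=["accuracy"])
print(model.evaluate(tf_test_dataset, return_dict=True))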
Pros and cons
This section contains a quick summary of the pros and cons of random forests.
Pros:
- Like decision trees, random forests natively support numerical and categorical features and often need no feature preprocessing.
- Because the decision trees are independent of one another, a random forest can be trained in parallel, so you can train random forests quickly.
- Random forests have default parameters that often give great results, and tuning those parameters usually has little effect on the model; the sketch after this list shows the corresponding training options.
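As a rough sketch of those training options in TF-DF (the num_trees and num_threads arguments below are assumptions about the hyperparameter names, and the values are illustrative, not tuned):

model = tfdf.keras.RandomForestModel(
    num_trees=300,   # the default; usually a good starting point
    num_threads=8,   # trees are independent, so training parallelizes well
)
model.fit(tf_train_dataset)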
Cons:
- Because the decision trees are not pruned, they can be large. Models with more than 1 million nodes are common. The size (and therefore the inference speed) of the random forest can sometimes be an issue.
- Random forests cannot learn and reuse internal representations. Each decision tree (and each branch of each decision tree) must relearn the dataset's patterns. In some datasets, notably non-tabular datasets such as images and text, this leads random forests to worse results than other methods.