# Appropriate data for decision forests
Decision forests are most effective when you have a tabular dataset (data you might represent in a spreadsheet, CSV file, or database table). Tabular data is one of the most common data formats, and decision forests should be your "go-to" solution for modeling it.
**Table 1. An example of a tabular dataset.**
| Number of legs | Number of eyes | Weight (lbs) | Species (label) |
|----------------|----------------|--------------|-----------------|
| 2              | 2              | 12           | Penguin         |
| 8              | 6              | 0.1          | Spider          |
| 4              | 2              | 44           | Dog             |
| ...            | ...            | ...          | ...             |
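A dataset like Table 1 maps directly onto a dataframe. The sketch below represents it with pandas; the column names are illustrative, not part of the original table.

```python
import pandas as pd

# The example dataset from Table 1 as a DataFrame.
# "species" is the label column; the rest are features.
df = pd.DataFrame({
    "num_legs": [2, 8, 4],
    "num_eyes": [2, 6, 2],
    "weight_lbs": [12.0, 0.1, 44.0],
    "species": ["penguin", "spider", "dog"],  # label
})

# Split into features and label, the usual input shape
# for a decision forest library.
features = df.drop(columns=["species"])
labels = df["species"]
print(features.shape)  # (3, 3)
```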
Unlike neural networks, decision forests natively consume tabular data. When developing decision forests, you don't have to do tasks like the following:
- Perform preprocessing like feature normalization or one-hot encoding.
- Perform imputation (for example, replacing a missing value with `-1`).
However, decision forests are not well suited to directly consuming non-tabular data (also called unstructured data), such as images or text. Workarounds for this limitation do exist, but neural networks generally handle unstructured data better.
## Performance

Decision forests are sample efficient. That is, decision forests are well suited for training on small datasets, or on datasets where the ratio of number of features to number of examples is high (possibly greater than 1). Even though decision forests are sample efficient, like all machine learning models, decision forests perform best when lots of data is available.
Decision forests typically run inference faster than comparable neural networks. For example, a medium-size decision forest completes inference in a few microseconds on a modern CPU.
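The rough timing sketch below illustrates measuring per-example latency. Note that scikit-learn's Python-level `predict()` overhead puts each call well above the microsecond regime quoted above, which is achieved by optimized C++ serving runtimes; the numbers printed here are hardware-dependent and for illustration only.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic tabular data: 1000 examples, 10 features,
# label determined by the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average wall-clock time of a single-example predict() call.
x = X[:1]
start = time.perf_counter()
for _ in range(100):
    forest.predict(x)
elapsed = (time.perf_counter() - start) / 100
print(f"~{elapsed * 1e6:.0f} microseconds per predict() call")
```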
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated (UTC): 2025-07-27.