种植决策树

与所有监督式机器学习模型一样，决策树的训练目标是尽可能准确地解释一组训练示例。决策树的最佳训练是一个 NP 难问题。因此，训练通常使用启发词语（一种易于创建的学习算法）来完成，该算法会生成非最优但接近最优的决策树。

用于训练决策树的大多数算法都采用贪心分而治之策略。该算法首先创建一个节点（根节点），然后以贪心的方式递归地扩展决策树。

在每个节点，系统都会评估所有可能的条件并为其评分。算法会选择“最佳”条件，即得分最高的条件。目前，只需知道得分是与任务相关的指标，并且条件的选择是为了尽可能提高该指标。

例如，在 Palmer Penguins 数据集（用于本课程后面的代码示例）中，大多数阿德利企鹅和巴布亚企鹅的嘴长度大于 16 毫米，而大多数颊纹企鹅的嘴较小。因此，条件 bill_length_mm ≥ 16 可以对冠企鹅进行近乎完美的预测，但无法区分帝企鹅和巴布亚企鹅。算法可能会选择此条件。

一个条件导致两个叶子。条件为“bill_length_mm >= 16”。如果是，则为“阿德利企鹅或巴布亚企鹅”。如果没有，则叶子为“Gentoo”。

图 7. 种植树的第一步。

然后，该算法会在两个子节点上独立递归重复。如果找不到任何满足条件的节点，则该节点会变为叶节点。叶子预测是根据示例中最具代表性的标签值确定的。

算法如下：

def train_decision_tree(training_examples):
  root = create_root() # Create a decision tree with a single empty root.
  grow_tree(root, training_examples) # Grow the root node.
  return root

def grow_tree(node, examples):
  condition = find_best_condition(examples) # Find the best condition.

  if condition is None:
    # No satisfying conditions were found, therefore the grow of the branch stops.
    set_leaf_prediction(node, examples)
    return

  # Create two childrens for the node.
  positive_child, negative_child = split_node(node, condition)

  # List the training examples used by each children.
  negative_examples = [example for example in examples if not condition(example)]
  positive_examples = [example for example in examples if condition(example)]

  # Continue the growth of the children.
  grow_tree(negative_child, negative_examples)
  grow_tree(positive_child, positive_examples)

我们来详细了解训练特定决策树的步骤。

第 1 步：创建根目录：

带有问号的节点。