Out-of-bag evaluation
Random forests do not require a validation dataset. Most random forests use a technique called out-of-bag evaluation (OOB evaluation) to evaluate the quality of the model. OOB evaluation treats the training set as if it were the test set of a cross-validation.
As explained earlier, each decision tree in a random forest is typically trained on ~67% of the training examples. Therefore, each decision tree does not see the remaining ~33% of the training examples. The core idea of OOB evaluation is as follows:
- Evaluate the random forest on the training set.
- For each example, use only the decision trees that did not see that example during training.
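The two steps above can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the depth-1 trees ("stumps"), the helper names (`train_stump`, `train_forest`, `oob_accuracy`), and the toy one-feature dataset are all assumptions made for the example.

```python
import random

def majority(labels):
    """Majority vote over binary labels (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))

def train_stump(xs, ys, rows):
    """Fit a depth-1 tree (one threshold on x) on the bootstrap sample `rows`."""
    values = sorted({xs[i] for i in rows})
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for t in thresholds:
        left = [ys[i] for i in rows if xs[i] <= t]
        right = [ys[i] for i in rows if xs[i] > t]
        left_label, right_label = majority(left or right), majority(right or left)
        errors = sum(v != left_label for v in left) + sum(v != right_label for v in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]  # (threshold, left_label, right_label)

def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return right_label if x > threshold else left_label

def train_forest(xs, ys, num_trees, rng):
    """Train each stump on a bootstrap sample and remember which rows it saw."""
    forest = []
    for _ in range(num_trees):
        rows = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        forest.append((train_stump(xs, ys, rows), set(rows)))
    return forest

def oob_accuracy(forest, xs, ys):
    """Score each training example using only the trees that did not see it."""
    correct = scored = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        votes = [predict_stump(stump, x) for stump, seen in forest if i not in seen]
        if not votes:  # every tree saw this example: no OOB prediction
            continue
        scored += 1
        correct += majority(votes) == y
    return correct / scored, scored

rng = random.Random(0)
xs = list(range(40))             # hypothetical 1-D feature
ys = [int(x >= 20) for x in xs]  # hypothetical binary label
forest = train_forest(xs, ys, num_trees=25, rng=rng)
accuracy, scored = oob_accuracy(forest, xs, ys)
print(f"OOB accuracy: {accuracy:.3f} over {scored} of {len(xs)} examples")
```

No separate validation set is used here: every prediction that contributes to the score comes from trees for which the example was effectively unseen.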
The following table illustrates OOB evaluation of a random forest with 3 decision trees trained on 6 examples. (Yes, this is the same table as in the Bagging section.) The table shows which decision tree is used with which example during OOB evaluation.
Table 7. OOB evaluation. The numbers represent the number of times a given training example is used during training of the given decision tree.
Training examples | #1 | #2 | #3 | #4 | #5 | #6 | Examples for OOB evaluation |
---|---|---|---|---|---|---|---|
original dataset | 1 | 1 | 1 | 1 | 1 | 1 | |
decision tree 1 | 1 | 1 | 0 | 2 | 1 | 1 | #3 |
decision tree 2 | 3 | 0 | 1 | 0 | 2 | 0 | #2, #4, and #6 |
decision tree 3 | 0 | 1 | 3 | 1 | 0 | 1 | #1 and #5 |
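The last column of Table 7 can be derived mechanically from the counts: an example is out-of-bag for a tree exactly when its count for that tree is 0. The short sketch below uses the counts from the table; the names `counts` and `oob_trees` are illustrative only.

```python
# Keys: decision trees 1-3. Values: usage counts for training examples #1-#6 (Table 7).
counts = {
    1: [1, 1, 0, 2, 1, 1],
    2: [3, 0, 1, 0, 2, 0],
    3: [0, 1, 3, 1, 0, 1],
}

def oob_trees(counts, example):
    """Return the trees that never saw `example` (1-based) during training."""
    return [tree for tree, row in counts.items() if row[example - 1] == 0]

oob_by_example = {example: oob_trees(counts, example) for example in range(1, 7)}
for example, trees in oob_by_example.items():
    print(f"example #{example}: OOB prediction uses decision tree(s) {trees}")
```

In this small table every example happens to be out-of-bag for exactly one tree, so each OOB prediction comes from a single decision tree.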
In the example shown in Table 7, the OOB prediction for training example #1 is computed with decision tree 3 only, since decision trees 1 and 2 both used this example for training. In practice, on a reasonably sized dataset, even with relatively few decision trees, nearly all the examples have an OOB prediction.
OOB evaluation is available when the model is trained with `compute_oob_performances=True`.
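The claim that almost every example ends up with an OOB prediction can be checked with a rough calculation. The sketch below assumes that each tree includes a given example independently, with the ~67% in-bag probability quoted earlier; the function name and the tree counts are illustrative.

```python
def prob_no_oob_prediction(num_trees, in_bag_fraction=0.67):
    """Probability that a given example was seen by every tree (assumes independent trees)."""
    return in_bag_fraction ** num_trees

for num_trees in (3, 10, 30, 100):
    p = prob_no_oob_prediction(num_trees)
    print(f"{num_trees:>3} trees: P(example has no OOB prediction) = {p:.2e}")
```

Under these assumptions the probability shrinks geometrically with the number of trees, so the examples left without an OOB prediction quickly become a negligible fraction of the training set.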
OOB evaluation is also effective for computing permutation variable importance for random forest models. Remember from Variable importances that permutation variable importance measures the importance of a variable by measuring the drop in model quality when this variable is shuffled. The random forest "OOB permutation variable importance" is a permutation variable importance computed using the OOB evaluation.
OOB permutation variable importances are available when the model is trained with `compute_oob_variable_importances=True`.
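As a rough illustration of OOB permutation variable importance, the sketch below trains a small forest of depth-1 trees on a toy dataset in which feature 0 determines the label and feature 1 is noise, then measures the drop in OOB accuracy after shuffling each feature. It is one simple variant under stated assumptions (the whole column is shuffled once, accuracy is the quality metric), not the procedure of any particular library; all function names and the dataset are hypothetical.

```python
import random

def majority(labels):
    """Majority vote over binary labels (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))

def train_stump(X, y, rows):
    """Fit a depth-1 tree on the bootstrap sample `rows`, searching all features."""
    best = None
    for f in range(len(X[0])):
        values = sorted({X[i][f] for i in rows})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in thresholds:
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            l, r = majority(left or right), majority(right or left)
            errors = sum(v != l for v in left) + sum(v != r for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l, r)
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, row):
    f, t, l, r = stump
    return r if row[f] > t else l

def oob_accuracy(forest, X, y):
    """Accuracy where each example is scored only by trees that did not see it."""
    correct = scored = 0
    for i, row in enumerate(X):
        votes = [predict_stump(s, row) for s, seen in forest if i not in seen]
        if votes:
            scored += 1
            correct += majority(votes) == y[i]
    return correct / scored

def oob_permutation_importance(forest, X, y, feature, rng):
    """Drop in OOB accuracy after shuffling one feature column."""
    column = [row[feature] for row in X]
    rng.shuffle(column)
    shuffled = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return oob_accuracy(forest, X, y) - oob_accuracy(forest, shuffled, y)

rng = random.Random(0)
# Feature 0 determines the label; feature 1 is pure noise.
X = [[float(i), rng.random()] for i in range(40)]
y = [int(row[0] >= 20) for row in X]
forest = []
for _ in range(25):
    rows = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
    forest.append((train_stump(X, y, rows), set(rows)))
importances = [oob_permutation_importance(forest, X, y, f, rng) for f in range(2)]
print("OOB permutation importances:", importances)
```

Shuffling the informative feature should reduce OOB accuracy substantially, while shuffling the noise feature should leave it essentially unchanged, because the trained stumps have no reason to split on it.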