Datasets: Dividing the original dataset

All good software engineering projects devote considerable energy to testing their apps. Similarly, we strongly recommend testing your ML model to determine the correctness of its predictions.

Training, validation, and test sets

You should test a model against a different set of examples than those used to train the model. As you'll learn a little later, testing on different examples is stronger proof of your model's fitness than testing on the same set of examples. Where do you get those different examples? Traditionally in machine learning, you get those different examples by splitting the original dataset. You might assume, therefore, that you should split the original dataset into two subsets:

Figure 8. A horizontal bar divided into two pieces: ~80% of which
            is the training set and ~20% is the test set.
Figure 8. Not an optimal split.
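To make that naive split concrete, here is a minimal sketch using scikit-learn's train_test_split on synthetic placeholder data; the 80/20 ratio, array names, and random seed are illustrative assumptions, and (as the caption warns) this two-way split by itself isn't optimal.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data: 1,000 examples with 5 features and a binary label.
    rng = np.random.default_rng(seed=42)
    X = rng.normal(size=(1000, 5))
    y = rng.integers(0, 2, size=1000)

    # ~80% training set, ~20% test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42)

    print(len(X_train), len(X_test))  # 800 200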

 

Exercise: Check your intuition

Suppose you train on the training set and evaluate on the test set over multiple rounds. In each round, you use the test set results to guide how to update hyperparameters and the feature set. Can you see anything wrong with this approach? Pick only one answer.
  • This approach is computationally inefficient. Don't change hyperparameters or feature sets after each round of testing.
  • Doing many rounds of this procedure might cause the model to implicitly fit the peculiarities of the test set.
  • This approach is fine. After all, you're training on the training set and evaluating on a separate test set.

Dividing the dataset into two sets is a decent idea, but a better approach is to divide the dataset into three subsets. In addition to the training set and the test set, the third subset is:

  • A validation set performs the initial testing on the model as it is being trained.
Figure 9. A horizontal bar divided into three pieces: 70% of which
            is the training set, 15% the validation set, and 15%
            the test set
Figure 9. A much better split.

Use the validation set to evaluate results from the training set. After repeated use of the validation set suggests that your model is making good predictions, use the test set to double-check your model.
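As a concrete illustration, the following sketch produces the 70/15/15 split in Figure 9 by calling scikit-learn's train_test_split twice; the array names, sizes, and seed are just assumptions for the example.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data: 1,000 examples with 5 features and a binary label.
    rng = np.random.default_rng(seed=42)
    X = rng.normal(size=(1000, 5))
    y = rng.integers(0, 2, size=1000)

    # First carve out the 70% training set...
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.30, random_state=42)

    # ...then split the remaining 30% evenly into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # 700 150 150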

The following figure suggests this workflow. In the figure, "Tweak model" means adjusting anything about the model, from changing the learning rate, to adding or removing features, to designing a completely new model from scratch.

Figure 10. A workflow diagram consisting of the following stages:
            1. Train model on the training set.
            2. Evaluate model on the validation set.
            3. Tweak model according to results on the validation set.
            4. Iterate on 1, 2, and 3, ultimately picking the model that does
               best on the validation set.
            5. Confirm the results on the test set.
Figure 10. A good workflow for development and testing.

The workflow shown in Figure 10 is optimal, but even with that workflow, test sets and validation sets still "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confident you can be that the model will make good predictions on new data. For this reason, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.
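To make the workflow concrete, here is a hedged sketch under some simplifying assumptions: scikit-learn's LogisticRegression stands in for "the model," its regularization strength C stands in for whatever you might tweak, and the data is synthetic. Note that the test set is used exactly once, at the very end.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic placeholder dataset, split 70/15/15 as in Figure 9.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, random_state=42)

    best_model, best_val_accuracy = None, -1.0
    for C in [0.01, 0.1, 1.0, 10.0]:              # "Tweak model": candidate hyperparameters.
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_train, y_train)               # 1. Train on the training set.
        val_accuracy = model.score(X_val, y_val)  # 2. Evaluate on the validation set.
        if val_accuracy > best_val_accuracy:      # 3-4. Keep the model that does best on validation.
            best_model, best_val_accuracy = model, val_accuracy

    print("Chosen model's validation accuracy:", best_val_accuracy)
    print("One final check on the test set:", best_model.score(X_test, y_test))  # 5. Confirm.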

Exercise: Check your intuition

You shuffled all the examples in the dataset and divided the shuffled examples into training, validation, and test sets. However, the loss value on your test set is so staggeringly low that you suspect a mistake. What might have gone wrong?
  • Training and testing are nondeterministic. Sometimes, by chance, your test loss is incredibly low. Rerun the test to confirm the result.
  • By chance, the test set just happened to contain examples that the model performed well on.
  • Many of the examples in the test set are duplicates of examples in the training set.

Additional problems with test sets

As the previous question illustrates, duplicate examples can affect model evaluation. After splitting a dataset into training, validation, and test sets, delete any examples in the validation set or test set that are duplicates of examples in the training set. The only fair test of a model is against new examples, not duplicates.
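One way to enforce that rule, sketched below with pandas and hypothetical feature columns, is to drop any test-set rows whose feature values already appear in the training set (the same filter applies to the validation set):

    import pandas as pd

    # Hypothetical tiny example; the column names are placeholders.
    train = pd.DataFrame({"feature_1": [1, 2, 3], "feature_2": ["a", "b", "c"]})
    test = pd.DataFrame({"feature_1": [3, 4], "feature_2": ["c", "d"]})

    feature_cols = ["feature_1", "feature_2"]

    # Keep only test examples whose features don't already appear in the training set.
    merged = test.merge(train[feature_cols].drop_duplicates(),
                        on=feature_cols, how="left", indicator=True)
    test_deduped = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

    print(len(test), "->", len(test_deduped))  # 2 -> 1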

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. Suppose you divide the data into training and test sets, with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. You'd probably expect a lower precision on the test set, so you take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set. The problem is that you neglected to scrub duplicate entries for the same spam email from your input database before splitting the data. You've inadvertently trained on some of your test data.
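In that scenario, the fix is to scrub exact duplicates before splitting. A tiny sketch with pandas, using hypothetical columns matching the features above (subject line, body, sender):

    import pandas as pd

    # Hypothetical raw email table with a repeated spam message.
    emails = pd.DataFrame({
        "subject": ["win a prize", "win a prize", "meeting notes"],
        "body":    ["click here",  "click here",  "see attached"],
        "sender":  ["spam@x.com",  "spam@x.com",  "boss@y.com"],
        "is_spam": [1, 1, 0],
    })

    # Drop duplicate entries for the same email *before* splitting into sets.
    emails = emails.drop_duplicates(subset=["subject", "body", "sender"])
    print(len(emails))  # 2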

In summary, a good test set or validation set meets all of the following criteria:

  • Large enough to yield statistically significant testing results.
  • Representative of the dataset as a whole. In other words, don't pick a test set with different characteristics than the training set.
  • Representative of the real-world data that the model will encounter as part of its business purpose.
  • Zero examples duplicated in the training set.

Exercises: Check your understanding

Given a single dataset with a fixed number of examples, which of the following statements is true?
  • Every example used in testing the model is one less example used in training the model.
  • The number of examples in the test set must be greater than the number of examples in the validation set or training set.
  • The number of examples in the test set must be greater than the number of examples in the validation set.

Suppose your test set contains enough examples to perform a statistically significant test. Furthermore, testing against the test set yields low loss. However, the model performed poorly in the real world. What should you do?
  • Determine how the original dataset differs from real-life data.
  • Retest on the same test set. The test results might have been an anomaly.

How many examples should the test set contain?
  • At least 15% of the original dataset.
  • Enough examples to yield a statistically significant test.