In this lesson, you'll debug a real-world ML problem* related to 18th-century literature.
Real World Example: 18th Century Literature
- Professor of 18th Century Literature wanted to predict the political affiliation of authors based only on the "mind metaphors" the author used.

- Team of researchers made a big labeled data set with many authors' works, sentence by sentence, and split into train/validation/test sets.

- Trained model did nearly perfectly on test data, but researchers felt results were suspiciously accurate. What might have gone wrong?

Why do you think test accuracy was suspiciously high? See if you can figure out the problem before reading on.
- Data Split A: Researchers put some of each author's examples in the training set, some in the validation set, and some in the test set.
- Data Split B: Researchers put all of each author's examples in a single set. For example, all of Richardson's examples might be in the training set, while all of Swift's examples might be in the validation set.
- Results: The model trained on Data Split A had much higher accuracy than the model trained on Data Split B. A likely explanation is that, with Data Split A, the model could learn each author's individual quirks of language from the training set and then recognize the same authors in the test set, rather than relying on the mind metaphors alone.
The moral: carefully consider how you split examples.
Know what the data represents.
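The difference between the two data splits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data set, the list of authors beyond Richardson and Swift, the 80/10/10 fractions, and the helper names `split_by_example`, `split_by_author`, and `shared_authors` are all hypothetical and do not come from the researchers' work.

```python
import random

# Hypothetical toy data: (author, sentence_id) pairs standing in for labeled sentences.
AUTHORS = ["Richardson", "Swift", "Defoe", "Fielding", "Sterne",
           "Burney", "Haywood", "Pope", "Johnson", "Smollett"]
examples = [(author, i) for author in AUTHORS for i in range(20)]


def split_by_example(examples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Data Split A: shuffle individual examples, so one author can land in every set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def split_by_author(examples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Data Split B: shuffle authors, then put all of an author's examples in one set."""
    rng = random.Random(seed)
    authors = sorted({author for author, _ in examples})
    rng.shuffle(authors)
    n_train = int(fractions[0] * len(authors))
    n_val = int(fractions[1] * len(authors))
    train_authors = set(authors[:n_train])
    val_authors = set(authors[n_train:n_train + n_val])
    train, val, test = [], [], []
    for example in examples:
        author = example[0]
        if author in train_authors:
            train.append(example)
        elif author in val_authors:
            val.append(example)
        else:
            test.append(example)
    return train, val, test


def shared_authors(set_a, set_b):
    """Return the authors that appear in both sets of examples."""
    return {author for author, _ in set_a} & {author for author, _ in set_b}


train_a, val_a, test_a = split_by_example(examples)
train_b, val_b, test_b = split_by_author(examples)

# Split A: test authors were also seen during training, so the test set is not independent.
print(len(shared_authors(train_a, test_a)))
# Split B: no author appears in both training and test sets.
print(shared_authors(train_b, test_b))  # → set()
```

Checking for shared authors (or whatever grouping the data naturally has) between sets is a cheap sanity test to run before trusting a test-set score.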
* We based this module very loosely (making some modifications along the way) on "Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities" by Sculley and Pasanek.