The previous unit introduced a model that miscategorized many of the trees
in the test set. That model contains a lot of complex shapes. Would a simpler
model handle new data better? Suppose you replace the complex model with
a ridiculously simple model: a straight line.
The simple model generalizes better than the complex model on new data; that is,
the simple model makes better predictions on the test set than the complex model does.
Simplicity has been beating complexity for a long time. In fact, the
preference for simplicity dates back to ancient Greece. Centuries later,
a fourteenth-century friar named William of Occam formalized the preference
for simplicity in a philosophy known as Occam's
razor. This philosophy
remains an essential underlying principle of many sciences, including
machine learning.
Exercises: Check your understanding
You are developing a physics equation. Which of the following formulas
conforms more closely to Occam's Razor?
A formula with three variables.
Three variables is more Occam-friendly than twelve variables.
A formula with twelve variables.
Twelve variables seems overly complicated, doesn't it?
The two most famous physics formulas of all time (F = ma and
E = mc²) each involve only three variables.
You're on a brand-new machine learning project, about to select your
first features. How many features should you pick?
Pick 1–3 features that seem to have strong predictive power.
It's best for your data collection pipeline to start with only one or
two features. This will help you confirm that the ML model works as intended.
Also, when you build a baseline from a couple of features,
you'll feel like you're making progress!
Pick 4–6 features that seem to have strong predictive power.
You might eventually use this many features, but it's still better to
start with fewer. Fewer features usually means fewer unnecessary
complications.
Pick as many features as you can, so you can start observing which
features have the strongest predictive power.
Start smaller. Every new feature adds a new dimension to your training
dataset. When the dimensionality increases, the volume of the space
increases so fast that the available training data become sparse. The
sparser your data, the harder it is for a model to learn the relationship
between the features that actually matter and the label. This phenomenon
is called "the curse of dimensionality."
Regularization
Machine learning models must simultaneously meet two conflicting goals:
Fit data well.
Fit data as simply as possible.
One approach to keeping a model simple is to penalize complex models; that is,
to force the model to become simpler during training. Penalizing complex
models is one form of regularization.
Loss and complexity
So far, this course has suggested that the only goal when training was to
minimize loss; that is:
$$\text{minimize(loss)}$$
As you've seen, models focused solely on minimizing loss tend to overfit.
A better training optimization algorithm minimizes some combination of
loss and complexity:
$$\text{minimize(loss + complexity)}$$
Unfortunately, loss and complexity are typically inversely related. As
complexity increases, loss decreases. As complexity decreases, loss increases.
You should find a reasonable middle ground where the model makes good
predictions on both the training data and real-world data.
That is, your model should find a reasonable compromise
between loss and complexity.
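As a concrete illustration, the sketch below expresses the regularized objective as
ordinary code: the quantity being minimized is the training loss plus a complexity
term scaled by a coefficient that controls the compromise. This is a minimal sketch
assuming a linear model and a squared-weights complexity measure; names such as
`regularization_rate` are illustrative, not from the course.

```python
def predict(weights, bias, features):
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))

def squared_error_loss(weights, bias, examples):
    """Mean squared error over (features, label) pairs."""
    errors = [(predict(weights, bias, x) - y) ** 2 for x, y in examples]
    return sum(errors) / len(errors)

def complexity(weights):
    """One possible complexity measure: the sum of the squared weights."""
    return sum(w ** 2 for w in weights)

def training_objective(weights, bias, examples, regularization_rate):
    # minimize(loss + complexity): the optimizer minimizes this combined
    # value, so it can no longer reduce loss by making the model
    # arbitrarily complex without paying a penalty.
    return (squared_error_loss(weights, bias, examples)
            + regularization_rate * complexity(weights))
```

With `regularization_rate` set to 0, this reduces to plain loss minimization; larger
values push the optimizer toward smaller weights, trading a little training loss for
a simpler model.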
What is complexity?
You've already seen a few different ways of quantifying loss. How would
you quantify complexity? Start your exploration through the following exercise:
Exercise: Check your intuition
So far, we've been pretty vague about what complexity actually
is. Which of the following ideas do you think would be reasonable
complexity metrics?
Complexity is a function of the model's weights.
Yes, this is one way to measure some models' complexity.
This metric is called
L1 regularization.
Complexity is a function of the square of the model's weights.
Yes, you can measure some models' complexity this way. This metric
is called
L2 regularization.
Complexity is a function of the biases of all the features in the
model.
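As a rough illustration of the two weight-based metrics above (a sketch, not the
course's code), the L1 penalty sums the absolute values of the weights, while the
L2 penalty sums their squares:

```python
def l1_penalty(weights):
    """Complexity as a function of the weights: sum of absolute values."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Complexity as a function of the squared weights: sum of squares."""
    return sum(w ** 2 for w in weights)

weights = [0.2, -0.5, 1.5, 0.8, -0.3]
print(l1_penalty(weights))  # ≈ 3.3
print(l2_penalty(weights))  # ≈ 3.27
```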