This lesson focuses on the questions you should ask about your data
and model in production systems.
Is each feature helpful?
You should continuously monitor your model and remove features that contribute
little or nothing to its predictive ability. Even a low-value feature is a
risk: if the input data for that feature abruptly changes, your model's
behavior might also abruptly change in undesirable ways.
Also consider the following related question:
- Does the usefulness of the feature justify the cost of including it?
It is always tempting to add more features to the model. For example,
suppose you find a new feature whose addition makes your model's predictions
slightly better. Slightly better predictions certainly seem better than
slightly worse predictions; however, the extra feature adds to your
maintenance burden.
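One way to estimate how much a feature contributes is to shuffle its values and measure how much the model's error grows. The following is a minimal sketch of that idea (permutation importance), not a prescribed method. The names `permutation_importance` and `toy_predict`, and the toy data, are hypothetical and exist only for illustration.

```python
import random
from typing import Callable, List, Sequence


def mean_abs_error(predict: Callable[[Sequence[float]], float],
                   rows: List[List[float]],
                   labels: List[float]) -> float:
    """Average absolute difference between predictions and labels."""
    return sum(abs(predict(r) - y) for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict: Callable[[Sequence[float]], float],
                           rows: List[List[float]],
                           labels: List[float],
                           n_features: int,
                           seed: int = 0) -> List[float]:
    """For each feature, return the error increase after shuffling its column."""
    rng = random.Random(seed)
    baseline = mean_abs_error(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_abs_error(predict, shuffled, labels) - baseline)
    return importances


# Hypothetical model: relies on feature 0 and ignores feature 1 entirely.
def toy_predict(row: Sequence[float]) -> float:
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
labels = [3.0 * r[0] for r in rows]
scores = permutation_importance(toy_predict, rows, labels, n_features=2)
print(scores)  # the second score is 0.0 because the model never reads feature 1
```

A score near zero suggests the feature is a candidate for removal. Whether a small positive score justifies the maintenance cost remains a judgment call.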
Is your data source reliable?
Some questions to ask about the reliability of your input data:
- Is the signal always going to be available or is it coming from an
  unreliable source? For example:
  - Is the signal coming from a server that crashes under heavy load?
  - Is the signal coming from humans who go on vacation every August?
- Does the system that computes your model's input data ever change? If so:
  - How often?
  - How will you know when that system changes?
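One possible way to notice that an upstream system has changed is to compare each new batch against a reference batch. The sketch below checks two things: whether the field names or types differ, and whether the mean of one numeric field has moved by more than a few reference standard deviations. All names (`schema_fingerprint`, `mean_shift`, `upstream_changed`), the `latency` field, and the threshold of 3.0 are assumptions chosen for illustration. The sketch also assumes every batch is non-empty.

```python
import hashlib
import statistics


def schema_fingerprint(records):
    """Hash the sorted (field name, type name) pairs of the first record."""
    first = records[0]
    description = ",".join(
        f"{key}:{type(value).__name__}" for key, value in sorted(first.items())
    )
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def mean_shift(reference, current):
    """Distance between the two means, in reference standard deviations."""
    ref_mean = statistics.fmean(reference)
    cur_mean = statistics.fmean(current)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        return 0.0 if cur_mean == ref_mean else float("inf")
    return abs(cur_mean - ref_mean) / ref_std


def upstream_changed(ref_records, new_records, field, max_shift=3.0):
    """Return True if the schema differs or the field's mean moved too far."""
    if schema_fingerprint(ref_records) != schema_fingerprint(new_records):
        return True
    ref_values = [r[field] for r in ref_records]
    new_values = [r[field] for r in new_records]
    return mean_shift(ref_values, new_values) > max_shift


# Hypothetical batches: a reference, a similar batch, a batch whose units
# silently changed (milliseconds to seconds), and a batch with a renamed field.
reference = [{"latency": 100.0 + i} for i in range(10)]
similar = [{"latency": 101.0 + i} for i in range(10)]
rescaled = [{"latency": (100.0 + i) / 1000.0} for i in range(10)]
renamed = [{"latency_ms": 100.0 + i} for i in range(10)]

print(upstream_changed(reference, similar, "latency"))
print(upstream_changed(reference, rescaled, "latency"))
print(upstream_changed(reference, renamed, "latency"))
```

A check like this only catches the kinds of change it was written to look for. It complements, rather than replaces, communication with the team that owns the upstream system.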
Consider creating your own copy of the data you receive from the
upstream process. Then, only advance to the next version of the upstream
data when you are certain that it is safe to do so.
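The idea of keeping your own copy and advancing deliberately can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the class `PinnedUpstream`, its methods, and the `no_missing_prices` check are hypothetical, and a real system would persist snapshots rather than hold them in memory. The sketch also assumes that the first version ingested is pinned automatically.

```python
import copy


class PinnedUpstream:
    """Keeps private copies of upstream data and serves only the pinned one."""

    def __init__(self):
        self._snapshots = {}
        self._pinned = None

    def ingest(self, version, records):
        """Store a deep copy so later upstream mutations cannot affect it."""
        self._snapshots[version] = copy.deepcopy(records)
        if self._pinned is None:
            self._pinned = version  # assumption: the first version is trusted

    def advance(self, version, is_safe):
        """Pin `version` only if the caller's safety check passes."""
        if version not in self._snapshots:
            raise KeyError(f"unknown version: {version}")
        if is_safe(self._snapshots[version]):
            self._pinned = version
            return True
        return False

    def current(self):
        """Return the pinned version label and its records."""
        if self._pinned is None:
            raise LookupError("no version has been ingested yet")
        return self._pinned, self._snapshots[self._pinned]


# Hypothetical safety check: reject any snapshot with a missing price.
def no_missing_prices(records):
    return all(r["price"] is not None for r in records)


store = PinnedUpstream()
upstream_v1 = [{"price": 10.0}, {"price": 12.0}]
store.ingest("v1", upstream_v1)
upstream_v1[0]["price"] = None  # upstream mutates its data after we copied it

store.ingest("v2", [{"price": None}, {"price": 12.5}])
advanced_to_v2 = store.advance("v2", no_missing_prices)

store.ingest("v3", [{"price": 10.5}, {"price": 12.5}])
advanced_to_v3 = store.advance("v3", no_missing_prices)

print(advanced_to_v2, advanced_to_v3, store.current()[0])
```

Here the model keeps reading `v1` while `v2` fails the check, and moves forward only when `v3` passes it.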
Is your model part of a feedback loop?
Sometimes a model can affect its own training data. For example, the
results from some models, in turn, become (directly or indirectly) input
features to that same model.
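To make this concrete, the following toy simulation shows a hypothetical recommender whose only feature is purchase count. Each round, it recommends the most-purchased title, the recommendation produces extra purchases, and those purchases feed back into the next round's input. The function name `simulate_feedback`, the titles, and the sales numbers are all invented for illustration.

```python
def simulate_feedback(purchases, rounds, organic_sales=1, recommended_boost=5):
    """Simulate a popularity-based recommender that feeds on its own output.

    Every title gains `organic_sales` purchases per round. The recommended
    title gains an additional `recommended_boost` purchases.
    """
    counts = dict(purchases)  # copy so the caller's data is left untouched
    history = []
    for _ in range(rounds):
        # The "model": recommend whichever title has the most purchases so far.
        recommended = max(counts, key=counts.get)
        for title in counts:
            counts[title] += organic_sales
        # The model's output changes the data it will see in the next round.
        counts[recommended] += recommended_boost
        history.append(recommended)
    return counts, history


initial = {"book_a": 11, "book_b": 10, "book_c": 10}
final_counts, history = simulate_feedback(initial, rounds=10)
print(history)
print(final_counts)  # {'book_a': 71, 'book_b': 20, 'book_c': 20}
```

A one-purchase lead at the start grows to a 51-purchase lead after ten rounds, even though organic demand for all three titles is identical. The gap comes entirely from the model reinforcing its own earlier output.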
Sometimes a model can affect another model. For example, consider two
models for predicting stock prices:
- Model A, which is a bad predictive model.
- Model B.
Since Model A is buggy, it mistakenly decides to buy stock in Stock X.
Those purchases drive up the price of Stock X. Model B uses the price
of Stock X as an input feature, so Model B can come to some false
conclusions about the value of Stock X. Model B could, therefore,
buy or sell shares of Stock X based on the buggy behavior of Model A.
Model B's behavior, in turn, can affect Model A, possibly triggering a
tulip mania or a slide in the price of Stock X.
Exercise: Check your understanding
Which three of the following models are susceptible to
a feedback loop?
- A traffic-forecasting model that predicts congestion at highway exits
  near the beach, using beach crowd size as one of its features.

  Some beachgoers are likely to base their plans on the traffic
  forecast. If there is a large beach crowd and traffic is forecast to be
  heavy, many people may make alternative plans. This may depress beach
  turnout, resulting in a lighter traffic forecast, which then may
  increase attendance, and the cycle repeats.

- A book-recommendation model that suggests novels its users may like
  based on their popularity (i.e., the number of times the books have been
  purchased).

  Book recommendations are likely to drive purchases, and these
  additional sales will be fed back into the model as input,
  making it more likely to recommend these same books in the
  future.

- A university-ranking model that rates schools in part by their
  selectivity (the percentage of applicants who were admitted).

  The model's rankings may drive additional interest to top-rated
  schools, increasing the number of applications they receive. If these
  schools continue to admit the same number of students, selectivity will
  increase (the percentage of students admitted will go down). This
  will boost these schools' rankings, which will further increase
  prospective student interest, and so on…

- An election-results model that forecasts the winner of a
  mayoral race by surveying 2% of voters after the polls have closed.

  If the model does not publish its forecast until after the polls have
  closed, it is not possible for its predictions to affect voter
  behavior.

- A housing-value model that predicts house prices, using
  size (area in square meters), number of bedrooms, and geographic location
  as features.

  It is not possible to quickly change a house's location,
  size, or number of bedrooms in response to price forecasts,
  making a feedback loop unlikely. However, there is potentially
  a correlation between size and number of bedrooms (larger homes
  are likely to have more rooms) that may need to be teased apart.

- A face-attributes model that detects whether a person is smiling
  in a photo, which is regularly trained on a database of stock photography
  that is automatically updated monthly.

  There is no feedback loop here, as model predictions don't have
  any impact on the photo database. However, versioning of the input
  data is a concern here, as these monthly updates could potentially
  have unforeseen effects on the model.