For each of the following questions, choose an answer, then check it against the explanation given under each option:
Let's say you're working on an advertising-related machine learning
model and want to predict advertiser spending for January. You
have limits on the amount of data you can store on disk, so you must
use only a subset of the available data. One option is to keep all of the
most recent data, which is from the preceding December. A colleague suggests
sampling data from throughout the past year instead. Which approach might be
better, and why?
Data from the previous month (December)
While this data is more recent, it may be influenced by seasonal
effects on advertiser spending before the December holidays, so it
may not be representative of January.
Data sampled throughout the year
While some of this data is older, the sample as a whole is less likely
to be influenced by seasonal effects on advertiser spending before the
December holidays.
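As an illustration of sampling throughout the year under a storage limit, here is a minimal Python sketch. It assumes each record carries a `month` field and that an equal share of the budget per month is acceptable. The function name `sample_across_months`, the record layout, and the synthetic spending values are all hypothetical and not part of the original exercise.

```python
import random
from collections import defaultdict


def sample_across_months(records, budget, seed=0):
    """Pick up to `budget` records, split evenly across the months present."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the records by calendar month.
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["month"]].append(rec)
    if not by_month:
        return []

    # Give every month the same share of the storage budget.
    per_month = budget // len(by_month)
    sample = []
    for month in sorted(by_month):
        rows = by_month[month]
        sample.extend(rng.sample(rows, min(per_month, len(rows))))
    return sample


# Hypothetical data: 100 spending records for each of the 12 months.
records = [
    {"month": m, "spend": 100.0 + m}
    for m in range(1, 13)
    for _ in range(100)
]

# Keep only 120 records in total, i.e. 10 per month.
sample = sample_across_months(records, budget=120)
```

An equal share per month is only one possible choice. Depending on the problem, you might instead weight recent months more heavily while still keeping some coverage of the rest of the year.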
You want to show videos that users want to watch. You use videos
they've viewed on YouTube as a label. Is this label direct
or derived?
Derived
This label is derived because it's not the exact prediction you
want to make. Perhaps the user opened the video but closed it shortly
afterwards. This event would count as a view even though the user
didn't watch the video. In some cases, a heuristic like this might
be your only option, but be aware of your label type (direct or
derived) and how it limits your predictions.
Direct
While using views as a label might result in an accurate prediction
much of the time, a view is not the exact prediction you want to make.
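To make the gap between "viewed" and "watched" concrete, the following Python sketch compares two labels for the same events. It assumes, hypothetically, that each view event also records how many seconds were played. The field names `video_seconds` and `watched_seconds`, the function `watched_label`, and the 50% threshold are illustrative assumptions, not something the exercise specifies.

```python
def watched_label(view_event, min_fraction=0.5):
    """Return 1 if at least `min_fraction` of the video was played, else 0.

    The threshold is an arbitrary illustrative choice. The result is still
    a heuristic, not a direct measurement of what the user wanted to watch.
    """
    duration = view_event["video_seconds"]
    if duration <= 0:
        return 0
    fraction_played = view_event["watched_seconds"] / duration
    return int(fraction_played >= min_fraction)


# Hypothetical view events for the same 300-second video.
events = [
    {"video_seconds": 300, "watched_seconds": 5},    # opened, then closed
    {"video_seconds": 300, "watched_seconds": 240},  # played most of it
]

# Label used in the question: every view counts as a positive example.
view_labels = [1 for _ in events]

# Alternative heuristic label based on how much of the video was played.
watch_labels = [watched_label(e) for e in events]
```

Under these assumptions, the first event is a positive example under the view-based label but a negative one under the watch-based label, which is the kind of mismatch the explanation above describes.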