Datasets: Labels

This section focuses on labels, the values your model learns to predict.

Direct versus proxy labels

Consider two different kinds of labels:

  • Direct labels, which are identical to the prediction your model is trying to make. That is, the prediction target is present as a column in your dataset. For example, a column named bicycle owner would be a direct label for a binary classification model that predicts whether or not a person owns a bicycle.
  • Proxy labels, which are similar, but not identical, to the prediction your model is trying to make. For example, a person subscribing to Bicycle Bizarre magazine probably, but not definitely, owns a bicycle. (The sketch after this list makes the distinction concrete.)
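
To make the distinction concrete, here is a minimal pandas sketch; the dataset, the column names (bicycle_owner, subscribes_to_bicycle_bizarre), and the values are all hypothetical:

```python
import pandas as pd

# Hypothetical dataset: column names and values are illustrative only.
df = pd.DataFrame({
    "age": [34, 51, 27],
    # Direct label: exactly the prediction target (does this person own a bicycle?).
    "bicycle_owner": [1, 0, 1],
    # Proxy label: correlated with the target, but not identical to it.
    "subscribes_to_bicycle_bizarre": [1, 0, 0],
})

# With a direct label available, train against it.
direct_labels = df["bicycle_owner"]

# Without one, fall back on the proxy and accept some label noise
# (subscribers who don't own bicycles, owners who don't subscribe).
proxy_labels = df["subscribes_to_bicycle_bizarre"]
```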

Direct labels are generally better than proxy labels. If your dataset provides a possible direct label, you should probably use it. Often, though, direct labels aren't available.

Proxy labels are always a compromise—an imperfect approximation of a direct label. However, some proxy labels are close enough approximations to be useful. Models that use proxy labels are only as useful as the connection between the proxy label and the prediction.

Recall that every label must be represented as a floating-point number, because machine learning is fundamentally just a huge amalgam of mathematical operations. Sometimes, a direct label exists but can't easily be represented as a floating-point number. In this case, use a proxy label that can be.
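
For instance, a categorical direct label usually can be converted to floating point with a simple mapping, as in this minimal pandas sketch (the owns_bicycle column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"owns_bicycle": ["yes", "no", "yes"]})  # hypothetical column

# A categorical direct label maps cleanly onto floating-point values.
df["label"] = df["owns_bicycle"].map({"yes": 1.0, "no": 0.0})
print(df)
```

A direct label that resists such a mapping (free-form text, for example) is exactly the case where a proxy label becomes attractive.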

Exercise: Check your understanding

Your company wants to do the following:

Mail coupons ("Trade in your old bicycle for 15% off a new bicycle") to bicycle owners.

So, your model must do the following:

Predict which people own a bicycle.

Unfortunately, the dataset doesn't contain a column named bike owner. However, the dataset does contain a column named recently bought a bicycle.

Would recently bought a bicycle be a good proxy label or a poor proxy label for this model?
  • Good proxy label (the correct answer): The column recently bought a bicycle is a relatively good proxy label, since most of the people who buy bicycles now own bicycles. Nevertheless, like all proxy labels, even very good ones, recently bought a bicycle is imperfect. After all, the person buying an item isn't always the person using (or owning) that item; for example, people sometimes buy bicycles as gifts.
  • Poor proxy label (incorrect): Like all proxy labels, recently bought a bicycle is imperfect (some bicycles are bought as gifts and given to others). However, it is still a relatively good indicator that someone owns a bicycle.

Human-generated data

Some data is human-generated; that is, one or more humans examine some information and provide a value, usually for the label. For example, one or more meteorologists could examine pictures of the sky and identify cloud types.

Alternatively, some data is automatically generated. That is, software (possibly another machine learning model) determines the value. For example, a machine learning model could examine sky pictures and automatically identify cloud types.
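
A minimal sketch of such an automatic-labeling pipeline, assuming a previously trained classifier; classify_cloud_type is a hypothetical stand-in, not a real API:

```python
def classify_cloud_type(image) -> str:
    """Hypothetical stand-in for a previously trained cloud-type classifier."""
    # Real code would run model inference on the image here.
    return "cumulus"

def auto_label(images):
    # Each model prediction becomes a machine-generated label, so any
    # model error becomes label noise in the resulting dataset.
    return [(image, classify_cloud_type(image)) for image in images]
```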

The rest of this section explores the advantages and disadvantages of human-generated data.

Advantages

  • Human raters can perform a wide range of tasks that even sophisticated machine learning models may find difficult.
  • The process forces the owner of the dataset to develop clear and consistent criteria.

Disadvantages

  • You typically pay human raters, so human-generated data can be expensive.
  • To err is human, so multiple human raters might have to evaluate the same data.

Think through these questions to determine your needs:

  • How skilled must your raters be? (For example, must the raters know a specific language? Do you need linguists for dialogue or NLP applications?)
  • How many labeled examples do you need? How soon do you need them?
  • What's your budget?

Always double-check your human raters. For example, label 1000 examples yourself and see how well your labels match other raters' labels. If discrepancies surface, don't assume your ratings are the correct ones, especially if a value judgment is involved. If human raters have introduced errors, consider improving the rater instructions and trying again.
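
One common way to quantify rater agreement is Cohen's kappa; here is a minimal sketch using scikit-learn's cohen_kappa_score (the label lists are made up):

```python
from sklearn.metrics import cohen_kappa_score

# Labels that you and another rater assigned to the same eight examples (made up).
your_labels = [1, 0, 1, 1, 0, 1, 0, 0]
rater_labels = [1, 0, 1, 0, 0, 1, 1, 0]

# Kappa corrects raw agreement for the agreement expected by chance:
# 1.0 means perfect agreement; around 0.0 means chance-level agreement.
print(cohen_kappa_score(your_labels, rater_labels))
```

Low agreement across many raters usually points to ambiguous instructions or genuinely subjective examples rather than a single careless rater.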