Framing an ML problem

After verifying that your problem is best solved using ML and that you have access to the data you'll need, you're ready to frame your problem in ML terms. You frame a problem in ML terms by completing the following tasks:

  • Define the ideal outcome and the model's goal.
  • Identify the model's output.
  • Define success metrics.

Define the ideal outcome and the model's goal

Independent of the ML model, what's the ideal outcome? In other words, what is the exact task you want your product or feature to perform? This is the same statement you previously defined in the State the goal section.

Connect the model's goal to the ideal outcome by explicitly defining what you want the model to do. The following table states the ideal outcomes and the model's goal for hypothetical apps:

App | Ideal outcome | Model's goal
Weather app | Calculate precipitation in six-hour increments for a geographic region. | Predict six-hour precipitation amounts for specific geographic regions.
Video app | Recommend useful videos. | Predict whether a user will click on a video.
Mail app | Detect spam. | Predict whether or not an email is spam.
Map app | Calculate travel time. | Predict how long it will take to travel between two points.
Banking app | Identify fraudulent transactions. | Predict if a transaction was made by the card holder.
Dining app | Identify cuisine by a restaurant's menu. | Predict the type of restaurant.

Choose the right kind of model

Your choice of model type depends upon the specific context and constraints of your problem.

A classification model predicts what category the input data belongs to, for example, whether an input should be classified as A, B, or C.

Figure 1. A classification model making predictions.

Based on the model's prediction, your app might make a decision. For example, if the prediction is category A, then do X; if the prediction is category B, then do Y; if the prediction is category C, then do Z. In some cases, the prediction is the app's output.

Figure 2. A classification model's output being used in the product code to make a decision.
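The category-to-action mapping described above can be sketched as a small dispatch table. This is a minimal illustration only: the function names `do_x`, `do_y`, `do_z`, and `decide` are hypothetical placeholders for whatever your product code does, and the category would come from your own model.

```python
from typing import Callable, Dict


# Hypothetical actions; X, Y, and Z mirror the placeholders in the text.
def do_x() -> str:
    return "X"


def do_y() -> str:
    return "Y"


def do_z() -> str:
    return "Z"


# Product code owns the mapping from predicted category to action.
ACTIONS: Dict[str, Callable[[], str]] = {"A": do_x, "B": do_y, "C": do_z}


def decide(predicted_category: str) -> str:
    """Run the action associated with the model's predicted category."""
    if predicted_category not in ACTIONS:
        raise ValueError(f"unexpected category: {predicted_category!r}")
    return ACTIONS[predicted_category]()
```

Keeping the mapping in one table makes it easy to see which model outputs the app can handle, and an unexpected category fails loudly instead of being silently ignored.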

A regression model predicts where to place the input data on a number line.

Figure 3. A regression model making a numeric prediction.

Based on the model's prediction, your app might make a decision. For example, if the prediction falls within range A, do X; if the prediction falls within range B, do Y; if the prediction falls within range C, do Z. In some cases, the prediction is the app's output.

Figure 4. A regression model's output being used in the product code to make a decision.

Consider the following scenario:

You want to cache videos based on their predicted popularity. In other words, if your model predicts that a video will be popular, you want to quickly serve it to users. To do so, you'll use the more effective and expensive cache. For other videos, you'll use a different cache. Your caching criteria are the following:

  • If a video is predicted to get 50 or more views, you'll use the expensive cache.
  • If a video is predicted to get between 30 and 49 views, you'll use the cheap cache.
  • If a video is predicted to get fewer than 30 views, you won't cache the video.

You think a regression model is the right approach because you'll be predicting a numeric value: the number of views. However, when training the regression model, you realize that it produces the same loss for predictions of 28 and 32 on a video that has 30 views. In other words, although your app will have very different behavior if the prediction is 28 versus 32, the model considers both predictions equally good.

Figure 5. Training a regression model.
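The mismatch between the loss and the app's behavior can be checked with a few lines of arithmetic. The sketch below assumes squared error as the regression loss (the scenario doesn't name one) and treats a prediction of exactly 30 as "cheap cache"; the function names are illustrative.

```python
# Thresholds taken from the caching criteria in the scenario.
EXPENSIVE_CACHE_MIN_VIEWS = 50
CHEAP_CACHE_MIN_VIEWS = 30


def squared_error(prediction: float, actual: float) -> float:
    """Per-example squared error, assumed here as the regression loss."""
    return (prediction - actual) ** 2


def choose_cache(predicted_views: float) -> str:
    """Map a predicted view count to the app's caching action."""
    if predicted_views >= EXPENSIVE_CACHE_MIN_VIEWS:
        return "expensive_cache"
    if predicted_views >= CHEAP_CACHE_MIN_VIEWS:
        return "cheap_cache"
    return "no_cache"


actual_views = 30
for predicted_views in (28, 32):
    loss = squared_error(predicted_views, actual_views)
    print(predicted_views, loss, choose_cache(predicted_views))
```

Both predictions have a squared error of 4, yet one leads to no caching and the other to the cheap cache. Any loss that depends only on the distance from the true value behaves the same way.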

Regression models are unaware of product-defined thresholds. Therefore, if your app's behavior changes significantly because of small differences in a regression model's predictions, you should consider implementing a classification model instead.

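In this scenario, a classification model would produce the correct behavior because, for a video with 30 views, it would produce a higher loss for a prediction that lands in the wrong category (28 views, no cache) than for one that lands in the right category (32 views, cheap cache). In a sense, classification models produce thresholds by default.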
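To see why a classification loss separates the two cases, consider a minimal sketch that assumes cross-entropy as the loss and uses made-up probability values. The class names and numbers are illustrative only, not output from a real model.

```python
import math
from typing import Dict


def cross_entropy(predicted_probs: Dict[str, float], true_class: str) -> float:
    """Negative log of the probability assigned to the true class."""
    return -math.log(predicted_probs[true_class])


# A video with 30 views belongs in the "cheap_cache" category.
true_class = "cheap_cache"

# Hypothetical model outputs: one leans toward the wrong category,
# the other toward the right one.
leans_no_cache = {"no_cache": 0.7, "cheap_cache": 0.2, "expensive_cache": 0.1}
leans_cheap_cache = {"no_cache": 0.2, "cheap_cache": 0.7, "expensive_cache": 0.1}

loss_wrong = cross_entropy(leans_no_cache, true_class)
loss_right = cross_entropy(leans_cheap_cache, true_class)
print(loss_wrong > loss_right)
```

Because the label encodes the caching decision, the loss penalizes landing on the wrong side of the threshold, which is the behavior the app cares about.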

This scenario highlights two important points:

  • Predict the decision. When possible, predict the decision your app will take. In the video example, a classification model would predict the decision if the categories it classified videos into were "no cache," "cheap cache," and "expensive cache." Hiding your app's behavior from the model can cause your app to produce the wrong behavior.

  • Understand the problem's constraints. If your app takes different actions based on different thresholds, determine if those thresholds are fixed or dynamic.

    • Dynamic thresholds: If thresholds are dynamic, use a regression model and set the threshold limits in your app's code. This lets you easily update the thresholds while still having the model make reasonable predictions.
    • Fixed thresholds: If thresholds are fixed, use a classification model and label your datasets based on the threshold limits.

    In general, most cache provisioning is dynamic and the thresholds change over time. Therefore, because this is specifically a caching problem, a regression model is the best choice. However, for many problems, the thresholds will be fixed, making a classification model the best solution.
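For the fixed-threshold case, labeling the dataset based on the threshold limits might look like the following sketch. It assumes each training example carries an observed view count; the function names and label strings are hypothetical, and the default limits come from the caching criteria in the scenario.

```python
from typing import List


def label_by_views(view_count: int, cheap_min: int = 30, expensive_min: int = 50) -> str:
    """Convert an observed view count into a caching-decision label."""
    if view_count >= expensive_min:
        return "expensive_cache"
    if view_count >= cheap_min:
        return "cheap_cache"
    return "no_cache"


def label_dataset(view_counts: List[int]) -> List[str]:
    """Label every example so a classifier can learn the decision directly."""
    return [label_by_views(views) for views in view_counts]


labels = label_dataset([12, 30, 49, 50, 900])
print(labels)
```

Note the trade-off: once the limits are baked into the labels, changing a threshold means relabeling the data and retraining. With dynamic thresholds, the same comparison lives in the app's code and is applied to a regression model's numeric prediction instead.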

Identify the model's output

The model's output should accomplish the task defined in the ideal outcome. If you're using a regression model, the numeric prediction should provide the data needed to accomplish the ideal outcome; if you're using a classification model, the categorical prediction should provide the data needed to accomplish the ideal outcome.

There are several subtypes of classification and regression models. Use the corresponding flowcharts to identify which subtype you are using.

Classification flowchart

Figure 6. Diagram of a classification flowchart.

Regression flowchart

Figure 7. Diagram of a regression flowchart.

In the weather app, the ideal outcome is to tell users how much it will rain in the next six hours. We could use a regression model that predicts the label precipitation_amount.

Ideal outcome | Ideal label
Tell users how much it will rain in their area in the next six hours. | precipitation_amount

In the weather app example, the label directly addresses the ideal outcome. In some cases, a one-to-one relationship isn't apparent between the ideal outcome and the label. For example, in the video app, the ideal outcome is to recommend useful videos. However, there's no label in the dataset called useful_to_user.

Ideal outcome | Ideal label
Recommend useful videos. | ?

Therefore, we'll need to find a proxy label.

Proxy labels

Proxy labels substitute for labels that aren't in the dataset. Proxy labels are necessary when you can't directly measure what you want to predict. In the video app, we can't directly measure whether or not a user will find a video useful. It would be great if the dataset had a useful feature, and users marked all the videos that they found useful, but because the dataset doesn't, we'll need a proxy label that substitutes for usefulness.

A proxy label for usefulness might be whether or not the user will share or like the video.

Ideal outcome | Proxy label
Recommend useful videos. | shared OR liked
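As a rough sketch, the "shared OR liked" proxy label could be derived from logged interactions as follows. The `Interaction` record and its field names are assumptions made for illustration; your dataset's schema will differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    """Hypothetical log record of one user viewing one video."""
    user_id: str
    video_id: str
    shared: bool
    liked: bool


def proxy_useful(interaction: Interaction) -> int:
    """Proxy label for usefulness: 1 if the user shared or liked the video."""
    return int(interaction.shared or interaction.liked)


interactions: List[Interaction] = [
    Interaction("u1", "v1", shared=True, liked=False),
    Interaction("u1", "v2", shared=False, liked=False),
    Interaction("u2", "v1", shared=False, liked=True),
]
proxy_labels = [proxy_useful(i) for i in interactions]
print(proxy_labels)
```

A label of 0 here means only that the user neither shared nor liked the video, not that the video was useless to them.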

Be cautious with proxy labels because they don't directly measure what you want to predict. For example, the following table outlines issues with potential proxy labels for Recommend useful videos:

Proxy label | Issue
Predict whether the user will click the “like” button. | Most users never click “like.”
Predict whether a video will be popular. | Not personalized. Some users might not like popular videos.
Predict whether the user will share the video. | Some users don't share videos. Sometimes, people share videos because they don't like them.
Predict whether the user will click play. | Maximizes clickbait.
Predict how long they watch the video. | Favors long videos differentially over short videos.
Predict how many times the user will rewatch the video. | Favors "rewatchable" videos over video genres that aren't rewatchable.

No proxy label can be a perfect substitute for your ideal outcome. All will have potential problems. Pick the one that has the fewest problems for your use case.

Check Your Understanding

A company wants to use ML in their health and well-being app to help people feel better. Do you think they'll need to use proxy labels to accomplish their goals?
Yes, the company will need to find proxy labels. Categories like happiness and well-being can’t be measured directly. Instead, they need to be approximated with respect to some other feature, like hours spent exercising per week, or time spent engaged in hobbies or with friends.
No, the company won’t need to use proxy labels. Happiness and well-being can be directly measured.

Define the success metrics

Define the metrics you'll use to determine whether or not the ML implementation is successful. Success metrics define what you care about, like engagement or helping users take appropriate action, such as watching videos that they'll find useful. Success metrics differ from the model's evaluation metrics, like accuracy, precision, recall, or AUC.

For example, the weather app's success and failure metrics might be defined as the following:

Success | Users open the "Will it rain?" feature 50 percent more often than they did before.
Failure | Users open the "Will it rain?" feature no more often than before.

The video app metrics might be defined as the following:

Success | Users spend on average 20 percent more time on the site.
Failure | Users spend on average no more time on site than before.

We recommend defining ambitious success metrics. However, high ambitions can leave a gap between success and failure. For example, users spending on average 10 percent more time on the site than before is neither success nor failure. The undefined gap is not what's important.
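The video app's success and failure definitions, including the undefined gap between them, can be written down as a small check. This is a sketch under assumptions: the function name, parameter names, and the idea of comparing average minutes on site before and after launch are illustrative.

```python
def classify_outcome(
    baseline_minutes: float,
    current_minutes: float,
    success_lift: float = 0.20,
) -> str:
    """Compare average time on site against the success and failure definitions."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    lift = (current_minutes - baseline_minutes) / baseline_minutes
    if lift >= success_lift:
        return "success"
    if lift <= 0:
        return "failure"
    # Improved, but by less than the success threshold.
    return "undefined"


print(classify_outcome(100.0, 125.0))
print(classify_outcome(100.0, 110.0))
print(classify_outcome(100.0, 100.0))
```

Writing the thresholds down before launch makes the "undefined" result an expected, explicit outcome rather than something to argue about afterward.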

What's important is your model's capacity to move closer to, or exceed, the definition of success. For instance, when analyzing the model's performance, consider the following question: Would improving the model get you closer to your defined success criteria? For example, a model might have great evaluation metrics, but not move you closer to your success criteria, indicating that even with a perfect model, you would not meet the success criteria you defined. On the other hand, a model might have poor evaluation metrics, but get you closer to your success criteria, indicating that improving the model would get you closer to success.

The following are dimensions to consider when determining if the model is worth improving:

  • Not good enough, but continue. The model shouldn't be used in a production environment, but over time it might be significantly improved.

  • Good enough, and continue. The model could be used in a production environment, and it might be further improved.

  • Good enough, but can't be made better. The model is in a production environment, but it is probably as good as it can be.

  • Not good enough, and never will be. The model should not be used in a production environment, and probably no amount of training will get it there.

When deciding whether to improve the model, re-evaluate if the increase in resources, like engineering time and compute costs, justifies the predicted improvement of the model.

After defining the success and failure metrics, you need to determine how often you'll measure them. For instance, you could measure your success metrics six days, six weeks, or six months after implementing the system.

When analyzing failure metrics, try to determine why the system failed. For example, the model might be predicting which videos users will click, but the model might start recommending clickbait titles that cause user engagement to drop off. In the weather app example, the model might accurately predict when it will rain but for too large a geographic region.

Check Your Understanding

A fashion firm wants to sell more clothes. Someone suggests using ML to determine which clothes the firm should manufacture. They think they can train a model to determine which type of clothes are in fashion. After they train the model, they want to apply it to their catalog to decide which clothes to make.

How should they frame their problem in ML terms?

Ideal outcome: Determine which products to manufacture.

Model’s goal: Predict which articles of clothing are in fashion.

Model output: Binary classification, in_fashion, not_in_fashion

Success metrics: Sell seventy percent or more of the clothes made.

Ideal outcome: Determine how much fabric and supplies to order.

Model’s goal: Predict how much of each item to manufacture.

Model output: Binary classification, make, do_not_make

Success metrics: Sell seventy percent or more of the clothes made.

The ideal outcome isn't to determine how much fabric and supplies to order. It’s to determine if an item should be manufactured. Thus, the model’s goal addresses the wrong objective.