Measuring success

How will you know if your ML implementation was worth the work? When should you start celebrating: right after the model goes to production and serves its first prediction, or only after a quantitative business metric starts moving in the right direction?

Before starting a project, it's critical to define your success metrics and agree on deliverables. You'll need to define and track the following two types of metrics:

  • Business metrics. Metrics for quantifying business performance, for example, revenue, click-through rate, or number of users.

  • Model metrics. Metrics for quantifying model quality, for example, Root Mean Squared Error (RMSE), precision, or recall (see the sketch after this list).
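
A minimal sketch of how these model metrics are typically computed, using scikit-learn; the labels and predictions below are toy values for illustration only:

```python
from sklearn.metrics import mean_squared_error, precision_score, recall_score

# Regression: Root Mean Squared Error (RMSE).
y_true_reg = [3.0, 5.0, 2.5, 7.0]
y_pred_reg = [2.8, 5.4, 2.9, 6.1]
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5
print(f"RMSE: {rmse:.3f}")

# Binary classification: precision and recall (1 = positive class).
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 1, 1]
print(f"Precision: {precision_score(y_true_cls, y_pred_cls):.2f}")
print(f"Recall: {recall_score(y_true_cls, y_pred_cls):.2f}")
```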

Business metrics

Business metrics are the most important. They're the reason you're using ML: you want to improve the business.

Start with quantifiable product or business metrics. The metric should be as granular and focused as possible. The following are examples of focused, quantifiable business metrics:

  • Reduce a datacenter's monthly electric costs by 30 percent.
  • Increase revenue from product recommendations by 12 percent.
  • Increase click-through rate by 9 percent.
  • Increase customer sentiment from opt-in surveys by 20 percent.
  • Increase time on page by 4 percent.

Tracking business metrics

If you're not tracking the business metric you want to improve, start by implementing the infrastructure to do so. Setting a goal to increase click-through rate by 15% doesn't make sense if you aren't currently measuring click-through rate.
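
As a minimal sketch, click-through rate is just clicks divided by impressions. The event log below is a made-up stand-in for whatever your instrumentation actually emits:

```python
from collections import Counter

# Hypothetical event log: each record is an (event_type, item_id) pair.
# In practice these records would come from your instrumentation pipeline.
events = [
    ("impression", "rec_1"), ("impression", "rec_2"),
    ("click", "rec_1"), ("impression", "rec_1"), ("click", "rec_1"),
]

counts = Counter(event_type for event_type, _ in events)
ctr = counts["click"] / counts["impression"]  # 2 clicks / 3 impressions
print(f"Click-through rate: {ctr:.1%}")
```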

More importantly, make sure you're measuring the right metric for your problem. For instance, don't spend time writing instrumentation to track click-through rates if the more important metric might be revenue from recommendations.

As your project progresses, you'll learn whether the success metric you defined is actually realistic. In some cases, you might determine that the project isn't viable given the defined success metrics.

Model metrics

When should you put the model into production? When AUC reaches a certain value? When the model reaches a particular F1 score? The answer depends on the type of problem you're solving and the prediction quality you think you need to improve the business metric.

When determining what metrics to evaluate your model against, consider the following:

  • Determine a single metric to optimize. For example, classification models can be evaluated against a variety of metrics, such as AUC or AUC-PR. Choosing the best model is difficult when different metrics favor different models. Therefore, agree on a single metric to evaluate models against.

  • Determine acceptability goals to meet. Acceptability goals are different from model evaluation metrics. They refer to goals a model needs to meet to be considered acceptable for an intended use case. For example, an acceptability goal might be "incorrect output is less than 0.1%," or "recall for the top five categories is greater than 97%."

For example, suppose a binary classification model detects fraudulent transactions. Its optimization metric might be recall, while its acceptability goal might be a minimum precision. In other words, you'd prioritize recall (catching most fraudulent transactions) while requiring precision to stay at or above a particular value (so that flagged transactions are usually actually fraudulent).
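
As a sketch of how this plays out in practice, you can sweep the classification threshold and pick the one that maximizes recall while keeping precision at or above the acceptability goal. This assumes scikit-learn; the labels, scores, and 0.75 precision goal are made up for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and model scores; real values would come from your eval set.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.75, 0.6, 0.5])

MIN_PRECISION = 0.75  # hypothetical acceptability goal

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The returned precision/recall arrays have one more entry than thresholds;
# drop the final (precision=1, recall=0) point so the arrays line up.
precision, recall = precision[:-1], recall[:-1]

# Among thresholds that meet the precision goal, pick the best recall.
ok = precision >= MIN_PRECISION
if ok.any():
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    print(f"threshold={thresholds[best]:.2f}  "
          f"precision={precision[best]:.2f}  recall={recall[best]:.2f}")
else:
    print("No threshold meets the precision acceptability goal.")
```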

Connection between model metrics and business metrics

Fundamentally, you're trying to develop a model whose prediction quality is causally connected to your business metric. Great model metrics don't necessarily imply improved business metrics. Your team might develop a model with impressive metrics, but the model's predictions might fail to improve the business metric.

When you're satisfied with your model's prediction quality, try to determine how the model's metrics affect the business metric. Typically, teams deploy the model to a small percentage of users, for example 1 percent, and then monitor the business metric.
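
The rollout mechanics vary by platform, but a common pattern is deterministic hash-based bucketing, so each user consistently lands in or out of the experiment. A minimal sketch; the user IDs, salt, and 1% threshold are illustrative assumptions:

```python
import hashlib

ROLLOUT_PERCENT = 1.0  # serve the new model to ~1% of users

def in_experiment(user_id: str, salt: str = "churn-model-v1") -> bool:
    """Deterministically assign a user to the experiment bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # 10,000 buckets of 0.01% each
    return bucket < ROLLOUT_PERCENT * 100

# Users in the bucket get the new model's predictions; everyone else is
# the control group whose business metric you compare against.
for user in ["user_001", "user_002", "user_003"]:
    print(user, "experiment" if in_experiment(user) else "control")
```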

For instance, let's say your team develops a model to increase revenue by predicting customer churn. In theory, if you can predict which customers are likely to leave the platform, you can encourage them to stay.

Your team creates a model with 95% prediction quality and tests it on a small sample of users. However, revenue doesn't increase. Customer churn actually increases. Here are some possible explanations:

  • Predictions don't occur early enough to be actionable. The model can only predict churn within a seven-day window, which isn't enough lead time to offer customers incentives to stay on the platform.

  • Incomplete features. Other factors that contribute to customer churn might be missing from the training dataset.

  • Threshold isn't high enough. The model might need a prediction quality of 97% or higher to be useful.

This simple example highlights two points:

  • It's important to perform early user testing to prove (and understand) the connection between the model's metrics and the business metrics.
  • Great model metrics don't guarantee improved business metrics.

Generative AI

Evaluating generative AI output presents unique challenges. In many cases, such as open-ended or creative output, it's harder than evaluating traditional ML output.

LLMs can be evaluated against a variety of metrics; which ones to use depends on your use case.

Keep in mind

Don't confuse model success with business success. In other words, a model with outstanding metrics doesn't guarantee business success.

Many skilled engineers can build models with impressive metrics. Training a good-enough model typically isn't the problem. Rather, the problem is that the model doesn't improve the business metric. A misalignment between business metrics and model metrics can doom an ML project from the start.

Check Your Understanding

You have a clear business problem and a well-defined solution for using an LLM as a customer support agent. How should you think about measuring whether the solution is successful?
  • The percentage of support cases that require human involvement decreases from 72% to 50%.

    Correct. This is a quantifiable business metric that you can track.

  • The LLM's evaluation metrics are consistently high.

    Incorrect. Good model metrics don't guarantee improved business metrics.

  • Feedback from initial user testing is very favorable.

    Incorrect. Early user feedback is typically more qualitative than quantitative. You'll need to determine a quantifiable business metric for measuring success.