Last updated (UTC): 2025-07-27.

Key points:

- Defining clear business metrics and model metrics is crucial for measuring the success of your machine learning project.
- Business metrics, such as revenue or click-through rate, are the primary indicators of project success and should be quantifiable and focused.
- While model metrics like AUC or F1 score are important for evaluating model quality, they don't guarantee improvement in business metrics.
- It's essential to establish a connection between model performance and business outcomes through early user testing and careful monitoring of business metrics.
- Don't focus solely on model success; prioritize business impact and ensure alignment between model metrics and desired business outcomes.

# Measuring success

How will you know if your ML implementation was worth the work? When should you
start celebrating: right after the model goes to production and serves its first
prediction, or only after a quantitative business metric starts moving in the
right direction?

Before starting a project, it's critical to define your success metrics and
agree on deliverables. You'll need to define and track the following two types
of metrics:

- **Business metrics.** Metrics for quantifying business performance, for
  example, revenue, click-through rate, or number of users.

- **Model metrics.** [Metrics](/machine-learning/glossary#metric)
  for quantifying model quality, for example,
  [Root Mean Squared Error](/machine-learning/glossary#root-mean-squared-error-rmse),
  [precision](/machine-learning/glossary#precision), or
  [recall](/machine-learning/glossary#recall).

Business metrics
----------------

Business metrics are the most important.
They're the reason you're using ML:
you want to improve the business.

Start with quantifiable product or business metrics. The metric should be as
granular and focused as possible. The following are examples of focused,
quantifiable business metrics:

- Reduce a datacenter's monthly electricity costs by 30 percent.
- Increase revenue from product recommendations by 12 percent.
- Increase click-through rate by 9 percent.
- Increase customer sentiment from opt-in surveys by 20 percent.
- Increase time on page by 4 percent.

| **Important:** The metric you care about might change. For example, you might want to optimize revenue from recommendations. After launching, revenue might increase, but subscriptions to your service might decrease. In this case, you might reevaluate your success metric as the goal becomes clearer.

### Tracking business metrics

If you're not tracking the business metric you want to improve, start by
implementing the infrastructure to do so. Setting a goal to increase
click-through rate by 15 percent isn't logical if you're not currently
measuring click-through rates.

More importantly, make sure you're measuring the right metric for your problem.
For instance, don't spend time writing instrumentation to track click-through
rates if the more important metric might be revenue from recommendations.

As your project progresses, you'll learn whether the target success metric is
actually realistic. In some cases, you might determine that the project isn't
viable given the defined success metrics.

Model metrics
-------------

When should you put the model into production? When
[AUC](/machine-learning/glossary#auc-area-under-the-roc-curve) reaches
a certain value? When the model reaches a particular
[F1 score](/machine-learning/glossary#f1)?
The answer depends on the type of problem you're solving and the prediction
quality you think you need to improve the business metric.

When determining which metrics to evaluate your model against, consider the
following:

- **Determine a single metric to optimize**. For example, classification models
  can be evaluated against a variety of metrics
  ([AUC](/machine-learning/glossary#auc-area-under-the-roc-curve),
  [AUC-PR](/machine-learning/glossary#pr-auc-area-under-the-pr-curve),
  and so on). Choosing the best model can be challenging when different
  metrics favor different models. Therefore, agree on a single metric to
  evaluate models against.

- **Determine acceptability goals to meet**. Acceptability goals are
  different from model evaluation metrics. They refer to goals a model needs
  to meet to be considered acceptable for an intended use case. For
  example, an acceptability goal might be "incorrect output is less
  than 0.1%," or "recall for the top five categories is greater than 97%."

For example, suppose a
[binary classification model](/machine-learning/glossary#binary-classification)
detects fraudulent transactions. Its optimization metric might be recall, while
its acceptability goal might be a precision floor. In other words, you'd
optimize recall (correctly identifying most fraudulent transactions) while
requiring precision to stay at or above a particular value (so that most
transactions flagged as fraudulent really are fraudulent).

### Connection between model metrics and business metrics

Fundamentally, you're trying to develop a model whose prediction quality is
causally connected to your business metric. Great model metrics don't
necessarily imply improved business metrics. Your team might develop a model
with impressive metrics, but the model's predictions might fail to improve the
business metric.

When you're satisfied with your model's prediction quality, try to determine how
the model's metrics affect the business metric.
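To make the fraud-detection setup above concrete (optimize recall, while requiring precision to meet an acceptability floor), here's a minimal sketch of such an evaluation gate. The function names, the toy labels, and the 0.95 precision floor are illustrative assumptions, not part of any particular framework:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = fraudulent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def is_acceptable(y_true, y_pred, precision_floor=0.95):
    """Recall is the single metric to optimize; the model is only
    acceptable if precision stays at or above the agreed floor
    (0.95 here is a hypothetical acceptability goal)."""
    precision, _ = precision_recall(y_true, y_pred)
    return precision >= precision_floor


# Toy evaluation set: 1 = fraudulent transaction.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]  # misses one fraud case, no false alarms

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)              # precision 1.0, recall = 2/3
print(is_acceptable(y_true, y_pred))  # True: precision meets the floor
```

Framing the acceptability goal as a pass/fail gate keeps the launch decision separate from model selection: candidate models are still ranked by recall alone, the single optimization metric.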
Typically, teams deploy the model to a small percentage of users, for example
1 percent, and then monitor the business metric.

For instance, suppose your team develops a model to increase revenue by
predicting customer churn. In theory, if you can predict whether a customer is
likely to leave the platform, you can encourage them to stay.

Your team creates a model with 95% prediction quality and tests it on a small
sample of users. However, revenue doesn't increase; in fact, customer churn
increases. Here are some possible explanations:

- **Predictions don't occur early enough to be actionable**. The model can
  only predict customer churn within a seven-day timeframe, which isn't
  soon enough to offer incentives that keep customers on the platform.

- **Incomplete features**. Other factors that contribute to customer churn
  might be missing from the training dataset.

- **Threshold isn't high enough**. The model might need a prediction
  quality of 97% or higher to be useful.

This simple example highlights two points:

- It's important to perform early user testing to prove (and understand) the
  connection between the model's metrics and the business metrics.
- Great model metrics don't guarantee improved business metrics.

### Generative AI

Evaluating generative AI output presents unique challenges. In many cases, such
as open-ended or creative output, it's more difficult than evaluating
traditional ML output.

LLMs can be measured and evaluated against a variety of metrics. Determining
which metrics to evaluate your model against depends on your use case.

Keep in mind
------------

Don't confuse model success with business success. In other words, a model with
outstanding metrics doesn't guarantee business success.

Many skilled engineers can create models with impressive metrics. Training a
good-enough model typically isn't the issue. Rather, the issue is that the model
doesn't improve the business metric.
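The small-sample rollout described above can be sketched as deterministic user bucketing plus a simple comparison of the business metric between the control group and the treated slice. Everything here (the salt, the 1 percent fraction, the revenue numbers) is a hypothetical illustration:

```python
import hashlib


def in_canary(user_id: str, fraction: float = 0.01,
              salt: str = "churn-model-v1") -> bool:
    """Deterministically route roughly `fraction` of users to the new model.
    Hash-based bucketing keeps assignment stable across sessions; the salt
    (an illustrative name) lets you re-randomize per experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < fraction * 10_000


def business_metric_lift(control_values, treatment_values):
    """Relative change in the business metric (for example, revenue per
    user) between the control group and the canary group."""
    control = sum(control_values) / len(control_values)
    treatment = sum(treatment_values) / len(treatment_values)
    return (treatment - control) / control


# Hypothetical per-user revenue after launching the churn model to the canary.
lift = business_metric_lift([10.0, 12.0], [9.5, 11.0])
print(f"{lift:+.1%}")  # -6.8%: revenue is down despite strong model metrics
```

If the business metric doesn't move, or moves the wrong way as in the churn example, that's a signal to revisit the hypotheses listed above rather than to keep polishing the model metric.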
An ML project can be destined for failure when business metrics and model
metrics are misaligned.

### Check Your Understanding

You have a clear business problem and a well-defined solution for using an LLM
as a customer support agent. How should you think about measuring whether the
solution is successful?

**The number of resolved support cases requiring human involvement decreases
from 72% to 50%.**
Correct. This is a quantifiable business metric that you can track.

**The LLM's evaluation metrics are consistently high.**
Good model metrics don't guarantee improved business metrics.

**Feedback from initial user testing is very favorable.**
Early user feedback is typically more qualitative than quantitative. You'll
need a quantifiable business metric for measuring success.

| **Key Terms:**
|
| |-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| | - [AUC](/machine-learning/glossary#auc-area-under-the-roc-curve) | - [AUC-PR](/machine-learning/glossary#pr-auc-area-under-the-pr-curve) |
| | - [binary classification](/machine-learning/glossary#binary-classification) | - [F1 score](/machine-learning/glossary#f1) |
| | - [metric](/machine-learning/glossary#metric) | - [precision](/machine-learning/glossary#precision) |
| | - [recall](/machine-learning/glossary#recall) | - [root mean squared error](/machine-learning/glossary#root-mean-squared-error-rmse) |