Experiments
Experiments drive a project toward viability. They are testable and reproducible hypotheses. When running experiments, the goal is to make continual, incremental improvements by evaluating a variety of model architectures and features. When experimenting, you'll want to do the following:
Determine baseline performance. Start by establishing a baseline metric. The baseline acts as a measuring stick to compare experiments against.
In some cases, the current non-ML solution can provide the first baseline metric. If no solution currently exists, create an ML model with a simple architecture and a few features, and use its metrics as the baseline.
Make single, small changes. Make only a single, small change at a time, for example, to the hyperparameters, architecture, or features. If the change improves the model, that model's metrics become the new baseline to compare future experiments against.
The following are examples of experiments that make a single, small change (two of them are sketched in code after the list):
- Include feature X.
- Use 0.5 dropout on the first hidden layer.
- Take the log transform of feature Y.
- Change the learning rate to 0.001.
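To make this concrete, here is a minimal sketch of how two of these single-change experiments might look in code, assuming a Keras/TensorFlow setup and a hypothetical input width of 10 features. It is an illustration, not a prescribed architecture:

```python
import tensorflow as tf

def build_model(dropout_rate=0.2, learning_rate=0.01):
    """Builds a small baseline model; vary only one argument per experiment."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),                   # assumes 10 input features
        tf.keras.layers.Dense(64, activation="relu"),  # first hidden layer
        tf.keras.layers.Dropout(dropout_rate),         # dropout applied to it
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
        metrics=["mae"],
    )
    return model

baseline_model = build_model()                      # current baseline configuration
dropout_experiment = build_model(dropout_rate=0.5)  # single change: dropout 0.5
lr_experiment = build_model(learning_rate=0.001)    # single change: learning rate 0.001
```

Each variant differs from the baseline in exactly one argument, so any metric difference can be attributed to that change.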
Record the progress of the experiments. You'll most likely need to do lots of experiments. Experiments with poor (or neutral) prediction quality compared to the baseline are still useful to track. They signal which approaches won't work. Because progress is typically non-linear, it's important to show that you're working on the problem by highlighting all the ways you found that don't work, in addition to your progress at increasing the baseline quality.
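One lightweight way to keep that record is to append each run's metadata and metrics to a shared log. The sketch below is only an illustration, assuming a plain CSV file and hypothetical field names; a spreadsheet or a dedicated experiment-tracking tool works just as well:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("experiment_log.csv")  # hypothetical shared log file
FIELDS = ["timestamp", "description", "hyperparameters", "features",
          "eval_metric", "beats_baseline"]

def log_experiment(description, hyperparameters, features,
                   eval_metric, baseline_metric):
    """Appends one experiment record, including runs that didn't beat the baseline."""
    is_new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "description": description,
            "hyperparameters": str(hyperparameters),
            "features": ",".join(features),
            "eval_metric": eval_metric,
            "beats_baseline": eval_metric > baseline_metric,  # assumes higher is better
        })

# A neutral or negative result is still worth recording: it shows what doesn't work.
log_experiment(
    description="dropout 0.5 on first hidden layer",
    hyperparameters={"dropout": 0.5, "learning_rate": 0.01},
    features=["feature_x", "log_feature_y"],
    eval_metric=0.81,
    baseline_metric=0.82,
)
```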
Because each full training run on a real-world dataset can take hours (or days), consider running multiple independent experiments concurrently to explore the search space quickly. As you continue to iterate, you'll hopefully get closer and closer to the level of quality you'll need for production.
Noise in experimental results
Note that you might encounter noise in experimental results that isn't caused by changes to the model or the data, making it difficult to determine whether a change you made actually improved the model. The following are examples of things that can produce noise in experimental results:
- Data shuffling: The order in which the data is presented to the model can affect the model's performance.
- Variable initialization: The way in which the model's variables are initialized can also affect its performance.
- Asynchronous parallelism: If the model is trained using asynchronous parallelism, the order in which the different parts of the model are updated can also affect its performance.
- Small evaluation sets: If the evaluation set is too small, it may not be representative of the model's overall performance, producing uneven variations in the model's quality.
Running an experiment multiple times helps confirm experimental results.
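As a minimal sketch of that practice, assuming a hypothetical train_and_evaluate(seed) function standing in for a full training run, you can repeat a run over several seeds and compare the spread to the size of the improvement you think you observed:

```python
import random
import statistics

def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run that returns one eval metric."""
    random.seed(seed)  # in a real run, also seed NumPy and the ML framework
    return 0.80 + random.uniform(-0.02, 0.02)  # placeholder metric with some noise

seeds = [0, 1, 2, 3, 4]
metrics = [train_and_evaluate(seed) for seed in seeds]

print(f"mean={statistics.mean(metrics):.4f}, stdev={statistics.stdev(metrics):.4f}")
# If a change's apparent improvement is smaller than this run-to-run spread,
# it may be noise rather than a real gain.
```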
Align on experimentation practices
Your team should have a clear understanding of what exactly an "experiment" is, with a defined set of practices and artifacts. You'll want documentation that outlines the following:
- Artifacts. What are the artifacts for an experiment? In most cases, an experiment is a tested hypothesis that can be reproduced, typically by logging the metadata (like the features and hyperparameters) that indicates the changes between experiments and how they affect model quality.
- Coding practices. Will everyone use their own experimental environments? How possible (or easy) will it be to unify everyone's work into shared libraries?
- Reproducibility and tracking. What are the standards for reproducibility? For instance, should the team use the same data pipeline and versioning practices, or is it OK to show only plots? How will experimental data be saved: as SQL queries or as model snapshots? Where will the logs from each experiment be documented: in a doc, a spreadsheet, or a CMS for managing experiments?
Wrong predictions
No real-world model is perfect. How will your system handle wrong predictions?
Begin thinking early on about how to deal with them.
A best-practices strategy encourages users to correctly label wrong predictions. For example, mail apps capture misclassified email by logging the mail users move into their spam folder, as well as the reverse. By capturing ground truth labels from users, you can design automated feedback loops for data collection and model retraining.
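The following sketch illustrates one possible shape for such a feedback loop. It assumes a hypothetical corrections store, schema, and event handler; a real mail app would wire this into its existing logging and retraining infrastructure:

```python
from dataclasses import dataclass

@dataclass
class LabelCorrection:
    """A ground-truth label implied by a user action (hypothetical schema)."""
    message_id: str
    predicted_label: str   # what the model said
    corrected_label: str   # what the user's action implies

corrections = []  # stand-in for a durable store feeding the retraining pipeline

def on_user_moved_message(message_id, predicted_label, destination_folder):
    """Treats moving a message into or out of spam as an implicit label correction."""
    corrected_label = "spam" if destination_folder == "spam" else "not_spam"
    if corrected_label != predicted_label:
        corrections.append(LabelCorrection(message_id, predicted_label, corrected_label))

# Example: the model predicted "not_spam", but the user moved the mail to spam.
on_user_moved_message("msg-123", predicted_label="not_spam", destination_folder="spam")
# The accumulated corrections can later be merged into the retraining dataset.
```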
Note that although UI-embedded surveys capture user feedback, the data is typically qualitative and can't be incorporated into the retraining data.
Implement an end-to-end solution
While your team is experimenting on the model, it's a good idea to start building out parts of the final pipeline (if you have the resources to do so).
Establishing different pieces of the pipeline, like data intake and model retraining, makes it easier to move the final model to production. For example, getting an end-to-end pipeline for ingesting data and serving predictions can help the team start integrating the model into the product and begin conducting early-stage user testing.
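As a rough sketch of what building out "different pieces of the pipeline" can look like, the skeleton below separates ingestion, training, and serving behind simple function boundaries so each piece can be developed and swapped independently. Every name and placeholder implementation here is an assumption for illustration only:

```python
def ingest_data(source_uri):
    """Placeholder ingestion stage: in production, read from the real data source."""
    return [{"feature_x": 1.0, "label": 0}, {"feature_x": 3.0, "label": 1}]

def train_model(examples):
    """Placeholder training stage: a trivial threshold 'model' stands in for the real one."""
    threshold = sum(e["feature_x"] for e in examples) / len(examples)
    return {"threshold": threshold}

def serve_prediction(model, features):
    """Placeholder serving stage: returns a prediction for one request."""
    return int(features["feature_x"] > model["threshold"])

def run_pipeline(source_uri):
    """Wires the stages together; each stage can later be replaced by the production version."""
    examples = ingest_data(source_uri)
    model = train_model(examples)
    return lambda features: serve_prediction(model, features)

predict = run_pipeline("hypothetical://data-source")
print(predict({"feature_x": 2.5}))  # 1
```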
Troubleshooting stalled projects
You might find yourself in scenarios where a project's progress stalls. Maybe your team has been working on a promising experiment but hasn't had success improving the model for weeks. What should you do? The following are some possible approaches:
Strategic. You might need to reframe the problem. After spending time in the experimentation phase, you probably understand the problem, the data, and the possible solutions better. With a deeper knowledge of the domain, you can probably frame the problem more precisely.
For instance, maybe you initially wanted to use linear regression to predict a numeric value. Unfortunately, the data wasn't good enough to train a viable linear regression model. Further analysis might reveal that the problem can be solved by predicting whether an example is above or below a specific value, which lets you reframe it as a binary classification problem.
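A minimal sketch of that reframing, assuming hypothetical numeric labels and an arbitrary cutoff value, is simply relabeling the targets before training a classifier:

```python
THRESHOLD = 100.0  # assumed business-relevant cutoff, for illustration only

numeric_labels = [42.0, 137.5, 99.9, 250.0, 61.3]

# Regression target -> binary classification target:
# 1 if the value is above the threshold, 0 otherwise.
binary_labels = [1 if value > THRESHOLD else 0 for value in numeric_labels]

print(binary_labels)  # [0, 1, 0, 1, 0]
```

The cutoff itself would come from the product requirement, not from this sketch.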
If progress is slower than expected, don't give up. Incremental improvements over time might be the only way to solve the problem. As noted earlier, don't expect the same amount of progress week over week. Often, getting a production-ready version of a model requires substantial amounts of time. Model improvement can be irregular and unpredictable. Periods of slow progress can be followed by spikes in improvement, or the reverse.
Technical. Spend time diagnosing and analyzing wrong predictions. In some cases, you can find the issue by isolating a few wrong predictions and diagnosing the model's behavior in those instances. For example, you might uncover problems with the architecture or the data. In other cases, getting more data can help. You might get a clearer signal that suggests you're on the right path, or it might produce more noise, indicating other issues exist in the approach.
If you're working on a problem that requires human-labeled datasets, a labeled dataset for model evaluation might be hard to obtain. Find resources to get the datasets you'll need for evaluation.
Maybe no solution is possible. Time-box your approach, stopping if you haven't made progress within the timeframe. However, if you have a strong problem statement, then it probably warrants a solution.
Check Your Understanding
A team member found a combination of hyperparameters that improves the baseline model metric. What should the other members of the team do?
Maybe incorporate one hyperparameter, but continue with their own experiments.
Correct. If one of their hyperparameters seems like a reasonable choice, try it. However, not all hyperparameter choices make sense in every experimental context.
Change all the hyperparameters in their current experiment to match their co-worker's.
Hyperparameters that improved one model won't necessarily improve a different model. The other teammates should continue with their own experiments, which might actually improve the baseline even more later on.
Start building an end-to-end pipeline that will be used to implement the model.
A model that improves the baseline isn't necessarily the model that will ultimately be used in production. They should continue with their experiments, which might actually improve the baseline even more later on.