Experiments
Experiments drive a project toward viability. They are testable and reproducible hypotheses. When running experiments, the goal is to make continual, incremental improvements by evaluating a variety of model architectures and features. When experimenting, you'll want to do the following:
Determine baseline performance. Start by establishing a baseline metric. The baseline acts as a measuring stick to compare experiments against.
In some cases, the current non-ML solution can provide the first baseline metric. If no solution currently exists, create an ML model with a simple architecture and a few features, and use its metrics as the baseline.
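For illustration, here is a minimal sketch of establishing a first baseline when no solution exists yet: a "model" that always predicts the mean of the training labels, evaluated with RMSE. The labels and the metric choice are hypothetical, not part of the original text.

```python
# A minimal sketch, assuming a numeric prediction task evaluated with RMSE.
# The labels below are hypothetical; the "model" simply predicts the mean
# of the training labels, which is about the simplest possible baseline.
import statistics

train_labels = [3.0, 5.0, 4.0, 6.0, 2.0]   # hypothetical training labels
eval_labels = [4.0, 5.5, 3.5]              # hypothetical evaluation labels

baseline_prediction = statistics.mean(train_labels)
baseline_rmse = statistics.mean(
    (y - baseline_prediction) ** 2 for y in eval_labels
) ** 0.5
print(f"Baseline RMSE: {baseline_rmse:.3f}")  # future experiments compare against this
```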
Make single, small changes. Make only a single, small change at a time, for example, to the hyperparameters, architecture, or features. If the change improves the model, that model's metrics become the new baseline to compare future experiments against.
The following are examples of experiments that make a single, small change (a code sketch follows the list):
- Include feature X.
- Use 0.5 dropout on the first hidden layer.
- Take the log transform of feature Y.
- Change the learning rate to 0.001.
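As a concrete illustration, the sketch below builds a baseline model and an experiment model that differ only in the dropout applied to the first hidden layer. It assumes TensorFlow/Keras and a 10-feature input; `build_model` and its defaults are hypothetical, not code from the original text.

```python
# A minimal sketch, assuming TensorFlow/Keras and a 10-feature input.
# build_model and its defaults are hypothetical; the point is that the
# experiment differs from the baseline by exactly one small change.
import tensorflow as tf

def build_model(dropout_rate=0.0, learning_rate=0.001):
    """Builds a small regression model; everything except the arguments stays fixed."""
    layers = [
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
    ]
    if dropout_rate > 0.0:
        layers.append(tf.keras.layers.Dropout(dropout_rate))  # the single change
    layers.append(tf.keras.layers.Dense(1))
    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
    )
    return model

baseline_model = build_model()                    # current baseline
experiment_model = build_model(dropout_rate=0.5)  # 0.5 dropout on the first hidden layer
```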
Record the progress of the experiments. You'll most likely need to do lots of experiments. Experiments with poor (or neutral) prediction quality compared to the baseline are still useful to track: they signal which approaches won't work. Because progress is typically non-linear, it's important to show that you're working on the problem by highlighting all the approaches you found that don't work, in addition to your progress at increasing the baseline quality.
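One lightweight way to record progress, including the experiments that didn't beat the baseline, is to append every run to a shared log. The sketch below assumes a flat CSV file; the file name, columns, and values are illustrative, not a prescribed format.

```python
# A minimal sketch, assuming experiments are tracked in a flat CSV file.
import csv
from datetime import datetime, timezone

LOG_PATH = "experiment_log.csv"
FIELDS = ["timestamp", "change", "hyperparameters", "eval_rmse", "beats_baseline"]

def record_experiment(change, hyperparameters, eval_rmse, baseline_rmse):
    """Appends one experiment, including runs that did not beat the baseline."""
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # write the header only when the file is new
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "change": change,
            "hyperparameters": hyperparameters,
            "eval_rmse": eval_rmse,
            "beats_baseline": eval_rmse < baseline_rmse,
        })

record_experiment("0.5 dropout on first hidden layer",
                  "lr=0.001, batch=128", eval_rmse=4.2, baseline_rmse=3.9)
```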
Because each full training run on a real-world dataset can take hours (or days), consider running multiple independent experiments concurrently to explore the space quickly. As you continue to iterate, you'll hopefully get closer and closer to the level of quality you'll need for production.
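If compute allows, independent experiments can be launched in parallel. The sketch below uses Python's standard library; `train_and_evaluate` is a hypothetical stand-in for one full training and evaluation run, and each config differs from the first (the baseline) by a single change.

```python
# A minimal sketch of running several independent experiments in parallel.
# train_and_evaluate is a hypothetical placeholder for one full training job.
from concurrent.futures import ProcessPoolExecutor

def train_and_evaluate(config):
    # Placeholder: train a model with this config and return its eval metric.
    return {"config": config, "eval_rmse": 0.0}

configs = [
    {"learning_rate": 0.001, "dropout": 0.0},   # baseline
    {"learning_rate": 0.001, "dropout": 0.5},   # single change: dropout
    {"learning_rate": 0.0005, "dropout": 0.0},  # single change: learning rate
]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as pool:
        for result in pool.map(train_and_evaluate, configs):
            print(result)
```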
Noise in experimental results
Note that you might encounter noise in experimental results that isn't from changes to the model or the data, making it difficult to determine whether a change you made actually improved the model. The following are examples of things that can produce noise in experimental results:
- Data shuffling: The order in which the data is presented to the model can affect the model's performance.
- Variable initialization: The way in which the model's variables are initialized can also affect its performance.
- Asynchronous parallelism: If the model is trained using asynchronous parallelism, the order in which the different parts of the model are updated can also affect its performance.
- Small evaluation sets: If the evaluation set is too small, it may not be representative of the model's overall performance, producing uneven variations in the model's quality.
Running an experiment multiple times helps confirm experimental results.
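A simple way to do that is to repeat the same configuration across several random seeds and look at the spread of the metric rather than a single number. In the sketch below, `run_experiment` is a hypothetical stand-in for one training-and-evaluation run; the metric values are simulated for illustration.

```python
# A minimal sketch of repeating one experiment across seeds to estimate noise.
# run_experiment is a hypothetical placeholder; here it simulates a noisy metric.
import random
import statistics

def run_experiment(seed):
    """Seeds the run, trains, evaluates, and returns the eval metric (stubbed here)."""
    random.seed(seed)
    return 3.9 + random.uniform(-0.1, 0.1)  # stand-in for eval RMSE

metrics = [run_experiment(seed) for seed in range(5)]
print(f"mean={statistics.mean(metrics):.3f} stdev={statistics.stdev(metrics):.3f}")
```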
Align on experimentation practices
Your team should have a clear understanding of what exactly an "experiment" is, with a defined set of practices and artifacts. You'll want documentation that outlines the following:
Artifacts. What are the artifacts for an experiment? In most cases, an experiment is a tested hypothesis that can be reproduced, typically by logging the metadata (like the features and hyperparameters) that indicates the changes between experiments and how they affect model quality.
Coding practices. Will everyone use their own experimental environments? How possible (or easy) will it be to unify everyone's work into shared libraries?
Reproducibility and tracking. What are the standards for reproducibility? For instance, should the team use the same data pipeline and versioning practices, or is it OK to show only plots? How will experimental data be saved: as SQL queries or as model snapshots? Where will the logs from each experiment be documented: in a doc, a spreadsheet, or a CMS for managing experiments?
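As one possible shape for such documentation, the sketch below shows an experiment record that captures the data snapshot, pipeline version, features, hyperparameters, seed, and metrics. The field names and values are assumptions for illustration, not a required schema.

```python
# A minimal sketch of what a reproducible experiment record might contain;
# the fields and values are illustrative assumptions, not a required schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentRecord:
    experiment_id: str
    hypothesis: str                    # the single change being tested
    data_snapshot: str                 # version or query identifying the data
    pipeline_version: str              # code revision used for the data pipeline
    features: list[str] = field(default_factory=list)
    hyperparameters: dict[str, float] = field(default_factory=dict)
    random_seed: int = 0
    metrics: dict[str, float] = field(default_factory=dict)

record = ExperimentRecord(
    experiment_id="exp-042",
    hypothesis="0.5 dropout on first hidden layer",
    data_snapshot="training_data@2025-07-01",
    pipeline_version="git:abc1234",
    features=["feature_x", "log_feature_y"],
    hyperparameters={"learning_rate": 0.001, "dropout": 0.5},
    random_seed=17,
    metrics={"eval_rmse": 4.2},
)
print(json.dumps(asdict(record), indent=2))
```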
Wrong predictions
No real-world model is perfect. How will your system handle wrong predictions? Begin thinking early on about how to deal with them.
A best-practices strategy encourages users to correctly label wrong predictions. For example, mail apps capture misclassified email by logging the mail users move into their spam folder, as well as the reverse. By capturing ground-truth labels from users, you can design automated feedback loops for data collection and model retraining.
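The sketch below shows one way such a feedback loop could capture a user correction as a labeled example for later retraining. The storage format, file name, and label values are assumptions for illustration, not a specific product's API.

```python
# A minimal sketch of capturing a user correction as a ground-truth label
# for later retraining; the format and field names are illustrative.
import json
from datetime import datetime, timezone

FEEDBACK_PATH = "retraining_feedback.jsonl"

def log_user_correction(message_id, predicted_label, corrected_label):
    """Appends one user correction (e.g., 'moved to spam') as a labeled example."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_id": message_id,
        "predicted_label": predicted_label,
        "ground_truth_label": corrected_label,
    }
    with open(FEEDBACK_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

# The user moved a message the model classified as "not_spam" into the spam folder.
log_user_correction("msg-123", predicted_label="not_spam", corrected_label="spam")
```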
Note that although UI-embedded surveys can capture user feedback, the data is typically qualitative and can't be incorporated into the retraining data.
Implement an end-to-end solution
While your team is experimenting on the model, it's a good idea to start building out parts of the final pipeline (if you have the resources to do so).
Establishing different pieces of the pipeline, like data intake and model retraining, makes it easier to move the final model to production. For example, getting an end-to-end pipeline for ingesting data and serving predictions can help the team start integrating the model into the product and begin conducting early-stage user testing.
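The sketch below outlines what such a skeleton might look like, with hypothetical placeholder stages for data intake, training, and serving that the team can replace as the real components come online.

```python
# A minimal sketch of an end-to-end pipeline skeleton; every stage is a
# hypothetical placeholder to be swapped out for the real component.
def ingest_data(source):
    """Pulls raw data from the source and returns training-ready examples."""
    return [{"feature_x": 1.0, "label": 0}]  # placeholder

def train_model(examples):
    """Trains (or retrains) the model on the ingested examples."""
    return lambda features: 0  # placeholder model: always predicts 0

def serve_prediction(model, features):
    """Returns a prediction for one request, as the product would call it."""
    return model(features)

examples = ingest_data("s3://example-bucket/raw")  # hypothetical data source
model = train_model(examples)
print(serve_prediction(model, {"feature_x": 2.0}))
```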
Troubleshooting stalled projects
You might be in scenarios where a project's progress stalls. Maybe your team has been working on a promising experiment but hasn't had success improving the model for weeks. What should you do? The following are some possible approaches:
Strategic. You might need to reframe the problem. After spending time in the experimentation phase, you probably understand the problem, the data, and the possible solutions better. With a deeper knowledge of the domain, you can probably frame the problem more precisely.
For instance, maybe you initially wanted to use linear regression to predict a numeric value. Unfortunately, the data wasn't good enough to train a viable linear regression model. Maybe further analysis reveals that the problem can be solved by predicting whether an example is above or below a specific value. This lets you reframe the problem as a binary classification one.
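Concretely, reframing can be as small as converting the numeric target into a binary label. In the sketch below, the threshold and the values are hypothetical; the cutoff would come from domain knowledge.

```python
# A minimal sketch of reframing a regression target as a binary
# classification label; the threshold and values are hypothetical.
THRESHOLD = 100.0  # chosen from domain knowledge

def to_binary_label(numeric_value, threshold=THRESHOLD):
    """Returns 1 if the example is above the threshold, 0 otherwise."""
    return 1 if numeric_value > threshold else 0

numeric_targets = [42.0, 150.5, 99.9, 101.0]
binary_labels = [to_binary_label(v) for v in numeric_targets]
print(binary_labels)  # [0, 1, 0, 1]
```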
If progress is slower than expected, don't give up. Incremental improvements over time might be the only way to solve the problem. As noted earlier, don't expect the same amount of progress week over week. Often, getting a production-ready version of a model requires substantial amounts of time. Model improvement can be irregular and unpredictable. Periods of slow progress can be followed by spikes in improvement, or the reverse.
Technical. Spend time diagnosing and analyzing wrong predictions. In some cases, you can find the issue by isolating a few wrong predictions and diagnosing the model's behavior in those instances; for example, you might uncover problems with the architecture or the data. In other cases, getting more data can help. You might get a clearer signal that suggests you're on the right path, or it might produce more noise, indicating that other issues exist in the approach.
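The sketch below shows one way to isolate a handful of wrong predictions for manual inspection; the example data and the stubbed predictions are hypothetical stand-ins for your own evaluation set and model output.

```python
# A minimal sketch of pulling out a few misclassified examples for manual
# diagnosis; examples, labels, and predictions are hypothetical stand-ins.
def sample_wrong_predictions(examples, labels, predictions, k=5):
    """Returns up to k misclassified examples, keeping their inputs for inspection."""
    wrong = [
        {"example": x, "label": y, "prediction": p}
        for x, y, p in zip(examples, labels, predictions)
        if p != y
    ]
    return wrong[:k]

examples = [{"feature_x": 1.2}, {"feature_x": 3.4}, {"feature_x": 0.7}]
labels = [0, 1, 0]
predictions = [0, 0, 1]  # would normally come from the model
for case in sample_wrong_predictions(examples, labels, predictions):
    print(case)  # inspect these by hand for data or architecture issues
```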
If you're working on a problem that requires human-labeled datasets, getting a labeled dataset for model evaluation might be hard. Find resources to get the datasets you'll need for evaluation.
Maybe no solution is possible. Time-box your approach, stopping if you haven't made progress within the timeframe. However, if you have a strong problem statement, then it probably warrants a solution.
Check Your Understanding
A team member found a combination of hyperparameters that improves the baseline model metric. What should the other members of the team do?
Maybe incorporate one hyperparameter, but continue with their own experiments.
Correct. If one of the hyperparameters seems like a reasonable choice, try it. However, not all hyperparameter choices make sense in every experimental context.
Change all the hyperparameters in their current experiment to match their co-worker's.
Hyperparameters that improved one model won't necessarily improve a different model. The other teammates should continue with their experiments, which might actually improve the baseline even more later on.
Start building an end-to-end pipeline that will be used to implement the model.
A model that improves the baseline isn't necessarily the model that will ultimately be used in production. They should continue with their experiments, which might actually improve the baseline even more later on.