Productionization
To prepare your ML pipelines for production, you need to do the following:

- Provision compute resources for your pipelines
- Implement logging, monitoring, and alerting
Provisioning compute resources
Running ML pipelines requires compute resources, like RAM, CPUs, and GPUs/TPUs. Without adequate compute, you can't run your pipelines. Therefore, make sure to get sufficient quota to provision the resources your pipelines need to run in production.

- Serving, training, and validation pipelines. These pipelines require TPUs, GPUs, or CPUs. Depending on your use case, you might train and serve on different hardware, or use the same hardware. For example, training might happen on CPUs while serving uses TPUs, or vice versa. In general, it's common to train on bigger hardware and then serve on smaller hardware. When picking hardware, consider the following:
  - Can you train on less expensive hardware?
  - Would switching to different hardware boost performance?
  - What size is the model, and which hardware will optimize its performance?
  - What hardware is ideal for your model's architecture?

  Note: When switching models between hardware, weigh the time and effort needed to migrate the model. Switching hardware might make the model cheaper to run, but the engineering effort to do so might outweigh the savings, or that effort might be better spent on other work.

- Data pipelines. Data pipelines require quota for RAM and CPU. You'll need to estimate how much quota your pipeline needs to generate training and test datasets.

You might not allocate quota for each pipeline. Instead, you might allocate quota that pipelines share. In that case, verify that you have enough quota to run all your pipelines, and set up monitoring and alerting to prevent a single errant pipeline from consuming all of it.
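As a rough illustration of that kind of guardrail, here is a minimal sketch of a shared-quota check. The quota total, the per-pipeline threshold, and the `get` / `send_alert` hooks are illustrative assumptions, not a real API; you would wire this up to whatever metrics and paging systems you actually use.

```python
# Minimal sketch: alert when any pipeline's share of a pooled quota crosses a
# threshold. All numbers and hooks below are hypothetical placeholders.
from typing import Dict

TOTAL_QUOTA_GPU_HOURS = 10_000       # pooled quota shared by all pipelines
PER_PIPELINE_ALERT_FRACTION = 0.40   # one pipeline using >40% is suspicious


def send_alert(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in for a real paging/alerting hook


def check_quota_usage(usage_by_pipeline: Dict[str, float]) -> None:
    """Alerts if a single pipeline, or the pool overall, is consuming too much quota."""
    for pipeline, used in usage_by_pipeline.items():
        if used > TOTAL_QUOTA_GPU_HOURS * PER_PIPELINE_ALERT_FRACTION:
            send_alert(
                f"{pipeline} has used {used:.0f} GPU-hours "
                f"({used / TOTAL_QUOTA_GPU_HOURS:.0%} of the shared quota)."
            )
    if sum(usage_by_pipeline.values()) > TOTAL_QUOTA_GPU_HOURS * 0.9:
        send_alert("Shared quota is above 90%; pipelines may start failing.")


# Example usage with fabricated readings:
check_quota_usage({"training": 4_500, "data": 1_200, "serving": 800})
```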
Estimating quota
To estimate the quota you'll need for the data and training pipelines, find similar projects to base your estimates on. To estimate serving quota, try to predict the service's queries per second. These methods provide a baseline; as you begin prototyping a solution during the experimentation phase, you'll get a more precise quota estimate.

When estimating quota, remember to factor in quota not only for your production pipelines, but also for ongoing experiments.
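For example, a back-of-envelope serving estimate can start from predicted peak queries per second and per-request latency. The numbers below are made-up assumptions for illustration; only the arithmetic carries over to a real estimate.

```python
# Back-of-envelope serving-quota estimate from predicted traffic.
peak_qps = 500                  # predicted peak queries per second (assumed)
latency_s = 0.08                # mean model latency per request, in seconds (assumed)
concurrency_per_replica = 4     # requests one replica handles in parallel (assumed)
headroom = 1.5                  # buffer for spikes and rolling restarts (assumed)

# Little's-law-style estimate: requests in flight = qps * latency.
in_flight = peak_qps * latency_s
replicas_needed = in_flight / concurrency_per_replica * headroom
print(f"Provision roughly {replicas_needed:.0f} serving replicas")  # ~15
```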
Check your understanding

True or false: when choosing hardware to serve predictions, you should always choose more powerful hardware than was used to train the model.

Answer: False. Typically, training requires bigger hardware than serving.
Logging, monitoring, and alerting
Logging and monitoring a production model's behavior is critical. Robust monitoring infrastructure confirms that your models are serving reliable, high-quality predictions.

Good logging and monitoring practices help you proactively identify issues in ML pipelines and mitigate potential business impact. When issues do occur, alerts notify members of your team, and comprehensive logs help diagnose the problem's root cause.

You should implement logging and monitoring to detect the following issues in your ML pipelines:
| Pipeline | Monitor for |
|---|---|
| Serving | Skews or drifts in the serving data compared to the training data; skews or drifts in predictions; data type issues, like missing or corrupted values; quota usage; model quality metrics |
| Data | Skews and drifts in feature values; skews and drifts in label values; data type issues, like missing or corrupted values; quota usage rate; quota limit about to be reached |
| Training | Training time; training failures; quota usage |
| Validation | Skew or drift in the test datasets |

Calculating a production model's quality differs from calculating its quality during training. In production, you won't necessarily have access to ground truth to compare predictions against, so you'll need custom monitoring instrumentation that captures metrics acting as a proxy for model quality. For example, a mail app can't know in real time which mail is spam, but it can monitor the percentage of mail that users move to spam; if that number jumps from 0.5% to 3%, it signals a potential issue with the model. Comparing changes in proxy metrics is more insightful than looking at their raw values.
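To make the skew and drift rows concrete, here is a minimal sketch of one possible drift signal: the population stability index (PSI) between a training feature's distribution and a recent window of serving data. The bucket count, the 0.2 alert threshold, and the simulated data are illustrative assumptions; real monitoring would cover every feature and run on a schedule.

```python
# Minimal sketch of a feature-drift check using the population stability index.
import numpy as np


def psi(train_values: np.ndarray, serving_values: np.ndarray, buckets: int = 10) -> float:
    """Population stability index; ~0 means stable, >0.2 is often treated as drift."""
    edges = np.quantile(train_values, np.linspace(0, 1, buckets + 1))
    # Clip serving values into the training range so out-of-range values land in the end buckets.
    serving_clipped = np.clip(serving_values, edges[0], edges[-1])
    train_frac = np.histogram(train_values, edges)[0] / len(train_values)
    serve_frac = np.histogram(serving_clipped, edges)[0] / len(serving_values)
    train_frac = np.clip(train_frac, 1e-6, None)  # avoid log(0)
    serve_frac = np.clip(serve_frac, 1e-6, None)
    return float(np.sum((serve_frac - train_frac) * np.log(serve_frac / train_frac)))


# Simulated example: the serving data has drifted away from the training data.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)
serving = rng.normal(0.8, 1.0, 5_000)   # simulated mean shift in the serving data
if psi(train, serving) > 0.2:
    print("Feature drift detected; alert the on-call engineer.")
```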
You'll also want logging, monitoring, and alerting for the following (a small health-check sketch follows the list):

- Latency. How long does it take to deliver a prediction?
- Outages. Has the model stopped delivering predictions?
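Here is a minimal sketch of what those two checks might look like over recent prediction logs. The log format, the thresholds, and the fabricated entries are assumptions for illustration, not a prescribed setup.

```python
# Minimal sketch of latency and outage checks over recent prediction logs.
import time

LATENCY_P99_THRESHOLD_S = 0.5    # alert if the slowest 1% exceed 500 ms (assumed)
OUTAGE_WINDOW_S = 300            # alert if nothing served for 5 minutes (assumed)


def check_serving_health(recent_requests: list, now: float) -> list:
    """Returns alert messages for outages or high tail latency."""
    alerts = []
    if not recent_requests or now - max(r["timestamp"] for r in recent_requests) > OUTAGE_WINDOW_S:
        alerts.append("Possible outage: no predictions served recently.")
        return alerts
    latencies = sorted(r["latency_s"] for r in recent_requests)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    if p99 > LATENCY_P99_THRESHOLD_S:
        alerts.append(f"High latency: p99 is {p99 * 1000:.0f} ms.")
    return alerts


# Example usage with fabricated log entries:
logs = [{"timestamp": time.time() - i, "latency_s": 0.05 + 0.001 * i} for i in range(200)]
print(check_serving_health(logs, time.time()))  # [] means healthy
```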
Check your understanding

Which of the following is the main reason for logging and monitoring your ML pipelines?

- Proactively detect issues before they impact users
- Track quota and resource usage
- Identify potential security problems
- All of the above

Answer: All of the above. Logging and monitoring your ML pipelines helps prevent and diagnose problems before they become serious.
Deploying a model
For model deployment, you'll want to document the following:

- The approvals required to begin deployment and to expand the rollout.
- How to put a model into production.
- Where the model gets deployed, for example, whether there are staging or canary environments.
- What to do if a deployment fails.
- How to roll back a model that's already in production.
After automating model training, you'll want to automate validation and deployment as well. Automating deployments distributes responsibility and reduces the likelihood of a deployment being bottlenecked by a single person. It also reduces potential mistakes, increases efficiency and reliability, and enables on-call rotations and SRE support.

Typically, you deploy a new model to a subset of users to check that it behaves as expected. If it does, continue with the deployment; if it doesn't, roll back the deployment and begin diagnosing and debugging the issue.
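As a rough illustration, here is a minimal sketch of that canary flow. The `route_traffic` and `error_rate` helpers are hypothetical stand-ins for your real serving and monitoring systems, and the traffic fraction, tolerance, and metric values are made up.

```python
# Minimal sketch of an automated canary rollout: route a small slice of traffic
# to the new model, compare an error-rate metric against the current model, and
# either complete the rollout or roll back.

CANARY_FRACTION = 0.05          # start with 5% of traffic (assumed)
MAX_RELATIVE_REGRESSION = 0.10  # tolerate at most a 10% relative error increase (assumed)


def route_traffic(model_version: str, fraction: float) -> None:
    print(f"Routing {fraction:.0%} of traffic to {model_version}")  # placeholder


def error_rate(model_version: str) -> float:
    # Fabricated monitoring metrics for the example.
    return {"model-v1": 0.020, "model-v2": 0.021}.get(model_version, 0.0)


def canary_deploy(current: str, candidate: str) -> bool:
    route_traffic(candidate, CANARY_FRACTION)
    if error_rate(candidate) <= error_rate(current) * (1 + MAX_RELATIVE_REGRESSION):
        route_traffic(candidate, 1.0)   # looks healthy: complete the rollout
        return True
    route_traffic(candidate, 0.0)       # regression detected: roll back
    return False


print(canary_deploy("model-v1", "model-v2"))  # True: 0.021 <= 0.020 * 1.1
```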