Productionization

Page Summary

Production ML pipelines require sufficient compute resources like RAM, CPUs, and GPUs/TPUs for serving, training, data processing, and validation.
Implement robust logging, monitoring, and alerting to proactively detect data and model issues (e.g., data drift, prediction skews, quality degradation) across all pipeline stages.
Establish a clear model deployment strategy outlining approvals, procedures, environments, and rollback mechanisms, and aim for automated deployments for efficiency and reliability.
Estimate quota needs based on similar projects and service predictions, and factor in resources for both production and ongoing experimentation.

To prepare your ML pipelines for production, you need to do the following:

Provision compute resources for your pipelines
Implement logging, monitoring, and alerting

Provisioning compute resources

Running ML pipelines requires compute resources, like RAM, CPUs, and GPUs/TPUs. Without adequate compute, you can't run your pipelines. Make sure you have enough quota to provision the resources your pipelines need to run in production.

Serving, training, and validation pipelines. These pipelines require TPUs, GPUs, or CPUs. Depending on your use case, you might train and serve on different hardware, or use the same hardware. For example, training might happen on CPUs but serving might use TPUs, or vice versa. In general, it's common to train on bigger hardware and then serve on smaller hardware.

When picking hardware, consider the following:
- Can you train on less expensive hardware?
- Would switching to different hardware boost performance?
- What size is the model and which hardware will optimize its performance?
- What hardware is ideal based on your model's architecture?

Note: When switching models between hardware, consider the time and effort to migrate the model. Switching hardware might make the model cheaper to run, but the engineering effort to do so might outweigh the savings, or that effort might be better spent on other work.
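
The trade-off in the note above can be put into rough numbers. The following is a minimal sketch, assuming you can estimate the monthly cost on each type of hardware and the one-time engineering cost of migrating. The function name and all figures are hypothetical and for illustration only.

```python
def breakeven_months(current_monthly_cost, new_monthly_cost, migration_cost):
    """Months of savings needed to repay a one-time migration cost.

    Returns None when the new hardware is not cheaper, because the
    migration never pays for itself on cost alone.
    """
    monthly_savings = current_monthly_cost - new_monthly_cost
    if monthly_savings <= 0:
        return None
    return migration_cost / monthly_savings


# Hypothetical figures, for illustration only.
months = breakeven_months(
    current_monthly_cost=12_000.0,
    new_monthly_cost=9_000.0,
    migration_cost=45_000.0,
)
print(months)  # 15.0
```

If the break-even point is further away than the expected lifetime of the model, the migration is probably not worth prioritizing.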

Data pipelines. Data pipelines require quota for RAM and CPU. You'll need to estimate how much quota your pipeline needs to generate training and test datasets.

You might not allocate quota for each pipeline. Instead, you might allocate quota that pipelines share. In such cases, verify you have enough quota to run all your pipelines, and set up monitoring and alerting to prevent a single, errant pipeline from consuming all the quota.
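
A check on a shared quota pool might look something like the following sketch. It assumes you can read current usage per pipeline from your own infrastructure. The function name, the pipeline names, and both thresholds are hypothetical and would need tuning for a real system.

```python
def quota_alerts(usage_by_pipeline, shared_quota, max_share=0.5, warn_at=0.8):
    """Return alert messages for a shared quota pool.

    usage_by_pipeline: mapping of pipeline name -> units currently in use.
    shared_quota: total units available to all pipelines.
    max_share: largest fraction of the pool one pipeline may hold.
    warn_at: fraction of the pool at which to warn about total usage.
    """
    alerts = []
    total = sum(usage_by_pipeline.values())
    if total >= warn_at * shared_quota:
        alerts.append(f"total usage {total}/{shared_quota} is near the limit")
    for name, used in sorted(usage_by_pipeline.items()):
        if used > max_share * shared_quota:
            alerts.append(f"{name} holds {used}/{shared_quota} of the shared quota")
    return alerts


# Hypothetical usage snapshot: the training pipeline is consuming most of the pool.
usage = {"data": 30, "training": 130, "validation": 10}
for message in quota_alerts(usage, shared_quota=200):
    print(message)
```

In practice, you would run a check like this on a schedule and route the messages to your alerting system rather than printing them.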

Estimating quota

To estimate the quota you'll need for the data and training pipelines, find similar projects to base your estimates on. To estimate serving quota, try to predict the service's queries per second. These methods provide a baseline. As you begin prototyping a solution during the experimentation phase, you'll begin to get a more precise quota estimate.

When estimating quota, remember to factor in quota not only for your production pipelines, but also for ongoing experiments.
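
As a rough illustration of estimating serving quota from queries per second, the sketch below converts a predicted peak QPS into a replica count. It assumes you have measured, or can guess, how many queries per second one serving replica handles. The function name, the headroom parameter, and the numbers are hypothetical.

```python
import math


def estimate_serving_replicas(peak_qps, qps_per_replica, headroom=0.3):
    """Estimate replicas needed to serve peak traffic with spare capacity.

    headroom is the extra fraction of capacity reserved for traffic spikes
    and for replicas that are restarting or being updated.
    """
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    return math.ceil(peak_qps * (1 + headroom) / qps_per_replica)


# Hypothetical figures: 1,200 QPS at peak, 150 QPS per replica, 25% headroom.
print(estimate_serving_replicas(1200, 150, headroom=0.25))  # 10
```

This covers production serving only. Quota for ongoing experiments comes on top of this estimate.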

Check Your Understanding

True or false: When choosing hardware to serve predictions, you should always choose more powerful hardware than was used to train the model.

Answer: False. Typically, training requires bigger hardware than serving.

Logging, monitoring, and alerting

Logging and monitoring a production model's behavior is critical. Robust monitoring infrastructure confirms your models are serving reliable, high-quality predictions.

Good logging and monitoring practices help proactively identify issues in ML pipelines and mitigate potential business impact. When issues do occur, alerts notify members of your team, and comprehensive logs facilitate diagnosing the problem's root cause.

You should implement logging and monitoring to detect the following issues with ML pipelines:

Serving. Monitor the following:
- Skews or drifts in the serving data compared to the training data
- Skews or drifts in predictions
- Data type issues, like missing or corrupted values
- Quota usage
- Model quality metrics

Calculating a production model's quality is different than calculating a model's quality during training. In production, you won't necessarily have access to the ground truth to compare predictions against. Instead, you'll need to write custom monitoring instrumentation to capture metrics that act as a proxy for model quality. For example, in a mail app, you won't know which mail is spam in real time. Instead, you can monitor the percentage of mail users move to spam. If the number jumps from 0.5% to 3%, that signals a potential issue with the model. Note that comparing the changes in the proxy metrics is more insightful than their raw numbers.

Data. Monitor the following:
- Skews and drifts in feature values
- Skews and drifts in label values
- Data type issues, like missing or corrupted values
- Quota usage rate
- Quota limit about to be reached

Training. Monitor the following:
- Training time
- Training failures
- Quota usage

Validation. Monitor the following:
- Skew or drift in the test datasets
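
The spam example given for serving-model quality can be sketched as a check on the relative change in a proxy metric. This is a minimal illustration, not a complete monitoring setup. The function name and the threshold are hypothetical, and a real system would likely compare rates over matching time windows.

```python
def proxy_metric_alert(baseline_rate, current_rate, max_relative_increase=1.0):
    """Return True when a proxy metric rises too far above its baseline.

    max_relative_increase=1.0 means "alert if the rate more than doubles".
    """
    if baseline_rate <= 0:
        # No meaningful baseline: any activity is worth a look.
        return current_rate > 0
    relative_increase = (current_rate - baseline_rate) / baseline_rate
    return relative_increase > max_relative_increase


# The share of mail that users move to spam jumps from 0.5% to 3%.
print(proxy_metric_alert(baseline_rate=0.005, current_rate=0.03))   # True
# A small movement from 0.5% to 0.6% stays under the threshold.
print(proxy_metric_alert(baseline_rate=0.005, current_rate=0.006))  # False
```

The check compares the change against the baseline rather than the raw value, which reflects the point that changes in a proxy metric are more informative than its absolute level.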

You'll also want logging, monitoring, and alerting for the following:

Latency. How long does it take to deliver a prediction?
Outages. Has the model stopped delivering predictions?
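
Both checks can be sketched in a few lines. The example below assumes you log a latency for each prediction and a timestamp for the most recent one. The function names, the 99th-percentile budget, and the silence window are hypothetical choices, not recommended values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def serving_alerts(latencies_ms, last_prediction_ts, now_ts,
                   p99_budget_ms=200.0, max_silence_s=60.0):
    """Return alerts for slow predictions or a model that has gone silent."""
    alerts = []
    if now_ts - last_prediction_ts > max_silence_s:
        alerts.append("outage: no predictions served recently")
    if latencies_ms and percentile(latencies_ms, 99) > p99_budget_ms:
        alerts.append("latency: p99 exceeds budget")
    return alerts


# Hypothetical window: most requests are fast, but the slowest 2% are very slow.
latencies = [50.0] * 98 + [500.0, 900.0]
print(serving_alerts(latencies, last_prediction_ts=1000.0, now_ts=1010.0))
```

A tail percentile is used here instead of the mean because a small share of very slow predictions can hide behind a healthy average.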

Check Your Understanding

Which of the following are reasons for logging and monitoring your ML pipelines?
- Proactively detect issues before they impact users
- Track quota and resource usage
- Identify potential security problems
- All of the above

Answer: All of the above. Logging and monitoring your ML pipelines helps prevent and diagnose problems before they become serious.

Deploying a model

For model deployment, you'll want to document the following:

Approvals required to begin deployment and increase the rollout.
How to put a model into production.
Where the model gets deployed, for example, if there are staging or canary environments.
What to do if a deployment fails.
How to roll back a model already in production.

After automating model training, you'll want to automate validation and deployment. Automating deployments distributes responsibility and reduces the likelihood of a deployment being bottlenecked by a single person. It also reduces potential mistakes, increases efficiency and reliability, and enables on-call rotations and SRE support.

Typically, you deploy new models to a subset of users to check that the model is behaving as expected. If it is, continue with the deployment. If it's not, roll back the deployment and begin diagnosing and debugging the issues.
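
An automated version of this decision might look like the following sketch. It assumes you can measure an error rate, or a similar health metric, for both the current model and the model served to the subset of users. The function name, the minimum request count, and the tolerance are hypothetical.

```python
def canary_decision(baseline_error_rate, canary_error_rate, canary_requests,
                    min_requests=1000, tolerance=0.1):
    """Decide what to do with a model deployed to a subset of users.

    Returns "wait", "continue", or "rollback".
    tolerance is the relative increase in error rate the new model may show.
    """
    if canary_requests < min_requests:
        # Not enough traffic yet to judge the new model.
        return "wait"
    allowed = baseline_error_rate * (1 + tolerance)
    return "continue" if canary_error_rate <= allowed else "rollback"


# Hypothetical measurements from a partial rollout.
print(canary_decision(0.02, 0.021, canary_requests=5000))  # continue
print(canary_decision(0.02, 0.05, canary_requests=5000))   # rollback
```

Requiring a minimum number of requests before deciding avoids rolling back, or continuing, based on a handful of early predictions.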
