ML pipelines

In production ML, the goal isn't to build a single model and deploy it. The goal is to build automated pipelines for developing, testing, and deploying models over time. Why? As the world changes, trends in data shift, causing models in production to go stale. Models typically need retraining with up-to-date data to continue serving high-quality predictions over the long term. In other words, you'll want a way to replace stale models with fresh ones.

Without pipelines, replacing a stale model is an error-prone process. For example, once a model starts serving bad predictions, someone will need to manually collect and process new data, train a new model, validate its quality, and then finally deploy it. ML pipelines automate many of these repetitive processes, making the management and maintenance of models more efficient and reliable.

Building pipelines

ML pipelines organize the steps for building and deploying models into well-defined tasks. Pipelines have one of two functions: delivering predictions or updating the model.

Delivering predictions

The serving pipeline delivers predictions. It exposes your model to the real world, making it accessible to your users. For example, when a user wants a prediction—what the weather will be like tomorrow, or how many minutes it'll take to travel to the airport, or a list of recommended videos—the serving pipeline receives and processes the user's data, makes a prediction, and then delivers it to the user.

Updating the model

Models tend to go stale almost immediately after they go into production. In essence, they're making predictions using old information. Their training datasets captured the state of the world a day ago, or in some cases, an hour ago. Inevitably the world has changed: a user has watched more videos and needs a new list of recommendations; rain has caused traffic to slow and users need updated estimates for their arrival times; a popular trend causes retailers to request updated inventory predictions for certain items.

Typically, teams train new models well before the production model goes stale. In some cases, teams train and deploy new models daily in a continuous training and deployment cycle.

The following pipelines work together to train a new model:

  • Data pipeline. The data pipeline processes user data to create training and test datasets.
  • Training pipeline. The training pipeline trains models using the new training datasets from the data pipeline.
  • Validation pipeline. The validation pipeline validates the trained model by comparing it with the production model using test datasets generated by the data pipeline.

Figure 4 illustrates the inputs and outputs of each ML pipeline.

Figure 4. ML pipelines automate many processes for developing and maintaining models. The serving pipeline takes user input and delivers predictions. The data pipeline processes application data logs to create the training and test datasets that the training and validation pipelines use to train and validate new models.

At a very general level, here's how the pipelines keep a fresh model in production (a minimal code sketch of the cycle follows the list):

  1. First, a model goes into production, and the serving pipeline starts delivering predictions.

  2. The data pipeline immediately begins collecting data to generate new training and test datasets.

  3. Based on a schedule or a trigger, the training and validation pipelines train and validate a new model using the datasets generated by the data pipeline.

  4. When the validation pipeline confirms the new model isn't worse than the production model, the new model gets deployed.

  5. This process repeats continuously.
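
This cycle can be expressed as a short orchestration script. The following is a minimal sketch in Python; the function names, placeholder metrics, and accuracy-based comparison are illustrative assumptions, not a specific pipeline framework.

```python
# A minimal sketch of one pass through the retrain-validate-deploy cycle.
# In practice each step runs as its own pipeline in a workflow orchestrator,
# triggered on a schedule or by an event.
from dataclasses import dataclass


@dataclass
class Model:
    version: str
    accuracy: float  # quality measured on the test dataset (placeholder metric)


def build_datasets() -> dict:
    """Data pipeline: turn recent application logs into train/test datasets."""
    return {"train": "train-latest", "test": "test-latest"}  # placeholder dataset IDs


def train_model(train_dataset: str) -> Model:
    """Training pipeline: fit a candidate model on the fresh training data."""
    return Model(version="candidate", accuracy=0.84)  # placeholder result


def evaluate(model: Model, test_dataset: str) -> float:
    """Validation pipeline: score a model on the held-out test dataset."""
    return model.accuracy  # placeholder: compute real evaluation metrics here


def retraining_cycle(production_model: Model, tolerance: float = 0.01) -> Model:
    """Runs one cycle and returns whichever model should serve next."""
    datasets = build_datasets()
    candidate = train_model(datasets["train"])

    production_quality = evaluate(production_model, datasets["test"])
    candidate_quality = evaluate(candidate, datasets["test"])

    # Deploy only if the candidate isn't meaningfully worse than production.
    if candidate_quality >= production_quality - tolerance:
        return candidate  # the fresh model goes into production

    # Otherwise keep serving the current model and alert for investigation.
    print("ALERT: candidate model regressed; check the data and training pipelines")
    return production_model
```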

Model staleness and training frequency

Almost all models go stale. Some models go stale faster than others. For example, models that recommend clothes typically go stale quickly because consumer preferences are notorious for changing frequently. On the other hand, models that identify flowers might never go stale. A flower's identifying characteristics remain stable.

Most models begin to go stale immediately after they're put into production. You'll want to establish a training frequency that reflects the nature of your data. If the data is dynamic, train often. If it's less dynamic, you might not need to train that often.

Train models before they go stale. Early training provides a buffer to resolve potential issues, for example, if the data or training pipeline fails, or the model quality is poor.

A recommended best practice is to train and deploy new models on a daily basis. Just like regular software projects that have a daily build and release process, ML pipelines for training and validation often work best when run daily.

Check Your Understanding

Which of the following models is likely to go stale quickly and will need to be constantly replaced with one trained on newer data? Select all that apply.
Predicts spam
Correct. These models use data that changes constantly in response to a collection of factors, like new spam tactics. As a result, they need to be constantly updated to respond to ever-changing trends.
Recommends clothes
Correct. These models use data that changes constantly in response to a collection of factors, like consumer preferences. As a result, they need to be constantly updated to respond to ever-changing trends.
Classifies bird species
Incorrect. A bird's identifying characteristics remain stable over time, so a species-classification model rarely goes stale.
Predicts if transaction is fraudulent
Correct. These models use data that changes constantly in response to a collection of factors, like new fraud tactics. As a result, they need to be constantly updated to respond to ever-changing trends.

Serving pipeline

The serving pipeline generates and delivers predictions in one of two ways: online or offline.

  • Online predictions. Online predictions happen in real time, typically by sending a request to an online server and returning a prediction. For example, when a user wants a prediction, the user's data is sent to the model and the model returns the prediction.

  • Offline predictions. Offline predictions are precomputed and cached. To serve a prediction, the app finds the cached prediction in the database and returns it. For example, a subscription-based service might predict the churn rate for its subscribers. The model predicts the likelihood of churning for every subscriber and caches it. When the app needs the prediction—for instance, to incentivize users who might be about to churn—it just looks up the precomputed prediction. (Both approaches are sketched in code after this list.)
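
The two approaches can be sketched side by side. In this minimal sketch, `ChurnModel`, the cache dictionary, and the field names are illustrative assumptions rather than a particular serving stack:

```python
# A minimal sketch contrasting online (real-time) and offline (precomputed)
# prediction serving. All names and the stand-in model are assumptions.

class ChurnModel:
    """Stand-in model; a real one would be a trained estimator."""
    def predict(self, features: dict) -> float:
        return 0.5  # placeholder churn probability


# --- Online: compute the prediction at request time ----------------------
def handle_online_request(model: ChurnModel, user_features: dict) -> float:
    """Runs the model on the incoming request and returns the prediction."""
    return model.predict(user_features)


# --- Offline: precompute predictions in batch, then serve from a cache ---
def precompute_predictions(model: ChurnModel, subscribers: list) -> dict:
    """Batch job: score every subscriber and cache the results by ID."""
    return {row["user_id"]: model.predict(row) for row in subscribers}


def handle_offline_request(prediction_cache: dict, user_id: str):
    """At serving time, just look up the precomputed prediction."""
    return prediction_cache.get(user_id)
```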

Figure 5 shows how online and offline predictions get generated and delivered.

Figure 5. Online predictions are delivered in real time; offline predictions are precomputed, cached, and looked up at serving time.

Prediction post-processing

Typically, predictions get post-processed before they're delivered. For example, predictions might be post-processed to remove toxic or biased content. Classification results might go through a process to reorder results instead of showing the model's raw output, for instance, to boost more authoritative content, present a diversity of results, demote particular results (like clickbait), or remove results for legal reasons.
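
As a concrete illustration, here's a minimal post-processing sketch. The result fields (`model_score`, `authoritative`, `clickbait`, `blocked_for_legal_reasons`) and the adjustment weights are hypothetical:

```python
# A minimal sketch of post-processing raw model output before delivery.
# The scoring fields and adjustment weights are illustrative assumptions.

def post_process(ranked_results: list) -> list:
    """Reorders and filters the model's raw ranking before showing it to users."""
    filtered = [r for r in ranked_results if not r.get("blocked_for_legal_reasons")]

    def adjusted_score(result: dict) -> float:
        score = result["model_score"]
        if result.get("authoritative"):
            score *= 1.2      # boost more authoritative content
        if result.get("clickbait"):
            score *= 0.5      # demote clickbait
        return score

    return sorted(filtered, key=adjusted_score, reverse=True)


# Example usage with toy results:
results = [
    {"id": "a", "model_score": 0.9, "clickbait": True},
    {"id": "b", "model_score": 0.8, "authoritative": True},
    {"id": "c", "model_score": 0.7, "blocked_for_legal_reasons": True},
]
print([r["id"] for r in post_process(results)])  # ['b', 'a']
```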

Figure 6 shows a serving pipeline and the typical tasks involved in delivering predictions.

Figure 6. Serving pipeline illustrating the typical tasks involved in delivering predictions, including post-processing.

Note that the feature engineering step is typically built within the model and not a separate, stand-alone process. The data processing code in the serving pipeline is often nearly identical to the data processing code the data pipeline uses to create training and test datasets.
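
One common way to keep those two code paths from drifting apart is to put the transformations in a shared module that both the data pipeline and the serving pipeline import. A minimal sketch, where the specific features and units are assumed for a travel-time model:

```python
# shared_transforms.py -- hypothetical module imported by both the data
# pipeline (when building training/test datasets) and the serving pipeline
# (when processing incoming requests), so the same math runs in both places.

def transform_features(raw: dict) -> dict:
    """Applies identical feature transformations for training and serving."""
    return {
        "trip_distance_km": raw["trip_distance_m"] / 1000.0,
        "hour_of_day": (raw["departure_timestamp"] // 3600) % 24,
        "is_raining": 1.0 if raw.get("precipitation_mm", 0.0) > 0 else 0.0,
    }
```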

Assets and metadata storage

The serving pipeline should incorporate a repository to log model predictions and, if possible, the ground truth.

Logging model predictions lets you monitor your model's quality. By aggregating the logged predictions, you can detect when the model starts to lose quality. Generally, the production model's predictions should have the same average as the labels from the training dataset. For more information, see prediction bias.
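
A minimal sketch of this kind of logging and bias check follows; the log format, file path, and alert threshold are assumptions:

```python
# A minimal sketch of logging served predictions and checking for prediction
# bias: the mean of served predictions should stay close to the mean of the
# labels in the training data. Storage format and threshold are assumptions.
import json
import statistics

PREDICTION_LOG = "predictions.jsonl"  # hypothetical append-only log


def log_prediction(user_id: str, prediction: float, ground_truth=None) -> None:
    """Appends each served prediction (and the label, once known) to the log."""
    record = {"user_id": user_id, "prediction": prediction, "ground_truth": ground_truth}
    with open(PREDICTION_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")


def check_prediction_bias(training_label_mean: float, threshold: float = 0.05) -> bool:
    """Alerts if the average served prediction drifts from the training label mean."""
    with open(PREDICTION_LOG) as f:
        predictions = [json.loads(line)["prediction"] for line in f]
    drift = abs(statistics.mean(predictions) - training_label_mean)
    if drift > threshold:
        print(f"ALERT: prediction bias {drift:.3f} exceeds threshold {threshold}")
        return False
    return True
```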

Capturing ground truth

In some cases, the ground truth only becomes available much later. For example, if a weather app predicts the weather six weeks into the future, the ground truth (what the weather actually is) won't be available for six weeks.

When possible, get users to report the ground truth by adding feedback mechanisms into the app. A mail app can implicitly capture user feedback when users move mail from their inbox to their spam folder. However, this only works when users correctly categorize their mail. When users leave spam in their inbox (because they know it's spam and never open it), the training data becomes inaccurate: that particular piece of mail will be labeled "not spam" when it should be "spam." In other words, always try to find ways to capture and record the ground truth, but be aware of the shortcomings that might exist in feedback mechanisms.
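
For example, a feedback capture step for a mail app might look like the sketch below; the action names and in-memory store are hypothetical, and note that the absence of an action is not evidence that a message isn't spam:

```python
# A minimal sketch of turning implicit user feedback into ground-truth labels
# for a spam model. Event names and storage are illustrative assumptions.
feedback_labels = {}  # message_id -> observed label


def record_user_action(message_id: str, action: str) -> None:
    """Converts user actions in the mail app into (imperfect) labels."""
    if action == "moved_to_spam":
        feedback_labels[message_id] = "spam"
    elif action == "moved_to_inbox":
        feedback_labels[message_id] = "not spam"
    # No action: the true label stays unknown -- don't assume "not spam".
```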

Figure 7 shows predictions being delivered to a user and logged to a repository.

Figure 7. The serving pipeline logs predictions to a repository to monitor model quality and staleness.

Data pipelines

Data pipelines generate training and test datasets from application data. The training and validation pipelines then use the datasets to train and validate new models.

The data pipeline creates training and test datasets with the same features and label originally used to train the model, but with newer information. For example, a maps app would generate training and test datasets from recent travel times between points for millions of users, along with other relevant data, like the weather.

A video recommendation app would generate training and test datasets that included the videos a user clicked on from the recommended list (along with the ones not clicked), as well as other relevant data, like watch history.
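
A minimal data pipeline step might look like the following sketch; the log schema, the 14-day window, and the time-based split are assumptions:

```python
# A minimal sketch of a data pipeline step that turns recent application logs
# into training and test datasets. Log schema, window, and split are assumptions.
import pandas as pd


def build_datasets(log_path: str):
    """Reads application logs and produces fresh train/test datasets."""
    logs = pd.read_csv(log_path, parse_dates=["timestamp"])

    # Keep only recent examples so the model learns up-to-date behavior.
    cutoff = logs["timestamp"].max() - pd.Timedelta(days=14)
    recent = logs[logs["timestamp"] >= cutoff].sort_values("timestamp")

    # Split by time so the test set reflects the newest data.
    split_index = int(len(recent) * 0.9)
    train = recent.iloc[:split_index]
    test = recent.iloc[split_index:]
    return train, test
```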

Figure 8 illustrates the data pipeline using application data to generate training and test datasets.

Figure 8. The data pipeline processes application data to create training and test datasets for the training and validation pipelines.

Data collection and processing

The tasks for collecting and processing data in data pipelines will probably differ from how you performed them during the experimentation phase (where you determined that your solution was feasible):

  • Data collection. During experimentation, collecting data typically requires accessing saved data. For data pipelines, collecting data might require discovering and getting approval to access streaming log data.

    If you need human-labeled data (like medical images), you'll need a process for collecting and updating it as well.

  • Data processing. During experimentation, the right features came from scraping, joining, and sampling the experimentation datasets. For the data pipelines, generating those same features might require entirely different processes. However, be sure to duplicate the data transformations from the experimentation phase by applying the same mathematical operations to the features and labels.

Assets and metadata storage

You'll need a process for storing, versioning, and managing your training and test datasets. Version-controlled repositories provide the following benefits (a minimal versioning sketch follows the list):

  • Reproducibility. Recreate and standardize model training environments and compare prediction quality among different models.

  • Compliance. Adhere to regulatory compliance requirements for auditability and transparency.

  • Retention. Set data retention values for how long to store the data.

  • Access management. Manage who can access your data through fine-grained permissions.

  • Data integrity. Track and understand changes to datasets over time, making it easier to diagnose issues with your data or your model.

  • Discoverability. Make it easy for others to find your datasets and features. Other teams can then determine if they'd be useful for their purposes.
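
One lightweight way to get several of these benefits is to write each dataset snapshot under an immutable, content-addressed version with basic metadata, as in the sketch below. The directory layout and fields are assumptions, not a specific data-versioning tool:

```python
# A minimal sketch of storing a versioned dataset snapshot with a content hash
# and metadata to support reproducibility and auditing. Paths and fields are
# illustrative assumptions.
import datetime
import hashlib
import json
import pathlib
import shutil


def store_dataset_version(dataset_path: str, repo_dir: str, created_by: str) -> str:
    """Copies a dataset into the repository under a content-addressed version."""
    data = pathlib.Path(dataset_path).read_bytes()
    version = hashlib.sha256(data).hexdigest()[:12]  # immutable version ID

    version_dir = pathlib.Path(repo_dir) / version
    version_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(dataset_path, version_dir / "data.csv")

    metadata = {
        "version": version,
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "created_by": created_by,
        "source": dataset_path,
        "size_bytes": len(data),
    }
    (version_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return version
```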

Documenting your data

Good documentation helps others understand key information about your data, like its type, source, size, and other essential metadata. In most cases, documenting your data in a design doc is sufficient. If you plan on sharing or publishing your data, use data cards to structure the information. Data cards make it easier for others to discover and understand your datasets.

Training and validation pipelines

The training and validation pipelines produce new models to replace production models before they go stale. Continually training and validating new models ensures the best model is always in production.

The training pipeline generates a new model from the training datasets, and the validation pipeline compares the quality of the new model with the one in production using test datasets.

Figure 9 illustrates the training pipeline using a training dataset to train a new model.

Figure 9. The training pipeline trains new models using the most recent training dataset.

After the model is trained, the validation pipeline uses test datasets to compare the production model's quality against the trained model.

In general, if the trained model isn't meaningfully worse than the production model, the trained model goes into production. If the trained model is worse, the monitoring infrastructure should create an alert. Trained models with worse prediction quality could indicate potential issues with the data or training pipelines. This approach helps ensure that the best model, trained on the freshest data, is always in production.
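
A minimal validation check might look like the sketch below, assuming scikit-learn-style models, accuracy as the quality metric, and a small tolerance:

```python
# A minimal sketch of the validation step: score the candidate against the
# production model on the same test dataset and decide whether to promote it.
# The metric (accuracy) and tolerance are illustrative assumptions.
from sklearn.metrics import accuracy_score


def validate(candidate_model, production_model, X_test, y_test, tolerance=0.005):
    """Returns True if the candidate should replace the production model."""
    candidate_acc = accuracy_score(y_test, candidate_model.predict(X_test))
    production_acc = accuracy_score(y_test, production_model.predict(X_test))

    if candidate_acc >= production_acc - tolerance:
        return True  # not meaningfully worse: promote the candidate

    # A regression may signal a problem with the data or training pipelines.
    print(f"ALERT: candidate accuracy {candidate_acc:.3f} is below "
          f"production accuracy {production_acc:.3f}")
    return False
```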

Assets and metadata storage

Models and their metadata should be stored in versioned repositories to organize and track model deployments. Model repositories provide the following benefits:

  • Tracking and evaluation. Track models in production and understand their evaluation and prediction quality metrics.

  • Model release process. Easily review, approve, release, or roll back models.

  • Reproducibility and debugging. Reproduce model results and more effectively debug issues by tracing a model's datasets and dependencies across deployments.

  • Discoverability. Make it easy for others to find your model. Other teams can then determine if your model (or parts of it) can be used for their purposes.

Figure 10 illustrates a validated model stored in a model repository.

Figure 10. Validated models are stored in a versioned model repository for tracking and discoverability.

Use model cards to document and share key information about your model, like its purpose, architecture, hardware requirements, evaluation metrics, etc.
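
At its simplest, that documentation can be a structured metadata file stored next to the model, as in the sketch below. The fields and values are placeholders; real model cards usually also cover intended use, limitations, and ethical considerations:

```python
# A minimal sketch of recording model-card-style metadata alongside a stored
# model. All field values here are placeholders.
import json

model_card = {
    "name": "travel-time-estimator",
    "version": "2024-06-01",
    "purpose": "Predict travel time between two points for a maps app",
    "architecture": "Gradient-boosted decision trees",
    "training_data": "Trip logs from the last 14 days",
    "evaluation": {"test_mae_minutes": 3.2},
    "hardware_requirements": "CPU-only serving",
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```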

Check Your Understanding

What are some of the main reasons for using versioned repositories to store predictions, datasets, and models? Select all that apply.
Reproduce and debug issues
Correct. Storing assets in versioned repositories is critical for diagnosing and debugging problems.
Monitor model quality
Correct. Storing assets in versioned repositories is critical for monitoring model quality.
Reduce compute quota
Incorrect. Storing predictions, datasets, and models in versioned repositories doesn't reduce the compute quota you need.

Challenges building pipelines

When building pipelines, you might encounter the following challenges:

  • Getting access to the data you need. Data access might require justifying why you need it. For example, you might need to explain how the data will be used and clarify how personally identifiable information (PII) issues will be addressed. Be prepared to show a proof-of-concept demonstrating how your model makes better predictions with access to certain kinds of data.

  • Getting the right features. In some cases, the features used in the experimentation phase won't be available from the real-time data. Therefore, when experimenting, try to confirm that you'll be able to get the same features in production.

  • Understanding how the data is collected and represented. Learning how the data was collected, who collected it, and how it's represented (along with other issues) can take time and effort. It's important to understand the data thoroughly. Don't use data you're not confident in to train a model that might go to production.

  • Understanding the tradeoffs between effort, cost, and model quality. Incorporating a new feature into a data pipeline can require a lot of effort. However, the additional feature might only slightly improve the model's quality. In other cases, adding a new feature might be easy. However, the resources to get and store the feature might be prohibitively expensive.

  • Getting compute. If you need TPUs for retraining, it might be hard to get the required quota. Also, managing TPUs is complicated. For example, some parts of your model or data might need to be specifically designed for TPUs by splitting them across multiple TPU chips.

  • Finding the right golden dataset. If the data changes frequently, getting golden datasets with consistent and accurate labels can be challenging.

Catching these types of problems during experimentation saves time. For example, you don't want to develop the best features and model only to learn they're not viable in production. Therefore, try to confirm as early as possible that your solution will work within the constraints of a production environment. It's better to spend time verifying a solution works rather than needing to return to the experimentation phase because the pipeline phase uncovered insurmountable problems.