Production ML systems: Monitoring pipelines

Congratulations! You've deployed the unicorn model. Your model should run 24x7 without any problems. To ensure that it does, you must monitor your machine learning (ML) pipeline.

Write a data schema to validate raw data

To monitor your data, you should continuously check it against expected statistical values by writing rules that the data must satisfy. This collection of rules is called a data schema. Define a data schema by following these steps:

  1. Understand the range and distribution of your features. For categorical features, understand the set of possible values.

  2. Encode your understanding into the data schema. The following are examples of rules:

    • Ensure that user-submitted ratings are always in the range 1 to 5.
    • Check that the word "the" occurs most frequently (for an English text feature).
    • Check that each categorical feature is set to a value from a fixed set of possible values.
  3. Test your data against the data schema. Your schema should catch data errors such as the following (see the sketch after this list):

    • Anomalies
    • Unexpected values of categorical variables
    • Unexpected data distributions
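
The following is a minimal sketch of steps 2 and 3, assuming the raw data arrives as a pandas DataFrame with hypothetical rating and device_type columns; purpose-built tools such as TensorFlow Data Validation can infer and enforce richer schemas.

    import pandas as pd

    # Encode your understanding of the data as rules (step 2).
    SCHEMA = {
        "rating": {"kind": "number", "min": 1, "max": 5},
        "device_type": {"kind": "category", "allowed": {"mobile", "desktop", "tablet"}},
    }

    def validate(df: pd.DataFrame) -> list[str]:
        """Tests a batch of raw data against the schema (step 3); returns violations."""
        errors = []
        for column, rules in SCHEMA.items():
            if column not in df.columns:
                errors.append(f"missing column: {column}")
                continue
            values = df[column].dropna()
            if rules["kind"] == "number":
                bad = values[(values < rules["min"]) | (values > rules["max"])]
                if not bad.empty:
                    errors.append(f"{column}: {len(bad)} value(s) outside "
                                  f"[{rules['min']}, {rules['max']}]")
            else:  # categorical
                unexpected = set(values.unique()) - rules["allowed"]
                if unexpected:
                    errors.append(f"{column}: unexpected categories {unexpected}")
        return errors

    raw = pd.DataFrame({"rating": [4, 5, 7], "device_type": ["mobile", "watch", "desktop"]})
    print(validate(raw))  # flags the rating of 7 and the unexpected "watch" category

Run this check on every new batch of raw data so the schema acts as an automated gate rather than documentation.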

Write unit tests to validate feature engineering

While your raw data might pass the data schema, your model doesn't train on raw data. Rather, your model trains on data that has been feature engineered. For example, your model trains on normalized numerical features rather than raw numerical data. Because feature-engineered data can be very different from the raw input data, you must check it separately from your checks on the raw input data.

Write unit tests based on your understanding of feature-engineered data. For example, you can write unit tests to check for conditions such as the following (see the sketch after this list):

  • All numeric features are scaled, for example, between 0 and 1.
  • One-hot encoded vectors only contain a single 1 and N-1 zeroes.
  • Data distributions after transformation conform to expectations. For example, if you've normalized using Z-scores, the mean of the Z-scores should be 0.
  • Outliers are handled, such as by scaling or clipping.
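
The following is a minimal unittest sketch of the first two checks, built around a hypothetical engineer_features() transformation; substitute your own feature engineering code and feature names.

    import unittest
    import numpy as np

    def engineer_features(raw_ages, raw_colors):
        """Toy stand-in: min-max scales ages and one-hot encodes colors."""
        ages = np.asarray(raw_ages, dtype=float)
        scaled = (ages - ages.min()) / (ages.max() - ages.min())
        palette = ["blue", "green", "pink"]
        one_hot = np.eye(len(palette))[[palette.index(c) for c in raw_colors]]
        return {"age_scaled": scaled, "color_one_hot": one_hot}

    class FeatureEngineeringTest(unittest.TestCase):
        def setUp(self):
            self.features = engineer_features([1, 5, 9], ["pink", "blue", "green"])

        def test_numeric_features_are_scaled_between_0_and_1(self):
            scaled = self.features["age_scaled"]
            self.assertTrue(np.all((scaled >= 0.0) & (scaled <= 1.0)))

        def test_one_hot_vectors_contain_a_single_1(self):
            one_hot = self.features["color_one_hot"]
            np.testing.assert_array_equal(one_hot.sum(axis=1), np.ones(len(one_hot)))

    if __name__ == "__main__":
        unittest.main()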

Check metrics for important data slices

A successful whole sometimes obscures an unsuccessful subset. In other words, a model with great overall metrics might still make terrible predictions for certain situations. For example:

Your unicorn model performs well overall, but performs poorly when making predictions for the Sahara desert.

If you are the kind of engineer satisfied with an overall great AUC, then you might not notice the model's issues in the Sahara desert. If making good predictions for every region is important, then you need to track performance for every region. Subsets of data, like the one corresponding to the Sahara desert, are called data slices.

Identify data slices of interest. Then compare model metrics for these data slices against the metrics for your entire dataset. Checking that your model performs well across all data slices helps remove bias. See Fairness: Evaluating for Bias for more information.
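
A minimal sketch of slice-level evaluation, assuming predictions and labels live in a pandas DataFrame with a hypothetical region column and that scikit-learn is available for the metric; the toy numbers mirror the Sahara example above.

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    results = pd.DataFrame({
        "region": ["sahara", "sahara", "amazon", "amazon", "amazon", "sahara"],
        "label":  [1, 0, 1, 0, 1, 1],              # did a unicorn actually appear?
        "score":  [0.2, 0.7, 0.9, 0.1, 0.8, 0.4],  # model's predicted probability
    })

    overall_auc = roc_auc_score(results["label"], results["score"])
    print(f"overall AUC: {overall_auc:.2f}")

    # Compare each slice's metric against the overall metric.
    for region, slice_df in results.groupby("region"):
        slice_auc = roc_auc_score(slice_df["label"], slice_df["score"])
        flag = "  <-- investigate" if slice_auc < overall_auc - 0.1 else ""
        print(f"{region}: AUC {slice_auc:.2f}{flag}")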

Use real-world metrics

Model metrics don't necessarily measure the real-world impact of your model. For example, changing a hyperparameter might increase a model's AUC, but how did the change affect user experience? To measure real-world impact, you need to define separate metrics. For example, you could survey users of your model to confirm that they really did see a unicorn when the model predicted they would.

Check for training-serving skew

Training-serving skew means that your input data during training differs from your input data during serving. The following are the two important types of skew:

Schema skew
  • Definition: Training and serving input data do not conform to the same schema.
  • Example: The format or distribution of the serving data changes while your model continues to train on old data.
  • Solution: Use the same schema to validate training and serving data. Ensure you separately check for statistics not checked by your schema, such as the fraction of missing values.

Feature skew
  • Definition: Engineered data differs between training and serving.
  • Example: Feature engineering code differs between training and serving, producing different engineered data.
  • Solution: Similar to schema skew, apply the same statistical rules across training and serving engineered data. Track the number of detected skewed features, and the ratio of skewed examples per feature.

Causes of training-serving skew can be subtle. Always consider what data is available to your model at prediction time. During training, use only the features that you'll have available when serving.
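
The following is a minimal sketch of a feature-skew check that compares per-feature statistics between a training sample and a serving sample; the column name and tolerance thresholds are illustrative only.

    import numpy as np
    import pandas as pd

    def feature_stats(df: pd.DataFrame) -> pd.DataFrame:
        """Per-feature statistics worth comparing across training and serving."""
        return pd.DataFrame({
            "missing_fraction": df.isna().mean(),
            "mean": df.mean(numeric_only=True),
        })

    def skewed_features(train, serving, missing_tol=0.02, mean_tol=0.25):
        train_stats, serving_stats = feature_stats(train), feature_stats(serving)
        flagged = []
        for feature in train.columns:
            missing_gap = abs(train_stats.loc[feature, "missing_fraction"]
                              - serving_stats.loc[feature, "missing_fraction"])
            mean_gap = abs(train_stats.loc[feature, "mean"]
                           - serving_stats.loc[feature, "mean"])
            if missing_gap > missing_tol or mean_gap > mean_tol:
                flagged.append(feature)
        return flagged

    train = pd.DataFrame({"temperature": [0.1, 0.2, 0.3, 0.4]})
    serving = pd.DataFrame({"temperature": [0.9, 1.1, np.nan, 1.0]})
    print(skewed_features(train, serving))  # ['temperature']

Track the number of flagged features over time, and alert when it rises.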

Exercise: Check your understanding

Suppose you have an online store and want to predict how much money you’ll make on a given day. Your ML goal is to predict daily revenue using the number of customers as a feature.

What problem might you encounter?

Check for label leakage

Label leakage means that the ground-truth labels you're trying to predict have inadvertently leaked into your training features. Label leakage is sometimes very difficult to detect.
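
One rough heuristic, sketched below, is to flag features that are suspiciously predictive of the label on their own; the feature names and the 0.95 correlation threshold are illustrative, and a high correlation is a prompt to investigate rather than proof of leakage.

    import pandas as pd

    data = pd.DataFrame({
        "glitter_level": [3.1, 7.4, 6.2, 8.8, 2.5, 2.0],
        "unicorn_photo_uploaded": [0, 1, 0, 1, 1, 0],  # only happens after a sighting: leaks the label
        "saw_unicorn": [0, 1, 0, 1, 1, 0],
    })

    label = "saw_unicorn"
    correlations = data.corr()[label].drop(label).abs()
    print(correlations[correlations > 0.95])  # unicorn_photo_uploaded is perfectly correlated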

Exercise: Check your understanding

Suppose you build a binary classification model to predict whether or not a new hospital patient has cancer. Your model uses features like the following:

  • Patient age
  • Patient gender
  • Prior medical conditions
  • Hospital name
  • Vital signs
  • Test results
  • Heredity

The label is as follows:

  • Boolean: Does the patient have cancer?

You partition the data carefully, ensuring that your training set is well isolated from your validation set and test set. The model performs exceedingly well on the validation set and test set; the metrics are fantastic. Unfortunately, the model performs terribly on new patients in the real world.

Why did this model that excelled on the test set fail miserably in the real world?

Monitor model age throughout pipeline

If the serving data evolves with time but your model isn’t retrained regularly, then you will see a decline in model quality. Track the time since the model was retrained on new data and set a threshold age for alerts. Besides monitoring the model's age at serving, you should monitor the model's age throughout the pipeline to catch pipeline stalls.
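
A minimal sketch of a model-age alert, assuming you record a training timestamp alongside each exported model; the seven-day threshold is an arbitrary example.

    import datetime

    MAX_MODEL_AGE = datetime.timedelta(days=7)

    def check_model_age(trained_at: datetime.datetime) -> None:
        age = datetime.datetime.now(tz=datetime.timezone.utc) - trained_at
        if age > MAX_MODEL_AGE:
            # In production, emit a metric or page the on-call instead of printing.
            print(f"ALERT: serving model is {age.days} days old "
                  f"(threshold: {MAX_MODEL_AGE.days} days)")

    check_model_age(datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc))

Run the same check at each pipeline stage (data generation, training, export, serving) so a stalled stage surfaces as a stale timestamp.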

Test that model weights and outputs are numerically stable

During model training, your weights and layer outputs should not be NaN (not a number) or Inf (infinite). Write tests to check for NaN and Inf values of your weights and layer outputs. Additionally, test that more than half of the outputs of a layer are not zero.
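
A framework-agnostic sketch of these checks, assuming you can pull weights or layer outputs out as NumPy arrays (for example, Keras models expose layer.get_weights()):

    import numpy as np

    def assert_numerically_stable(name: str, tensor: np.ndarray) -> None:
        """Fails if the tensor contains NaN or Inf values."""
        assert np.all(np.isfinite(tensor)), f"{name} contains NaN or Inf values"

    def assert_outputs_not_mostly_zero(name: str, outputs: np.ndarray) -> None:
        """Fails unless more than half of the layer's outputs are nonzero."""
        nonzero_fraction = np.count_nonzero(outputs) / outputs.size
        assert nonzero_fraction > 0.5, (
            f"{name}: only {nonzero_fraction:.0%} of outputs are nonzero")

    hidden_output = np.array([[0.0, 1.3, 0.5, 2.1],
                              [0.7, 0.0, 0.4, 0.0]])
    assert_numerically_stable("hidden_layer", hidden_output)
    assert_outputs_not_mostly_zero("hidden_layer", hidden_output)  # passes: 5 of 8 outputs are nonzero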

Monitor model performance

Your unicorn appearance predictor has been more popular than expected! You’re getting lots of prediction requests and even more training data. You think that's great until you realize that your model is taking more and more memory and time to train. You decide to monitor your model's performance by following these steps:

  • Track model performance by versions of code, model, and data. Such tracking lets you pinpoint the exact cause for any performance degradation.
  • Test the training steps per second for a new model version against the previous version and against a fixed threshold (see the sketch after this list).
  • Catch memory leaks by setting a threshold for memory use.
  • Monitor API response times and track their percentiles. While API response times might be outside your control, slow responses could potentially cause poor real-world metrics.
  • Monitor the number of queries answered per second.
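
A minimal sketch of the steps-per-second check; train_n_steps() is a hypothetical stand-in for your training loop, and the thresholds are placeholder values.

    import time

    def train_n_steps(n_steps: int) -> None:
        """Stand-in for running n_steps of real training."""
        for _ in range(n_steps):
            sum(i * i for i in range(1000))  # pretend work for one step

    def measure_steps_per_second(n_steps: int = 100) -> float:
        start = time.perf_counter()
        train_n_steps(n_steps)
        return n_steps / (time.perf_counter() - start)

    PREVIOUS_VERSION_STEPS_PER_SEC = 900.0  # recorded for the prior model version
    MIN_ACCEPTABLE_STEPS_PER_SEC = 500.0    # fixed threshold, independent of version

    current = measure_steps_per_second()
    print(f"current throughput: {current:.0f} steps/s")
    if current < MIN_ACCEPTABLE_STEPS_PER_SEC:
        print("ALERT: training throughput fell below the fixed threshold")
    if current < 0.9 * PREVIOUS_VERSION_STEPS_PER_SEC:
        print("WARNING: more than 10% slower than the previous model version")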

Test the quality of live model on served data

You’ve validated your model. But what if real-world scenarios, such as unicorn behavior, change after you record your validation data? Then the quality of your served model will degrade. However, testing quality in serving is hard because real-world data is not always labeled. If your serving data is not labeled, consider these tests:

  • Generate labels using human raters.

  • Investigate models that show significant statistical bias in predictions. See Classification: Prediction Bias.

  • Track real-world metrics for your model. For example, if you’re classifying spam, compare your predictions to user-reported spam.

  • Mitigate potential divergence between training and serving data by serving a new model version on a fraction of your queries. As you validate your new serving model, gradually switch all queries to the new version.

Whichever of these tests you use, remember to monitor for both sudden and gradual degradation in prediction quality.
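
For the statistical-bias check, one simple signal to track on served traffic is prediction bias: the gap between the mean prediction and the mean observed outcome once labels (such as user-reported spam) arrive. A minimal sketch, with an illustrative 0.05 tolerance:

    import numpy as np

    predicted_spam_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7])  # model outputs at serving time
    user_reported_spam = np.array([1, 0, 0, 1, 0, 0])               # labels that arrive later

    prediction_bias = predicted_spam_prob.mean() - user_reported_spam.mean()
    print(f"mean prediction {predicted_spam_prob.mean():.2f}, "
          f"observed rate {user_reported_spam.mean():.2f}, bias {prediction_bias:+.2f}")

    if abs(prediction_bias) > 0.05:
        print("WARNING: predictions are drifting away from observed outcomes")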

Randomization

Make your data generation pipeline reproducible. Say you want to add a feature to see how it affects model quality. For a fair experiment, your datasets should be identical except for this new feature. In that spirit, make sure any randomization in data generation can be made deterministic:

  • Seed your random number generators (RNGs). Seeding ensures that the RNG outputs the same values in the same order each time you run it, recreating your dataset.
  • Use invariant hash keys. Hashing is a common way to split or sample data. You can hash each example, and use the resulting integer to decide in which split to place the example. The inputs to your hash function shouldn't change each time you run the data generation program. For example, don't use the current time or a random number as a hash input if you want to recreate your hashes on demand.

The preceding approaches apply both to sampling and splitting your data.
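
A minimal sketch of both ideas: a seeded RNG for any sampling, and an invariant hash key for splitting. hashlib is used because Python's built-in hash() is salted per process and therefore isn't reproducible across runs.

    import hashlib
    import numpy as np

    rng = np.random.default_rng(seed=42)  # same sample order on every run
    sampled_noise = rng.normal(size=3)

    def split_for(example_id: str) -> str:
        """Deterministically assigns an example to a split by hashing a stable ID."""
        digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 10  # illustrative 80/10/10 split
        if bucket < 8:
            return "train"
        return "eval" if bucket == 8 else "test"

    print(split_for("example-00017"))  # same answer every time the pipeline runs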

Considerations for hashing

Imagine again that you're collecting Search queries and using hashing to include or exclude queries. If the hash key uses only the query, then across multiple days of data you'll either always include that query or always exclude it. Always including or always excluding a query is bad because:

  • Your training set will see a less diverse set of queries.
  • Your evaluation sets will be artificially hard, because they won't overlap with your training data. In reality, at serving time, you'll have seen some of the live traffic in your training data, so your evaluation should reflect that.

Instead, you can hash on query + date, which results in a different bucket assignment for the same query each day.

 

Figure 7. Hashing on query versus hashing on query + query time. Hashing solely on the query places a query's data in the same bucket (Training, Evaluation, or Ignored) every day, whereas hashing on query + query time places it in different buckets on different days.
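
A minimal sketch contrasting the two hash keys; the bucket boundaries are illustrative. Hashing on the query alone pins a query to one bucket forever, while adding the date moves it between buckets from day to day.

    import hashlib

    def bucket_for(key: str) -> str:
        value = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 10
        if value < 8:
            return "Training"
        return "Evaluation" if value == 8 else "Ignored"

    query = "where do unicorns live"
    for date in ["2024-06-01", "2024-06-02", "2024-06-03"]:
        print(date,
              "| query only:", bucket_for(query),                 # same bucket every day
              "| query + date:", bucket_for(f"{query} {date}"))   # can change day to day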