Production ML systems: Static versus dynamic inference

Inference is the process of making predictions by applying a trained model to unlabeled examples. Broadly speaking, a model can infer predictions in one of two ways:

  • Static inference (also called offline inference or batch inference) means the model makes predictions on a large set of common unlabeled examples ahead of time and then caches those predictions for later lookup.
  • Dynamic inference (also called online inference or real-time inference) means that the model makes predictions only on demand, for example, when a client requests a prediction. Both patterns are sketched in code after this list.

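To make the distinction concrete, here is a minimal, hypothetical sketch of both patterns in Python. The model_predict function is a trivial stand-in for a trained model, and the example strings and cache are purely illustrative, not part of any particular library.

    # Stand-in for a trained model; pretend each call is expensive.
    def model_predict(example: str) -> float:
        return len(example) / 100.0

    # --- Static (offline / batch) inference ---
    # Predict for all common examples ahead of time and cache the results.
    common_examples = ["blue shoes", "red shoes", "running shoes"]
    prediction_cache = {ex: model_predict(ex) for ex in common_examples}

    def serve_static(example: str):
        # Serving is just a cache lookup; the model never runs at request time.
        return prediction_cache.get(example)  # None on a cache miss

    # --- Dynamic (online / real-time) inference ---
    def serve_dynamic(example: str) -> float:
        # The model runs on demand, once per client request.
        return model_predict(example)

    print(serve_static("blue shoes"))     # served from the cache
    print(serve_static("hiking boots"))   # None: never precomputed
    print(serve_dynamic("hiking boots"))  # computed on the spot
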
To use an extreme example, imagine a very complex model that takes one hour to infer a prediction. This would probably be an excellent situation for static inference:

Figure 4. In static inference, a model generates predictions, which are then cached on a server.

Suppose this same complex model mistakenly uses dynamic inference instead of static inference. If many clients request predictions around the same time, most of them won't receive their predictions for hours or days: with a one-hour inference time, even a modest backlog of requests translates into multi-hour waits.

Now consider a model that infers quickly, perhaps in 2 milliseconds, using relatively few computational resources. In this situation, clients can receive predictions quickly and efficiently through dynamic inference, as suggested in Figure 5.

Figure 5. In dynamic inference, a model infers predictions on demand.

Static inference

Static inference offers certain advantages and disadvantages.

Advantages

  • No need to worry much about the cost of inference, because predictions are computed ahead of time in a batch job rather than per request.
  • Can verify predictions before pushing them to the cache (illustrated in the sketch after these lists).

Disadvantages

  • Can only serve cached predictions, so the system might not be able to serve predictions for uncommon input examples.
  • Update latency is likely measured in hours or days.
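
To make these trade-offs concrete, here is a minimal, hypothetical sketch of a static inference pipeline: a batch job predicts over a fixed set of examples, a verification step filters out-of-range predictions before they are pushed, and serving is just a cache lookup. The model, the sanity check, and the example set are illustrative stand-ins, not a specific library or production pipeline.

    # Hypothetical static (batch) inference pipeline.
    def model_predict(example: str) -> float:
        # Stand-in for an expensive trained model (returns a 0..1 score).
        return min(len(example) / 20.0, 1.0)

    def looks_valid(prediction: float) -> bool:
        # Post-verification: reject predictions outside the expected range.
        return 0.0 <= prediction <= 1.0

    def run_batch_job(examples):
        # Predict offline, verify, and build the cache to push to serving.
        cache = {}
        for example in examples:
            prediction = model_predict(example)
            if looks_valid(prediction):   # verify before pushing
                cache[example] = prediction
        return cache

    prediction_cache = run_batch_job(["blue shoes", "red shoes", "running shoes"])

    def serve(example: str):
        # Cached predictions only: uncommon inputs get no prediction, and the
        # cache stays stale until the next batch job reruns.
        return prediction_cache.get(example)

    print(serve("red shoes"))      # cached prediction
    print(serve("hiking boots"))   # None: uncommon input, never precomputed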

Dynamic inference

Dynamic inference offers certain advantages and disadvantages.

Advantages

  • Can infer a prediction for any new item as it comes in, which is great for long-tail (less common) predictions.

Disadvantages

  • Compute-intensive and latency-sensitive. This combination may limit model complexity; that is, you might have to build a simpler model that can infer predictions more quickly than a complex model could.
  • Monitoring needs are more intensive (see the sketch after this list).
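
A matching, equally hypothetical sketch of dynamic inference shows why compute, latency, and monitoring dominate: the model runs on every request, and there is no chance to verify a prediction before the client sees it. The stand-in model, the latency budget, and the logging below are illustrative assumptions, not a particular serving framework.

    import time

    # Hypothetical dynamic (online) inference with basic monitoring.
    def model_predict(example: str) -> float:
        # Stand-in for a trained model; its cost is paid on every request.
        return min(len(example) / 20.0, 1.0)

    LATENCY_BUDGET_MS = 50.0  # illustrative latency target, not a real default

    def serve(example: str) -> float:
        start = time.perf_counter()
        prediction = model_predict(example)   # runs on demand, for any input
        latency_ms = (time.perf_counter() - start) * 1000.0

        # The prediction can't be verified before it's returned, so log latency
        # and the prediction itself for aggregate monitoring after the fact.
        if latency_ms > LATENCY_BUDGET_MS:
            print(f"WARNING: slow prediction ({latency_ms:.1f} ms) for {example!r}")
        print(f"served {example!r} -> {prediction:.2f} in {latency_ms:.3f} ms")
        return prediction

    serve("hiking boots")  # a long-tail input that a static cache might miss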

Exercises: Check your understanding

Which three of the following four statements are true of static inference?

  • The model must create predictions for all possible inputs.
    True. The model must make predictions for all possible inputs and store them in a cache or lookup table. If the set of things the model is predicting is limited, static inference might be a good choice. However, for free-form inputs like user queries, which have a long tail of unusual or rare items, static inference can't provide full coverage.
  • The system can verify inferred predictions before serving them.
    True. This is a useful aspect of static inference.
  • For a given input, the model can serve a prediction more quickly than dynamic inference.
    True. Static inference can almost always serve predictions faster than dynamic inference can.
  • You can react quickly to changes in the world.
    False. This is a disadvantage of static inference.
Which one of the following statements is true of dynamic inference?

  • You can provide predictions for all possible items.
    True. This is a strength of dynamic inference. Any request that comes in is given a score, so dynamic inference handles long-tail distributions (those with many rare items), like the space of all possible sentences written in movie reviews.
  • You can do post-verification of predictions before they are used.
    False. In general, it's not possible to post-verify all predictions before they are used, because predictions are made on demand. You can, however, monitor aggregate prediction quality to provide some level of quality checking, but such checks sound the fire alarm only after the fire has already spread.
  • When performing dynamic inference, you don't need to worry about prediction latency (the lag time for returning predictions) as much as when performing static inference.
    False. Prediction latency is often a real concern in dynamic inference, and you can't necessarily fix latency issues just by adding more inference servers.