2023-01-07

ML serving patterns

Serving patterns for ML systems

There are various ways of serving inference in machine learning systems. In this article, I introduce the following serving patterns for ML systems, which are described in Mercari's ml-system-design-pattern repository on GitHub.

  • Web-single pattern
  • Synchronous pattern
  • Asynchronous pattern
  • Batch pattern
  • Prep-pred pattern
  • Serial microservices pattern
  • Parallel microservices pattern
  • Inference cache pattern
  • Data cache pattern

Web-single pattern

The Web-single pattern bundles the model with a web server. By placing the REST (or gRPC) interface, preprocessing, and the trained model on the same server, you can build a simple inference server.

[Figure: Web-single pattern]

  • Pros
    • Web, preprocessing, and inference can be implemented in the same programming language
    • Simple structure and easy operation
  • Cons
    • Because the web, preprocessing, and inference components are packaged into the same container image, they cannot be updated individually
  • Use cases
    • Simple configuration for quick release of inference server
    • When only one model returns inferences synchronously
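
As a concrete illustration, below is a minimal sketch of the Web-single pattern, assuming FastAPI as the web framework and a scikit-learn model serialized with joblib; the file name model.joblib and the single-vector feature schema are hypothetical.

# Web-single pattern sketch: the REST interface, preprocessing, and the trained
# model all live in one process (and one container image).
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical model bundled with the server


class PredictRequest(BaseModel):
    features: list[float]


def preprocess(features: list[float]) -> np.ndarray:
    # Preprocessing runs in the same process as the web layer and the model.
    return np.asarray(features, dtype=np.float32).reshape(1, -1)


@app.post("/predict")
def predict(req: PredictRequest):
    x = preprocess(req.features)
    prediction = model.predict(x)
    return {"prediction": prediction.tolist()}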

Synchronous pattern

In the synchronous pattern, the web server and the inference server are separated, and inference is handled synchronously. The client waits until it receives a response to its inference request.

[Figure: Synchronous pattern]

  • Pros
    • Simple structure and easy to operate
    • The client does not move on to the next step until inference is complete, which makes the workflow easy to reason about
  • Cons
    • Inference tends to be a performance bottleneck
    • The client waits while inference runs, so you need to consider how to keep the user experience from degrading during that time
  • Use cases
    • When the workflow of an application is such that it cannot proceed to the next step until the inference result is obtained
    • When the workflow depends on the inference result
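
From the client's point of view, the Synchronous pattern looks like the sketch below; the inference server address and its /predict endpoint are assumptions for illustration.

# Synchronous pattern sketch: the client blocks until the inference server
# responds, so downstream steps run only once the result is available.
import requests

INFERENCE_URL = "http://inference-server:8000/predict"  # hypothetical address


def predict_sync(features: list[float]) -> dict:
    response = requests.post(INFERENCE_URL, json={"features": features}, timeout=10.0)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = predict_sync([0.1, 0.2, 0.3])
    print(result)  # the workflow continues only after inference has completed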

Asynchronous pattern

In the Asynchronous pattern, a queue or cache is placed between the request and the inference: inference requests are enqueued and results are retrieved asynchronously. By separating the request from the inference, the client workflow does not have to wait for the inference to finish.

[Figure: Asynchronous pattern]

  • Pros
    • Can decouple inference from the client
    • Less impact on the client even when inference takes a long time
  • Cons
    • Requires queues and caches
    • Not suitable for real-time processing
  • Use cases
    • For workflows in which the immediate post-processing behavior does not depend on the inference result
    • When the client and the output destination of the inference results are separate
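
One minimal way to sketch the Asynchronous pattern is with Redis acting as both the queue and the result cache; the queue name, key format, and the run_inference callable are assumptions for illustration.

# Asynchronous pattern sketch: the client enqueues a job and returns at once,
# a worker consumes the queue, and results are picked up later from the cache.
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)
QUEUE = "inference:queue"  # hypothetical queue name


def enqueue_request(features: list[float]) -> str:
    """Client side: register a job and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    r.rpush(QUEUE, json.dumps({"job_id": job_id, "features": features}))
    return job_id


def get_result(job_id: str):
    """Client side: poll for the result later; None means not ready yet."""
    raw = r.get(f"inference:result:{job_id}")
    return json.loads(raw) if raw else None


def worker_loop(run_inference):
    """Inference side: consume jobs and store results in the cache."""
    while True:
        _, raw = r.blpop(QUEUE)  # blocks until a job arrives
        job = json.loads(raw)
        prediction = run_inference(job["features"])  # hypothetical model call
        r.set(f"inference:result:{job['job_id']}", json.dumps(prediction))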

Batch pattern

When inference does not need to be performed in real time, a batch job can be launched to run inference periodically. In this pattern, accumulated data is processed at regular intervals, for example overnight, and the results are saved.

[Figure: Batch pattern]

  • Pros
    • Flexible server resource management
    • If inference fails for some reason, it can simply be re-run
  • Cons
    • A job management server is needed
  • Use cases
    • When real-time or quasi-real-time inference is not required
    • When you want to run summary inference over accumulated historical data
    • When you want to run inference periodically, for example nightly, hourly, or monthly
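
A minimal sketch of the Batch pattern is shown below: a standalone job, launched by cron or a job management server, reads the data accumulated since the last run and writes the predictions back out; the file paths and CSV format are assumptions for illustration.

# Batch pattern sketch: run periodically (e.g., nightly) rather than per request.
import joblib
import pandas as pd


def run_batch_job(input_path: str, output_path: str, model_path: str) -> None:
    model = joblib.load(model_path)              # trained model
    df = pd.read_csv(input_path)                 # data accumulated since the last run
    df["prediction"] = model.predict(df.values)  # assumes all columns are features
    df.to_csv(output_path, index=False)          # persist results for later use


if __name__ == "__main__":
    # Scheduled externally, e.g. with cron: 0 2 * * * python batch_inference.py
    run_batch_job("accumulated_data.csv", "predictions.csv", "model.joblib")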

Prep-pred pattern

Preprocessing and inference may require different resources and libraries. In such cases, you can split preprocessing and inference into separate servers or containers to streamline development and operation. While separating the preprocessor and the inferencer allows efficient resource utilization, independent development, and fault isolation, it also requires tuning each resource, designing the network between them, and managing versions. This pattern is often used when a deep learning model serves as the inferencer.

[Figure: Simple prep-pred pattern]

[Figure: Microservice prep-pred pattern]

  • Pros
    • Enables fault isolation
    • Allows flexibility in selecting libraries and increasing or decreasing resources
  • Cons
    • Increased complexity of managed servers and network configuration, resulting in increased operational load
    • Network between preprocessing server and inference server may become a bottleneck
  • Use cases
    • Libraries, code base, and resource load are different between preprocessing and inference
    • When you want to improve availability by splitting up preprocessing and inference
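
The sketch below illustrates the prep-pred split with a preprocessing service that resizes images and forwards the tensors to a separate inference server over REST; the inference server URL, its /v1/predict endpoint, and the 224x224 input size are assumptions for illustration.

# Prep-pred pattern sketch: CPU-bound preprocessing runs in its own service
# and can be scaled independently of the (often GPU-bound) inference server.
import io

import numpy as np
import requests
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()
INFERENCE_URL = "http://inference-server:8501/v1/predict"  # hypothetical address


def preprocess(image_bytes: bytes) -> np.ndarray:
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0


@app.post("/predict")
def predict(file: UploadFile):
    x = preprocess(file.file.read())
    # Forward the preprocessed tensor to the separate inference server.
    resp = requests.post(INFERENCE_URL, json={"inputs": x.tolist()}, timeout=10.0)
    resp.raise_for_status()
    return resp.json()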

Serial microservices pattern

To run multiple inference models in sequence, deploy each machine learning model on a separate server. Inference requests are sent to each server in turn, and the final result is obtained after the last inference completes.

[Figure: Microservice vertical pattern]

  • Pros
    • Each inference model can be executed in turn
    • Can be configured to select the inference request destination for the next model depending on the results of the previous inference model
    • By dividing the server and codebase for each inference, resource efficiency and fault isolation can be improved.
  • Cons
    • Multiple inferences are executed in sequence, which may increase the time required.
    • If the result of the previous inference is not obtained, the next inference cannot be executed, creating a bottleneck
    • System configuration may become more complex
  • Use cases
    • When multiple inferences are performed for a single operation
    • When there is an execution order or dependency between multiple inferences
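
Below is a minimal sketch of the Serial microservices pattern: two hypothetical inference services are called in order, with the second request built from the first result; the service URLs and response fields are assumptions for illustration.

# Serial microservices pattern sketch: each model runs on its own server and
# the output of one becomes the input of the next.
import requests

FIRST_MODEL_URL = "http://model-a:8000/predict"   # hypothetical first model
SECOND_MODEL_URL = "http://model-b:8000/predict"  # hypothetical second model


def predict_serial(features: list[float]) -> dict:
    # Step 1: the first model runs and its output becomes the next input.
    first = requests.post(FIRST_MODEL_URL, json={"features": features}, timeout=10.0)
    first.raise_for_status()
    intermediate = first.json()

    # Step 2: the second model cannot start until step 1 has finished,
    # which is why the slowest link becomes the bottleneck.
    second = requests.post(SECOND_MODEL_URL, json=intermediate, timeout=10.0)
    second.raise_for_status()
    return second.json()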

Parallel microservices pattern

In the parallel microservice pattern, inference requests are sent to each inference server in parallel to obtain multiple inference results.

[Figure: Microservice horizontal pattern (synchronous)]

[Figure: Microservice horizontal pattern (asynchronous)]

  • Pros
    • Splitting the inference server allows for resource coordination and fault isolation
    • Flexible system construction without dependencies among inference workflows
  • Cons
    • System can be complex due to running multiple inferences
    • When running synchronously, the time required depends on the slowest inference
    • With asynchronous execution, a downstream workflow must absorb the timing differences between inference servers
  • Use cases
    • Multiple inferences without dependencies are executed in parallel
    • In the case of a workflow where multiple inference results are aggregated at the end
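
Here is a minimal sketch of the synchronous variant of the Parallel microservices pattern, using httpx and asyncio to fan out requests and gather the results; the three endpoint URLs are assumptions for illustration.

# Parallel microservices pattern sketch (synchronous aggregation): requests
# are sent to several inference servers at once and gathered at the end.
import asyncio

import httpx

ENDPOINTS = [
    "http://model-a:8000/predict",  # hypothetical inference servers
    "http://model-b:8000/predict",
    "http://model-c:8000/predict",
]


async def predict_parallel(features: list[float]) -> list[dict]:
    async with httpx.AsyncClient(timeout=10.0) as client:
        tasks = [client.post(url, json={"features": features}) for url in ENDPOINTS]
        # Overall latency is bounded by the slowest of the parallel calls.
        responses = await asyncio.gather(*tasks)
    return [resp.json() for resp in responses]


if __name__ == "__main__":
    results = asyncio.run(predict_parallel([0.1, 0.2, 0.3]))
    print(results)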

Inference cache pattern

In the inference cache pattern, inference results are stored in a cache, and subsequent requests for the same data are looked up in the cache instead of being re-inferred.

[Figure: Prediction cache pattern]

  • Pros
    • Can speed up responses and reduce the load on the inference server
  • Cons
    • Incurs cache server costs (cache stores tend to cost more and offer less capacity than ordinary storage)
    • Need to consider cache clearing policy
  • Use cases
    • Frequent inference requests for the same data
    • For input data that can be searched by cache keys
    • When you want to process inference at high speed
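
A minimal sketch of the Inference cache pattern using Redis follows: the cache key is a hash of the input, hits skip the model entirely, and misses are written back with a TTL as a simple cache clearing policy; run_inference is a hypothetical callable.

# Inference cache pattern sketch: look up the result before running the model.
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # cache clearing policy: expire entries after 1 hour


def cache_key(features: list[float]) -> str:
    payload = json.dumps(features, sort_keys=True).encode()
    return "inference:" + hashlib.sha256(payload).hexdigest()


def predict_with_cache(features: list[float], run_inference) -> dict:
    key = cache_key(features)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no inference needed
    result = run_inference(features)  # cache miss: run the model
    r.set(key, json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result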

Data cache pattern

The data cache pattern caches data. Content such as images and text tends to be large, so if the input data or the preprocessed data is large, caching it can reduce the cost of repeated data acquisition and preprocessing.

[Figure: Input data cache]

[Figure: Preprocessed data cache]

  • Pros
    • Can reduce data acquisition and preprocessing overhead
    • Fast inference startup
  • Cons
    • Cache server cost is incurred
    • Need to consider cache clearing policy
  • Use cases
    • Workflows with repeated inference requests for the same data
    • When the same input data is requested repeatedly
    • For input data that can be searched by cache keys
    • When you want to speed up data processing
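
Below is a minimal sketch of the Data cache pattern (the preprocessed-data variant) using Redis: preprocessed data is cached by a data ID so repeated requests skip both download and preprocessing; fetch_raw_data and preprocess are hypothetical callables.

# Data cache pattern sketch: cache the preprocessed representation of content.
import pickle

import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 24 * 3600  # cache clearing policy: expire after one day


def get_preprocessed(data_id: str, fetch_raw_data, preprocess):
    key = f"data:preprocessed:{data_id}"
    cached = r.get(key)
    if cached is not None:
        return pickle.loads(cached)  # cache hit: skip fetch and preprocessing
    raw = fetch_raw_data(data_id)    # e.g., download an image or document
    processed = preprocess(raw)      # e.g., resize or tokenize
    r.set(key, pickle.dumps(processed), ex=CACHE_TTL_SECONDS)
    return processed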

References

https://github.com/mercari/ml-system-design-pattern/tree/master/Serving-patterns

Ryusei Kakujo
