Inference Batcher

    The batcher is implemented in the KServe model agent sidecar, so requests first hit the agent sidecar; when a batch prediction is triggered, the batched request is then sent to the model server container for inference.

    • We use a webhook to inject the model agent container into the InferenceService pod to perform the batching when the batcher is enabled.

    • We use Go channels to transfer data between the HTTP request handler and the batcher goroutines.

    • When the number of instances (for example, the number of pictures) reaches the maxBatchSize, or the latency reaches the maxLatency, a batch prediction is triggered, as illustrated in the sketch after this list.
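
    The following Go sketch illustrates the channel-based batching pattern described above. It is a minimal illustration only, not the actual model agent code: the request type, the batcher function, and the hard-coded maxBatchSize/maxLatency values are hypothetical, and the latency window here is assumed to start when the first request of a new batch is queued.

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // request stands in for a single inference instance handed to the batcher
    // by the HTTP request handler (hypothetical type, for illustration only).
    type request struct {
        instance string
    }

    const (
        maxBatchSize = 32
        maxLatency   = 500 * time.Millisecond // illustrative value
    )

    // batcher drains the channel and triggers a "prediction" whenever the batch
    // reaches maxBatchSize or the first queued request has waited maxLatency.
    func batcher(reqs <-chan request) {
        batch := make([]request, 0, maxBatchSize)
        timer := time.NewTimer(maxLatency)
        defer timer.Stop()

        flush := func() {
            if len(batch) == 0 {
                return
            }
            fmt.Printf("triggering prediction for a batch of %d instances\n", len(batch))
            batch = batch[:0]
        }

        for {
            select {
            case r, ok := <-reqs:
                if !ok { // handler side closed the channel: flush what is left and stop
                    flush()
                    return
                }
                if len(batch) == 0 {
                    // Start the latency window at the first request of a new batch,
                    // draining any stale expiry first so Reset is safe.
                    if !timer.Stop() {
                        select {
                        case <-timer.C:
                        default:
                        }
                    }
                    timer.Reset(maxLatency)
                }
                batch = append(batch, r)
                if len(batch) >= maxBatchSize {
                    flush()
                }
            case <-timer.C:
                flush()
            }
        }
    }

    func main() {
        reqs := make(chan request)
        done := make(chan struct{})
        go func() {
            batcher(reqs)
            close(done)
        }()

        // In the agent, the HTTP request handler sends each incoming instance
        // into the channel; here we simulate a few requests and stop.
        for i := 0; i < 5; i++ {
            reqs <- request{instance: fmt.Sprintf("picture-%d", i)}
        }
        close(reqs)
        <-done
    }
    ```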

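    The YAML below is a minimal sketch of an InferenceService with the batcher enabled on the predictor. The service name, model format, storageUri, and the batcher values shown here are illustrative placeholders rather than values prescribed by this guide; the part being demonstrated is the batcher block with maxBatchSize and maxLatency.

    ```yaml
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: "torchserve"          # illustrative name
    spec:
      predictor:
        minReplicas: 1
        batcher:
          maxBatchSize: 32        # max instances per batch
          maxLatency: 500         # max wait in milliseconds before flushing a batch
        model:
          modelFormat:
            name: pytorch
          storageUri: gs://your-bucket/path/to/torchserve-model   # placeholder URI
    ```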

    • maxBatchSize: the max batch size for triggering a prediction.

    • maxLatency: the max latency for triggering a prediction (in milliseconds).

    All of the fields below have default values in the code; you can configure them or leave them unset as you wish.

    • maxBatchSize: 32.

    • maxLatency: 60.

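    Assuming the manifest above is saved as torchserve-batcher.yaml (a hypothetical filename), apply it with kubectl:

    ```bash
    kubectl apply -f torchserve-batcher.yaml
    ```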

    We can now send requests to the PyTorch model using hey, an HTTP load-testing tool. The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT, as in the sketch below.
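
    The commands below are a sketch that assumes an Istio ingress gateway in the istio-system namespace and the illustrative InferenceService name (torchserve), model name (mnist), and payload file (input.json) from the example above; adjust them to your own cluster and model.

    ```bash
    # Resolve the ingress gateway address (assumes Istio in the istio-system namespace).
    INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

    # Host header of the InferenceService; the name "torchserve" is illustrative.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)

    # Send concurrent POST requests for 10 seconds with 5 workers so that several
    # requests land inside the same maxLatency window; model name and payload are placeholders.
    MODEL_NAME=mnist
    hey -z 10s -c 5 -m POST -host "${SERVICE_HOSTNAME}" \
      -H "Content-Type: application/json" \
      -D ./input.json \
      "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict"
    ```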

    The requests go to the model agent container first; the batcher in the sidecar container batches them and sends the batched inference request to the predictor container.

    Note

    If the interval between sending two requests is less than maxLatency, the two requests are served in the same batch and the returned batchId will be the same.