Custom Batcher - more complicated batching #5497
-
Hi folks, I have a Python backend that works as a dispatcher to other models in the model repository. Since I'm working with a memory-constrained system, the backend also loads and unloads models from memory using the model management extension. If I batch together requests destined for the same model (the destination is deterministic based on the input), I could reduce the latency associated with loading and unloading models.

Say the maximum batch size my models accept is 64. My understanding is that the dynamic batcher (which sits in front of my dispatcher) with default settings keeps appending requests, in order, to a "pending batch". When it decides that a request cannot be added, it finalizes the batch, sends it to the model, and starts a new batch from the last unbatched request.

I'd like to be able to wait until a certain number of requests (> max_batch_size) has arrived or some delay has expired, and then, given all of the requests queued thus far, create batches such that each batch maximizes the overlap of the models it will be dispatched to. Concretely, I mean something like the grouping sketched below.

Is this achievable through the custom batcher API, or will I have to get creative and implement this in the backend, perhaps by increasing the batch size the dispatcher accepts and manually making smaller batches at the backend? Many thanks for any pointers!
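Here's a rough sketch of the grouping I have in mind (not real backend code; `route_to_model` and the request fields are made up purely to illustrate):

```python
from collections import defaultdict

MAX_BATCH_SIZE = 64  # max batch size the downstream models accept


def route_to_model(request):
    # Hypothetical: in my dispatcher the destination is deterministic from the
    # input, so pretend each request carries it as a plain field here.
    return request["destination_model"]


def make_batches(queued_requests):
    # Group everything queued so far by destination model, then split each
    # group into chunks of at most MAX_BATCH_SIZE, so every batch targets a
    # single model and model load/unload churn is minimized.
    by_model = defaultdict(list)
    for req in queued_requests:
        by_model[route_to_model(req)].append(req)

    batches = []
    for model_name, reqs in by_model.items():
        for i in range(0, len(reqs), MAX_BATCH_SIZE):
            batches.append((model_name, reqs[i : i + MAX_BATCH_SIZE]))
    return batches
```

So, for example, 70 queued requests for model_a and 10 for model_b would become batches of 64 and 6 for model_a plus one batch of 10 for model_b, meaning each load of model_a serves as many requests as possible.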
Replies: 1 comment 1 reply
-
You can set a batch delay for batching. That will wait to send a batch until a preferred or max batch size is reached; if neither is reached by the end of the delay, it will send the current batch. That should take care of your timeout requirement. Since the custom batcher works on top of the dynamic batcher, that behavior is already handled for you. If you want to add behavior on top of that, such as only sending once a minimum batch size has been received, you can do so via the custom batcher API, as you mentioned.

However, I don't fully understand how the request routing for your backend works. It sounds like it's complicated and happens outside the dynamic batcher? That may be the only wrinkle. If you're scheduling these requests for models that are already loaded, then you don't need to worry about which model each request goes to (and there's no benefit, since batching is done on a per-model basis). However, it sounds like your backend tracks requests and loads/unloads models internally. In that case, I think your custom logic would be happening outside of the scheduler. It sounds like you're forwarding the requests to the scheduler (since you'd otherwise get errors for models that aren't loaded), in which case you'd need to extend your logic to cover the cases you describe. The custom batcher works on top of the dynamic batcher, and therefore the scheduler, so you'd actually need to send the requests to the server with the model already specified.
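For reference, the delay and preferred batch sizes are set per model in config.pbtxt under dynamic_batching. A minimal sketch (the values are placeholders, not recommendations):

```
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 32, 64 ]
  max_queue_delay_microseconds: 100000
}
```

With something like this, the scheduler holds queued requests for up to the configured delay while trying to form a preferred-size batch, and sends whatever it has once the delay expires.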