Custom Batcher - more complicated batching #5497
-
Hi folks, I have a Python backend that works as a dispatcher to other models in the model repository. Since I'm working with a memory-constrained system, the backend also loads and unloads models from memory using the model management extension. If I batch together requests destined for the same model (the destination is deterministic based on the input), I could reduce the latency associated with loading and unloading models.

Say the maximum batch size my models accept is 64. My understanding is that the dynamic batcher (which sits in front of my dispatcher) with default settings keeps appending requests, in order, to a "pending batch". When it decides that a request cannot be added, it finalizes the batch, sends it to the model, and starts a new batch from the last unbatched request.

I'd like to be able to wait until a certain number of requests (> max_batch_size) has arrived or some delay has expired, and then, given all of the requests queued thus far, create batches such that each batch maximizes the overlap of the models it will be dispatched to. Concretely, I mean something like the grouping sketched below.

Is this achievable through the custom batcher API, or will I have to get creative and implement this in the backend, perhaps by increasing the batch size the dispatcher accepts and manually making smaller batches at the backend? Many thanks for any pointers!
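Here's a rough sketch of the grouping I have in mind (not real backend code; `route_to_model` and the request fields are made up purely to illustrate):

```python
from collections import defaultdict

MAX_BATCH_SIZE = 64  # max batch size the downstream models accept


def route_to_model(request):
    # Hypothetical: in my dispatcher the destination is deterministic from the
    # input, so pretend each request carries it as a plain field here.
    return request["destination_model"]


def make_batches(queued_requests):
    # Group everything queued so far by destination model, then split each
    # group into chunks of at most MAX_BATCH_SIZE, so every batch targets a
    # single model and model load/unload churn is minimized.
    by_model = defaultdict(list)
    for req in queued_requests:
        by_model[route_to_model(req)].append(req)

    batches = []
    for model_name, reqs in by_model.items():
        for i in range(0, len(reqs), MAX_BATCH_SIZE):
            batches.append((model_name, reqs[i : i + MAX_BATCH_SIZE]))
    return batches
```

So, for example, 70 queued requests for model_a and 10 for model_b would become batches of 64 and 6 for model_a plus one batch of 10 for model_b, meaning each load of model_a serves as many requests as possible.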
Replies: 1 comment 1 reply
-
You can set a batch delay for batching. That will wait to send a batch until a preferred or max batch size is reached; if neither is reached by the end of the delay, it will send the current batch. That should take care of your timeout requirement. Since the custom batcher works on top of the dynamic batcher, that behavior is already handled for you. If you want to add behavior on top of that, such as only sending once a minimum batch size has been received, you can do so via the custom batcher API, as you mentioned.

However, I don't fully understand how the request routing for your backend works. It sounds like it's complicated and happens outside the dynamic batcher? That may be the only wrinkle. If you're scheduling these requests for models that are already loaded, then you don't need to worry about which model each request goes to (and there's no benefit, since batching is done on a per-model basis). However, it sounds like your backend tracks requests and loads/unloads models internally. In that case, I think your custom logic would be happening outside of the scheduler. It sounds like you're forwarding the requests to the scheduler (since you'd otherwise get errors for models that aren't loaded), in which case you'd need to extend your logic to cover the cases you describe. The custom batcher works on top of the dynamic batcher, and therefore the scheduler, so you'd actually need to send the requests to the server with the model already specified.
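For reference, the delay and preferred batch sizes are set per model in config.pbtxt under dynamic_batching. A minimal sketch (the values are placeholders, not recommendations):

```
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 32, 64 ]
  max_queue_delay_microseconds: 100000
}
```

With something like this, the scheduler holds queued requests for up to the configured delay while trying to form a preferred-size batch, and sends whatever it has once the delay expires.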