Is your feature request related to a problem? Please describe.
It is unclear how to work efficiently with models like SeamlessM4T through Triton Inference Server. Documentation and resources on setting up parallel, multi-threaded inference for such multimodal models in production environments are limited.
Describe the solution you'd like
I would like to see a detailed guide, or native support in Triton, for deploying SeamlessM4T with optimized multi-threaded inference across multiple GPUs. This would enable parallel processing of streams in real time, similar to how it is done for models like Llama or GPT.
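For illustration, here is a minimal sketch of the kind of model configuration I have in mind, assuming SeamlessM4T is wrapped in Triton's Python backend. The model name, tensor names, and shapes below are placeholders, not an official configuration:

```
# config.pbtxt (hypothetical) for a SeamlessM4T speech-to-text-translation
# model served through Triton's Python backend.
name: "seamless_m4t"
backend: "python"
max_batch_size: 8

input [
  {
    name: "AUDIO"             # mono waveform samples (placeholder name)
    data_type: TYPE_FP32
    dims: [ -1 ]              # variable-length audio
  }
]
output [
  {
    name: "TRANSLATED_TEXT"   # placeholder name
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Two execution instances on each of GPU 0 and GPU 1, so four model
# instances can serve requests concurrently.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# Let Triton group concurrent requests into batches instead of relying on
# a custom client-side batching pipeline.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```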
Describe alternatives you've considered
I've considered using custom pipelines and batching mechanisms, but these approaches are often inefficient or lack sufficient support for this model, leading to increased latency and resource overhead.
Additional context
This feature would be beneficial for production environments where real-time processing and scalable deployment are critical. Providing sample configurations or scripts for Triton users working with multimodal models would be very helpful.
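To make the request concrete, below is a rough sketch of the kind of client script it would help to see documented: fanning audio segments out to Triton over multiple threads so several streams are translated in parallel. It reuses the hypothetical model and tensor names from the configuration sketch above and the standard `tritonclient` HTTP API; it is meant as an illustration under those assumptions, not a validated example.

```python
# Hypothetical parallel-inference client for a SeamlessM4T model served by Triton.
# The model name and tensor names ("seamless_m4t", "AUDIO", "TRANSLATED_TEXT") are
# placeholders matching the config sketch above, not an official interface.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient


def translate_segment(audio: np.ndarray, url: str = "localhost:8000") -> str:
    # One client per thread, since the HTTP client is not guaranteed to be thread-safe.
    client = httpclient.InferenceServerClient(url=url)

    # Batched models expect a leading batch dimension of 1 per request;
    # Triton's dynamic batcher merges concurrent requests server-side.
    inp = httpclient.InferInput("AUDIO", [1, audio.shape[0]], "FP32")
    inp.set_data_from_numpy(audio[np.newaxis, :].astype(np.float32))

    out = httpclient.InferRequestedOutput("TRANSLATED_TEXT")
    result = client.infer("seamless_m4t", inputs=[inp], outputs=[out])

    # String outputs come back as bytes; decode the single element.
    return result.as_numpy("TRANSLATED_TEXT").reshape(-1)[0].decode("utf-8")


if __name__ == "__main__":
    # Stand-ins for real audio streams: eight one-second 16 kHz segments.
    segments = [np.random.randn(16000).astype(np.float32) for _ in range(8)]

    # Issue the requests concurrently so the GPU instances are kept busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for text in pool.map(translate_segment, segments):
            print(text)
```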