Is your feature request related to a problem? Please describe.
It is unclear how to work efficiently with models like SeamlessM4T through Triton Inference Server. Documentation and resources on setting up parallel, multi-threaded inference for such multimodal models in production environments are limited.
Describe the solution you'd like
I would like to see a detailed guide, or native support in Triton, for deploying SeamlessM4T with optimized multi-threaded inference across multiple GPUs. This would enable parallel processing of streams in real time, similar to how it is done for models like Llama or GPT.
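For illustration, here is a minimal sketch of the kind of model configuration I have in mind, assuming SeamlessM4T is wrapped in Triton's Python backend. The model name, tensor names, and shapes below are placeholders, not an official configuration:

```
# config.pbtxt (hypothetical) for a SeamlessM4T speech-to-text-translation
# model served through Triton's Python backend.
name: "seamless_m4t"
backend: "python"
max_batch_size: 8

input [
  {
    name: "AUDIO"             # mono waveform samples (placeholder name)
    data_type: TYPE_FP32
    dims: [ -1 ]              # variable-length audio
  }
]
output [
  {
    name: "TRANSLATED_TEXT"   # placeholder name
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Two execution instances on each of GPU 0 and GPU 1, so four model
# instances can serve requests concurrently.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# Let Triton group concurrent requests into batches instead of relying on
# a custom client-side batching pipeline.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```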
Describe alternatives you've considered
I've considered using custom pipelines and batching mechanisms, but these approaches are often inefficient or lack sufficient support for this model, leading to increased latency and resource overhead.
Additional context
This feature would be beneficial for production environments where real-time processing and scalable deployment are critical. Providing sample configurations or scripts for Triton users working with multimodal models would be very helpful.
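To make the request concrete, below is a rough sketch of the kind of client script it would help to see documented: fanning audio segments out to Triton over multiple threads so several streams are translated in parallel. It reuses the hypothetical model and tensor names from the configuration sketch above and the standard `tritonclient` HTTP API; it is meant as an illustration under those assumptions, not a validated example.

```python
# Hypothetical parallel-inference client for a SeamlessM4T model served by Triton.
# The model name and tensor names ("seamless_m4t", "AUDIO", "TRANSLATED_TEXT") are
# placeholders matching the config sketch above, not an official interface.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient


def translate_segment(audio: np.ndarray, url: str = "localhost:8000") -> str:
    # One client per thread, since the HTTP client is not guaranteed to be thread-safe.
    client = httpclient.InferenceServerClient(url=url)

    # Batched models expect a leading batch dimension of 1 per request;
    # Triton's dynamic batcher merges concurrent requests server-side.
    inp = httpclient.InferInput("AUDIO", [1, audio.shape[0]], "FP32")
    inp.set_data_from_numpy(audio[np.newaxis, :].astype(np.float32))

    out = httpclient.InferRequestedOutput("TRANSLATED_TEXT")
    result = client.infer("seamless_m4t", inputs=[inp], outputs=[out])

    # String outputs come back as bytes; decode the single element.
    return result.as_numpy("TRANSLATED_TEXT").reshape(-1)[0].decode("utf-8")


if __name__ == "__main__":
    # Stand-ins for real audio streams: eight one-second 16 kHz segments.
    segments = [np.random.randn(16000).astype(np.float32) for _ in range(8)]

    # Issue the requests concurrently so the GPU instances are kept busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for text in pool.map(translate_segment, segments):
            print(text)
```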