SeamlessM4T on Triton #7740

Open

Interwebart opened this issue Oct 24, 2024 · 0 comments
Is your feature request related to a problem? Please describe.
It's unclear how to run models like SeamlessM4T efficiently through Triton Inference Server. Documentation and examples for setting up parallel, multi-threaded inference for such multimodal models in production environments are scarce.

Describe the solution you'd like
I would like to see a detailed guide, or native support in Triton, for deploying SeamlessM4T with optimized multi-threaded inference across multiple GPUs. This would enable real-time parallel processing of streams, similar to the established workflows for LLMs such as Llama or GPT.
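For reference, here is a minimal sketch of the kind of configuration I have in mind. It assumes a Python-backend wrapper (Triton has no native SeamlessM4T backend), and the model name, tensor names, and shapes are hypothetical:

```
# config.pbtxt -- hypothetical configuration for a Python-backend
# wrapper around SeamlessM4T; names and shapes are illustrative only.
name: "seamless_m4t"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT_AUDIO"
    data_type: TYPE_FP32
    dims: [ -1 ]      # variable-length 16 kHz mono waveform
  }
]
output [
  {
    name: "TRANSLATED_TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Run two instances per GPU across two GPUs for parallel streams.
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0, 1 ] }
]

# Let Triton batch concurrent requests to improve GPU utilization.
dynamic_batching {
  max_queue_delay_microseconds: 5000
}
```

The `instance_group` and `dynamic_batching` settings are standard Triton mechanisms; what is missing is guidance on how to tune them for a large multimodal model like SeamlessM4T.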

Describe alternatives you've considered
I've considered using custom pipelines and batching mechanisms, but these approaches are often inefficient or lack sufficient support for this model, leading to increased latency and resource overhead.
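To make the custom-pipeline alternative concrete, below is a rough sketch of what such a Python-backend `model.py` could look like, using the Hugging Face transformers implementation of SeamlessM4T for speech-to-text translation. The checkpoint and tensor names are assumptions matching the config sketch above; multi-sample batching and error handling are omitted:

```python
# model.py -- hypothetical Triton Python-backend wrapper for SeamlessM4T
# (speech-to-text translation). Tensor names match the config.pbtxt sketch.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText


class TritonPythonModel:
    def initialize(self, args):
        # Each model instance gets its own copy; Triton places instances
        # on GPUs according to instance_group in config.pbtxt.
        device_id = args["model_instance_device_id"]
        self.device = torch.device(f"cuda:{device_id}")
        self.processor = AutoProcessor.from_pretrained(
            "facebook/seamless-m4t-v2-large")
        self.model = SeamlessM4Tv2ForSpeechToText.from_pretrained(
            "facebook/seamless-m4t-v2-large").to(self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            audio = pb_utils.get_input_tensor_by_name(
                request, "INPUT_AUDIO").as_numpy()
            # Assumes a single waveform per request; a real pipeline
            # would handle padded batches here.
            inputs = self.processor(
                audios=audio.squeeze(), sampling_rate=16000,
                return_tensors="pt").to(self.device)
            with torch.inference_mode():
                tokens = self.model.generate(**inputs, tgt_lang="eng")
            text = self.processor.decode(
                tokens[0].tolist(), skip_special_tokens=True)
            out = pb_utils.Tensor(
                "TRANSLATED_TEXT",
                np.array([text.encode("utf-8")], dtype=object))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

Wrappers like this work, but without official guidance it is hard to know whether the generate-per-request loop, the batching strategy, or the instance placement is anywhere near optimal for this model.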

Additional context
This feature would be beneficial for production environments where real-time processing and scalable deployment are critical. Providing sample configurations or scripts for Triton users working with multimodal models would be very helpful.
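Even sample material as simple as the following client snippet would already help (hypothetical names matching the sketches above; assumes Triton's HTTP endpoint on the default port):

```python
# client.py -- hypothetical client for the sketched deployment.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One second of silence at 16 kHz stands in for a real recording.
audio = np.zeros((1, 16000), dtype=np.float32)

inp = httpclient.InferInput("INPUT_AUDIO", audio.shape, "FP32")
inp.set_data_from_numpy(audio)

result = client.infer(model_name="seamless_m4t", inputs=[inp])
print(result.as_numpy("TRANSLATED_TEXT")[0].decode("utf-8"))
```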
