
Running multi-gpu and replicating models #7737

Open
JoJoLev opened this issue Oct 24, 2024 · 1 comment
JoJoLev commented Oct 24, 2024

I currently have an LLM engine built on TensorRT-LLM and am trying to evaluate different deployment setups and the gains each provides.
I was trying to deploy the Llama model across multiple GPUs, so that each of the 4 GPUs runs its own copy of the model.
Is this possible with the NVIDIA Triton inference container?

rmccorm4 (Collaborator) commented:
Hi @JoJoLev, there is a guide for this exact use case here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md. Please let us know if this helps.
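For context, the linked guide builds on Triton's generic `instance_group` mechanism (and, for tensor-parallel engines, an orchestrator mode described there). Below is a minimal sketch of the relevant `config.pbtxt` section, assuming a single-GPU (TP=1) Llama engine and four visible GPUs with illustrative device IDs 0-3; the guide remains the authoritative reference for the TensorRT-LLM backend specifics:

```
instance_group [
  {
    # One instance per listed GPU: four copies of the model in total.
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```

With this layout Triton schedules incoming requests across the four instances, so each GPU serves its own replica of the model rather than sharding a single model across devices.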

rmccorm4 self-assigned this Oct 29, 2024
rmccorm4 added the `question` (Further information is requested) label Oct 29, 2024