
Running multi-gpu and replicating models #7737

Open
JoJoLev opened this issue Oct 24, 2024 · 1 comment
JoJoLev commented Oct 24, 2024

I currently have an LLM engine built on TensorRT-LLM and am trying to evaluate different deployment setups and the gains each provides.
I was trying to deploy the Llama model across multiple GPUs, so that each of the 4 GPUs runs its own copy of the model.
Is this possible with the NVIDIA Triton inference container?

rmccorm4 (Collaborator) commented:
Hi @JoJoLev, there is a guide for this exact use case here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md. Please let us know if this helps.
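For context, the linked guide builds on Triton's generic `instance_group` mechanism (and, for tensor-parallel engines, an orchestrator mode described there). Below is a minimal sketch of the relevant `config.pbtxt` section, assuming a single-GPU (TP=1) Llama engine and four visible GPUs with illustrative device IDs 0-3; the guide remains the authoritative reference for the TensorRT-LLM backend specifics:

```
instance_group [
  {
    # One instance per listed GPU: four copies of the model in total.
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```

With this layout Triton schedules incoming requests across the four instances, so each GPU serves its own replica of the model rather than sharding a single model across devices.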

rmccorm4 self-assigned this Oct 29, 2024
rmccorm4 added the `question` (Further information is requested) label Oct 29, 2024