Unable to simultaneously load TensorRT model.plan on different GPUs in Triton Inference Server in the same instance #7755

Open
AntnvSergey opened this issue Oct 30, 2024 · 0 comments

Description
I converted an ONNX model to TensorRT for two GPUs: an RTX 3090 and an RTX 4090. I want to load it into Triton Inference Server on both GPUs (RTX 3090 and RTX 4090) from a single model entry.
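
Each .plan was built separately for its target GPU. Roughly, the builds looked like the following (a sketch; selecting the GPU via CUDA_VISIBLE_DEVICES and the device indices are assumptions based on the nvidia-smi output below, and the shapes match the trtexec command shown further down):

# Built on the RTX 3090 (SM 8.6), CUDA device 1 in this machine
CUDA_VISIBLE_DEVICES=1 /usr/src/tensorrt/bin/trtexec --onnx=./model.onnx \
    --minShapes=images:1x3x64x64,timestep:1 \
    --optShapes=images:4x3x64x64,timestep:4 \
    --maxShapes=images:128x3x64x64,timestep:128 \
    --fp16 --saveEngine=model_3090.plan

# Built on the RTX 4090 (SM 8.9), CUDA device 0 in this machine
CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --onnx=./model.onnx \
    --minShapes=images:1x3x64x64,timestep:1 \
    --optShapes=images:4x3x64x64,timestep:4 \
    --maxShapes=images:128x3x64x64,timestep:128 \
    --fp16 --saveEngine=model_4090.plan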

Triton Information
I am using the 24.06 Triton container.

I use the GLIDE CLIP model from https://github.com/openai/glide-text2im.

My model config.pbtxt:

name: "glide_image_encoder_fp16_rt"
platform: "tensorrt_plan"
max_batch_size : 128
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 64, 64]
  },
  {
    name: "timestep"
    data_type: TYPE_INT64
    dims: [1]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "embeddings"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]
dynamic_batching { 
  max_queue_delay_microseconds: 100 
  default_queue_policy {
      timeout_action: 0
      default_timeout_microseconds: 15000000
  }
}
cc_model_filenames [
  {
    key: "8.6"
    value: "model_3090.plan"
  },
  {
    key: "8.9"
    value: "model_4090.plan"
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]

While loading the model, I get the following error log:

I1030 14:30:44.654570 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 14:30:44.676360 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 14:30:44.676408 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 14:30:44.676413 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 14:30:44.676418 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 14:30:44.682617 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 14:30:45.002033 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.011110 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.075364 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 14:30:45.373381 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.373517 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.433014 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:45.433062 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:45.433085 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:45.433089 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 14:30:45.686459 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 14:30:45.686931 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 14:30:45.687262 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 14:30:45.995272 1 logging.cc:46] "Loaded engine size: 173 MiB"
W1030 14:30:45.998385 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:46.055442 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:46.055483 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:46.055488 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:46.055493 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 14:30:46.088783 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"
I1030 14:30:46.088827 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 14:30:46.088866 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"

The model successfully loads on only one of the two GPUs.

My nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   30C    P8             21W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   39C    P8             29W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
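
The cc_model_filenames keys correspond to these GPUs' CUDA compute capabilities: 8.9 for the RTX 4090 (device 0) and 8.6 for the RTX 3090 (device 1). For reference, this can be confirmed with, for example:

nvidia-smi --query-gpu=index,name,compute_cap --format=csv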

I tried converting my ONNX model with the --hardwareCompatibilityLevel=ampere+ flag, hoping that a single engine would work on both GPUs at once:

/usr/src/tensorrt/bin/trtexec --onnx=./model.onnx --maxShapes=images:128x3x64x64,timestep:128 --minShapes=images:1x3x64x64,timestep:1 --optShapes=images:4x3x64x64,timestep:4 --saveEngine=model_fp16.plan --fp16 --hardwareCompatibilityLevel=ampere+
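
A hardware-compatible engine built this way should, in principle, deserialize on both the Ampere and the Ada GPU. Outside of Triton it can be sanity-checked on each device with something like the following (illustrative only, not taken from the logs above):

# Try deserializing and running the engine on each GPU directly with trtexec
CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --loadEngine=model_fp16.plan --shapes=images:1x3x64x64,timestep:1
CUDA_VISIBLE_DEVICES=1 /usr/src/tensorrt/bin/trtexec --loadEngine=model_fp16.plan --shapes=images:1x3x64x64,timestep:1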

However, when loading it in Triton, I got another error:

I1030 15:18:14.889380 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 15:18:14.913879 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 15:18:14.913920 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 15:18:14.913925 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 15:18:14.913929 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 15:18:14.920708 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 15:18:15.279233 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.369500 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 15:18:15.705852 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.779282 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:15.779323 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:15.779341 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:15.779346 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 15:18:16.067141 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 15:18:16.067693 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 15:18:16.068063 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 15:18:16.407244 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:16.480263 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:16.480307 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:16.480312 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:16.480316 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 15:18:16.511085 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511135 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 15:18:16.511190 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511488 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1030 15:18:16.546621 1 tensorrt.cc:274] "TRITONBACKEND_ModelFinalize: delete model state"
E1030 15:18:16.559666 1 logging.cc:40] "IRuntime::~IRuntime: Error Code 3: API Usage Error (Parameter check failed, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.)"
E1030 15:18:16.559755 1 model_lifecycle.cc:641] "failed to load 'glide_image_encoder_fp16_rt' version 1: Internal: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.559775 1 model_lifecycle.cc:776] "failed to load 'glide_image_encoder_fp16_rt'"

However, if I add a gpus: [ 0 ] or gpus: [ 1 ] line to the instance_group, explicitly specifying which GPU to load the weights on, the model loads successfully:

instance_group [
  {
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Expected behavior
I would like the model to load successfully on both GPUs as a single model entry. I want to keep the following directory hierarchy:

models/
    glide_image_encoder_fp16_rt/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt

I don't want to create a separate copy of the model for each GPU, with each copy explicitly specifying which GPU to load its weights on:

models/
    glide_image_encoder_fp16_rt_3090/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt
    glide_image_encoder_fp16_rt_4090/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt
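
In that layout, each copy's config.pbtxt would presumably look something like this (a sketch of what I want to avoid; the gpus index and default_model_filename values are only illustrative):

# config.pbtxt for glide_image_encoder_fp16_rt_3090 (the RTX 3090 is device 1 here)
name: "glide_image_encoder_fp16_rt_3090"
platform: "tensorrt_plan"
max_batch_size: 128
default_model_filename: "model_3090.plan"
# inputs, outputs and dynamic_batching identical to the config above, omitted for brevity
instance_group [
  {
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]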