Unable to simultaneously load TensorRT model.plan on different GPUs in Triton Inference Server in the same instance #7755

Open
AntnvSergey opened this issue Oct 30, 2024 · 0 comments

Description
I converted an ONNX model to TensorRT for two GPUs: an RTX 3090 and an RTX 4090. I want to load it into Triton Inference Server on both GPUs (RTX 3090 and RTX 4090) from a single model entry.
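
Each .plan was built separately for its target GPU. Roughly, the builds looked like the following (a sketch; selecting the GPU via CUDA_VISIBLE_DEVICES and the device indices are assumptions based on the nvidia-smi output below, and the shapes match the trtexec command shown further down):

# Built on the RTX 3090 (SM 8.6), CUDA device 1 in this machine
CUDA_VISIBLE_DEVICES=1 /usr/src/tensorrt/bin/trtexec --onnx=./model.onnx \
    --minShapes=images:1x3x64x64,timestep:1 \
    --optShapes=images:4x3x64x64,timestep:4 \
    --maxShapes=images:128x3x64x64,timestep:128 \
    --fp16 --saveEngine=model_3090.plan

# Built on the RTX 4090 (SM 8.9), CUDA device 0 in this machine
CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --onnx=./model.onnx \
    --minShapes=images:1x3x64x64,timestep:1 \
    --optShapes=images:4x3x64x64,timestep:4 \
    --maxShapes=images:128x3x64x64,timestep:128 \
    --fp16 --saveEngine=model_4090.plan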

Triton Information
I am using the 24.06 Triton container.

I use the GLIDE CLIP model from https://github.com/openai/glide-text2im.

My model config.pbtxt:

name: "glide_image_encoder_fp16_rt"
platform: "tensorrt_plan"
max_batch_size : 128
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 64, 64]
  },
  {
    name: "timestep"
    data_type: TYPE_INT64
    dims: [1]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "embeddings"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]
dynamic_batching { 
  max_queue_delay_microseconds: 100 
  default_queue_policy {
      timeout_action: 0
      default_timeout_microseconds: 15000000
  }
}
cc_model_filenames [
  {
    key: "8.6"
    value: "model_3090.plan"
  },
  {
    key: "8.9"
    value: "model_4090.plan"
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]

While loading the model, I get the following error log:

I1030 14:30:44.654570 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 14:30:44.676360 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 14:30:44.676408 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 14:30:44.676413 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 14:30:44.676418 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 14:30:44.682617 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 14:30:45.002033 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.011110 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.075364 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 14:30:45.373381 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.373517 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.433014 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:45.433062 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:45.433085 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:45.433089 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 14:30:45.686459 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 14:30:45.686931 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 14:30:45.687262 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 14:30:45.995272 1 logging.cc:46] "Loaded engine size: 173 MiB"
W1030 14:30:45.998385 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:46.055442 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:46.055483 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:46.055488 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:46.055493 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 14:30:46.088783 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"
I1030 14:30:46.088827 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 14:30:46.088866 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"

The model successfully loads on only one of the two GPUs.

My nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   30C    P8             21W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   39C    P8             29W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
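
The cc_model_filenames keys correspond to these GPUs' CUDA compute capabilities: 8.9 for the RTX 4090 (device 0) and 8.6 for the RTX 3090 (device 1). For reference, this can be confirmed with, for example:

nvidia-smi --query-gpu=index,name,compute_cap --format=csv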

I tried converting my ONNX model with the --hardwareCompatibilityLevel=ampere+ flag, hoping that a single engine would work on both GPUs at once:

/usr/src/tensorrt/bin/trtexec --onnx=./model.onnx --maxShapes=images:128x3x64x64,timestep:128 --minShapes=images:1x3x64x64,timestep:1 --optShapes=images:4x3x64x64,timestep:4 --saveEngine=model_fp16.plan --fp16 --hardwareCompatibilityLevel=ampere+
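
A hardware-compatible engine built this way should, in principle, deserialize on both the Ampere and the Ada GPU. Outside of Triton it can be sanity-checked on each device with something like the following (illustrative only, not taken from the logs above):

# Try deserializing and running the engine on each GPU directly with trtexec
CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --loadEngine=model_fp16.plan --shapes=images:1x3x64x64,timestep:1
CUDA_VISIBLE_DEVICES=1 /usr/src/tensorrt/bin/trtexec --loadEngine=model_fp16.plan --shapes=images:1x3x64x64,timestep:1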

However, when loading it in Triton, I got another error:

I1030 15:18:14.889380 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 15:18:14.913879 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 15:18:14.913920 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 15:18:14.913925 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 15:18:14.913929 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 15:18:14.920708 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 15:18:15.279233 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.369500 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 15:18:15.705852 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.779282 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:15.779323 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:15.779341 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:15.779346 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 15:18:16.067141 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 15:18:16.067693 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 15:18:16.068063 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 15:18:16.407244 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:16.480263 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:16.480307 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:16.480312 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:16.480316 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 15:18:16.511085 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511135 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 15:18:16.511190 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511488 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1030 15:18:16.546621 1 tensorrt.cc:274] "TRITONBACKEND_ModelFinalize: delete model state"
E1030 15:18:16.559666 1 logging.cc:40] "IRuntime::~IRuntime: Error Code 3: API Usage Error (Parameter check failed, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.)"
E1030 15:18:16.559755 1 model_lifecycle.cc:641] "failed to load 'glide_image_encoder_fp16_rt' version 1: Internal: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.559775 1 model_lifecycle.cc:776] "failed to load 'glide_image_encoder_fp16_rt'"

However, if I add a gpus: [ 0 ] or gpus: [ 1 ] line to the instance_group, explicitly specifying which GPU to load the weights on, the model loads successfully:

instance_group [
  {
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Expected behavior
I would like the model to load successfully on both GPUs as a single model entry. I want to keep the following directory hierarchy:

models/
    glide_image_encoder_fp16_rt/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt

I don't want to create a separate copy of the model for each GPU, with each copy explicitly specifying which GPU to load its weights on:

models/
    glide_image_encoder_fp16_rt_3090/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt
    glide_image_encoder_fp16_rt_4090/
        1
            model_3090.plan
            model_4090.plan
        config.pbtxt
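
In that layout, each copy's config.pbtxt would presumably look something like this (a sketch of what I want to avoid; the gpus index and default_model_filename values are only illustrative):

# config.pbtxt for glide_image_encoder_fp16_rt_3090 (the RTX 3090 is device 1 here)
name: "glide_image_encoder_fp16_rt_3090"
platform: "tensorrt_plan"
max_batch_size: 128
default_model_filename: "model_3090.plan"
# inputs, outputs and dynamic_batching identical to the config above, omitted for brevity
instance_group [
  {
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]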