Description
I converted an ONNX model to TensorRT for two GPUs: an RTX 3090 and an RTX 4090. I want to load it into the Triton Inference Server on both GPUs as one instance.
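For context, the engines were built with trtexec along these lines (a sketch, not my exact command; paths and shapes are illustrative):

# Illustrative build of a TensorRT plan from the ONNX export;
# the real invocation sets the model's actual input profile.
trtexec --onnx=glide_image_encoder.onnx \
        --fp16 \
        --saveEngine=model.plan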
Triton Information
I am using Triton container version 24.06.
I use the GLIDE CLIP model from https://github.com/openai/glide-text2im.
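My model config.pbtxt is a standard tensorrt_plan setup; a minimal sketch of its head (only the model name is taken from the logs, the rest is assumed):

name: "glide_image_encoder_fp16_rt"
platform: "tensorrt_plan"
max_batch_size: 4

While loading the model I get the following error log: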
I1030 14:30:44.654570 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 14:30:44.676360 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 14:30:44.676408 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 14:30:44.676413 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 14:30:44.676418 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 14:30:44.682617 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 14:30:45.002033 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.011110 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.075364 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 14:30:45.373381 1 logging.cc:46] "Loaded engine size: 171 MiB"
W1030 14:30:45.373517 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:45.433014 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:45.433062 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:45.433085 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:45.433089 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 14:30:45.686459 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 14:30:45.686931 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 14:30:45.687262 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 14:30:45.995272 1 logging.cc:46] "Loaded engine size: 173 MiB"
W1030 14:30:45.998385 1 logging.cc:43] "Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
I1030 14:30:46.055442 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 14:30:46.055483 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 14:30:46.055488 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 14:30:46.055493 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 14:30:46.088783 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"
I1030 14:30:46.088827 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 14:30:46.088866 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([version.cpp:operator():80] Compiled assuming that device 0 was SM 86, but device 0 is SM 89.)"
The model loaded successfully on only one of the two GPUs.
My nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 0% 30C P8 21W / 450W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:02:00.0 Off | N/A |
| 0% 39C P8 29W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I also tried converting my ONNX model with the --hardwareCompatibilityLevel=ampere+ flag, hoping that a hardware-compatible engine would run on both GPUs at once.
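Roughly (a sketch; the flag is the one I used, the other arguments are illustrative):

# Rebuild with Ampere-forward hardware compatibility so one plan
# can, in theory, run on both SM 86 and SM 89.
trtexec --onnx=glide_image_encoder.onnx \
        --fp16 \
        --hardwareCompatibilityLevel=ampere+ \
        --saveEngine=model.plan

However, during loading I got another error: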
I1030 15:18:14.889380 1 model_lifecycle.cc:472] "loading: glide_image_encoder_fp16_rt:1"
I1030 15:18:14.913879 1 tensorrt.cc:65] "TRITONBACKEND_Initialize: tensorrt"
I1030 15:18:14.913920 1 tensorrt.cc:75] "Triton TRITONBACKEND API version: 1.19"
I1030 15:18:14.913925 1 tensorrt.cc:81] "'tensorrt' TRITONBACKEND API version: 1.19"
I1030 15:18:14.913929 1 tensorrt.cc:105] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1030 15:18:14.920708 1 tensorrt.cc:231] "TRITONBACKEND_ModelInitialize: glide_image_encoder_fp16_rt (version 1)"
I1030 15:18:15.279233 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.369500 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 0)"
I1030 15:18:15.705852 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:15.779282 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:15.779323 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:15.779341 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:15.779346 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
I1030 15:18:16.067141 1 logging.cc:46] "[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +607, now: CPU 0, GPU 772 (MiB)"
I1030 15:18:16.067693 1 instance_state.cc:186] "Created instance glide_image_encoder_fp16_rt_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];"
I1030 15:18:16.068063 1 tensorrt.cc:297] "TRITONBACKEND_ModelInstanceInitialize: glide_image_encoder_fp16_rt_0_0 (GPU device 1)"
I1030 15:18:16.407244 1 logging.cc:46] "Loaded engine size: 179 MiB"
I1030 15:18:16.480263 1 logging.cc:46] "[MS] Running engine with multi stream info"
I1030 15:18:16.480307 1 logging.cc:46] "[MS] Number of aux streams is 2"
I1030 15:18:16.480312 1 logging.cc:46] "[MS] Number of total worker streams is 3"
I1030 15:18:16.480316 1 logging.cc:46] "[MS] The main stream provided by execute/enqueue calls is the first worker stream"
E1030 15:18:16.511085 1 logging.cc:40] "ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511135 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E1030 15:18:16.511190 1 backend_model.cc:692] "ERROR: Failed to create instance: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.511488 1 tensorrt.cc:353] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1030 15:18:16.546621 1 tensorrt.cc:274] "TRITONBACKEND_ModelFinalize: delete model state"
E1030 15:18:16.559666 1 logging.cc:40] "IRuntime::~IRuntime: Error Code 3: API Usage Error (Parameter check failed, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.)"
E1030 15:18:16.559755 1 model_lifecycle.cc:641] "failed to load 'glide_image_encoder_fp16_rt' version 1: Internal: unable to create TensorRT context: ICudaEngine::createExecutionContext: Error Code 1: Myelin ([cudamod.cpp:CUDAMod:32] CUDA error 300 loading a module.)"
I1030 15:18:16.559775 1 model_lifecycle.cc:776] "failed to load 'glide_image_encoder_fp16_rt'"
However, if I add a gpus: [ 0 ] or gpus: [ 1 ] line to the instance group, which specifies exactly which GPU to load the weights on, the model loads successfully:
instance_group [
{
kind: KIND_GPU
gpus: [ 0 ]
}
]
Expected behavior
I would like the model to load successfully on both GPUs as one instance, while keeping the following directory hierarchy:
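That is, the usual single-plan layout (a sketch of the standard Triton tree I mean, with one model.plan shared by all instances):

model_repository/
└── glide_image_encoder_fp16_rt/
    ├── config.pbtxt
    └── 1/
        └── model.plan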
I don't want to create a new instance of the model that explicitly specifies which GPU to load the weights on.
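In other words, I want to avoid something like two pinned copies of the model, one per architecture (sketch; the _sm86/_sm89 names are hypothetical):

glide_image_encoder_fp16_rt_sm89/   # plan built for the RTX 4090, gpus: [ 0 ]
glide_image_encoder_fp16_rt_sm86/   # plan built for the RTX 3090, gpus: [ 1 ]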