❓ [Question] How to decrease the latency of the inference? #2082

Closed
willdone1337 opened this issue Jul 7, 2023 · 2 comments

❓ Question

Hi. I converted PyTorch RetinaFace and ArcFace models to TensorRT with the torch_tensorrt library. Everything works at first, but after some iterations inference stalls and the time to process an image increases sharply (>10x).
The snippet that simulates the inference loop is in the Code section:

Environment

TensorRT Version: 8.4.2
GPU Type: A100
Nvidia Driver Version: 465.19.01
CUDA Version: 11.3
CUDNN Version: 8
Operating System + Version: SLES “15-SP2” on the host machine
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 1.13.0a0+d321be6
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/pytorch:22.08-py3

Code


import torch
import torch_tensorrt
import time

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


retinaface_model = torch.jit.load('../jit_retinaface_trt.torch-tensorrt') 
retinaface_model.eval()
retinaface_model.to(DEVICE)


arcface_model = torch.jit.load('../arcface_bs1_torch.float32.torch-tensorrt')
arcface_model.eval()
arcface_model.to(DEVICE)

retinaface_tensor = torch.rand(1, 3, 360, 640).to(DEVICE)
arcface_tensor = torch.rand(1, 3, 112, 112).to(DEVICE)

for _ in range(100):
    global_start = time.time()
    start_time = time.time()
    with torch.no_grad():
        ret_out = retinaface_model(retinaface_tensor)
    torch.cuda.synchronize()
    end_time = time.time()
    ret_time = end_time - start_time
    start_time = time.time()
    with torch.no_grad():
        arc_out = arcface_model(arcface_tensor)
    torch.cuda.synchronize()
    end_time = time.time()
    arc_time = end_time - start_time
    global_end = time.time()
    global_time = global_end - global_start
    # if global_time > 0.1:
    print(f'ret time is : {ret_time}')
    print(f'arc time is : {arc_time}')
    print(f'global time is : {global_time}')
    print('-'*40)
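
The loop above starts timing at the first iteration and uses `time.time()`. A minimal sketch of a more defensive timing harness follows, assuming untimed warm-up runs and `time.perf_counter()` are acceptable for this measurement. The names `benchmark` and `summarize` are hypothetical helpers, not part of torch_tensorrt, and the sketch only improves the measurement; it does not explain the slowdown.

```python
import statistics
import time
from typing import Callable, Dict, List, Optional


def benchmark(fn: Callable[[], object],
              sync: Optional[Callable[[], None]] = None,
              warmup: int = 10,
              iters: int = 100) -> List[float]:
    """Return per-iteration wall-clock latencies of fn() in seconds."""
    sync = sync or (lambda: None)
    # Warm-up calls run but are not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        sync()  # drain previously queued work before starting the clock
        start = time.perf_counter()
        fn()
        sync()  # wait for this call's work before stopping the clock
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Report median, approximate 99th percentile, and maximum latency."""
    ordered = sorted(samples)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }
```

With the models from the script above, `fn` could be `lambda: arcface_model(arcface_tensor)` called inside `torch.no_grad()`, and `sync` could be `torch.cuda.synchronize`. Comparing `median` against `max` per model shows whether the slow iterations are isolated spikes or a sustained shift.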

Outputs

Normally the output looks like this:
ret time is : 0.0009617805480957031
arc time is : 0.0019981861114501953
global time is : 0.002961874008178711
ret time is : 0.0008959770202636719
arc time is : 0.0019989013671875
global time is : 0.002896547317504883
ret time is : 0.0009148120880126953
arc time is : 0.0020008087158203125
global time is : 0.0029172897338867188
ret time is : 0.0008985996246337891
arc time is : 0.001995086669921875
global time is : 0.002894878387451172
ret time is : 0.00446009635925293
arc time is : 0.002003192901611328
global time is : 0.006464719772338867
ret time is : 0.0009562969207763672
arc time is : 0.0020017623901367188
global time is : 0.0029592514038085938
ret time is : 0.0009098052978515625
arc time is : 0.002006053924560547
global time is : 0.002917051315307617
ret time is : 0.0009250640869140625
arc time is : 0.001997709274291992
global time is : 0.002924203872680664
ret time is : 0.0009291172027587891
arc time is : 0.001995086669921875
global time is : 0.002925395965576172
ret time is : 0.0009377002716064453
arc time is : 0.0020194053649902344
global time is : 0.0029582977294921875
ret time is : 0.0009005069732666016
arc time is : 0.0019958019256591797
global time is : 0.0028977394104003906
ret time is : 0.0009152889251708984
arc time is : 0.001996755599975586
global time is : 0.0029134750366210938
ret time is : 0.0009534358978271484
arc time is : 0.0019991397857666016
global time is : 0.0029540061950683594
ret time is : 0.0009467601776123047
arc time is : 0.0020117759704589844
global time is : 0.002960205078125
ret time is : 0.0008974075317382812
arc time is : 0.0019989013671875
global time is : 0.0028977394104003906
ret time is : 0.0009267330169677734
arc time is : 0.002001523971557617
global time is : 0.0029296875

But after some iterations, the timings look like this:

ret time is : 0.0030410289764404297
arc time is : 0.10997724533081055 <-----
global time is : 0.11302065849304199
ret time is : 0.002657651901245117
arc time is : 0.1075441837310791 <-----
global time is : 0.11020350456237793
ret time is : 0.1104578971862793 <-----
arc time is : 0.0020885467529296875
global time is : 0.1125497817993164
ret time is : 0.11419057846069336 <-----
arc time is : 0.0020301342010498047
global time is : 0.11622214317321777
ret time is : 0.10733747482299805 <-----
arc time is : 0.0020294189453125
global time is : 0.10936880111694336
ret time is : 0.1150820255279541 <-----
arc time is : 0.0020606517791748047
global time is : 0.11714410781860352
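
To see which model is slow in each iteration without reading the log by eye, the printed lines can be parsed and compared against a baseline. This is a rough sketch: `parse_timings` and `find_spikes` are hypothetical helper names, the sample lines are copied from the logs above, and using the fastest observed value as the baseline with a 10x threshold is an assumption, not a recommended rule.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "arc time is : 0.1099 <-----" and ignores the marker.
LINE_RE = re.compile(r"^(ret|arc|global) time is : ([0-9.eE+-]+)")

SAMPLE_LOG = """
ret time is : 0.0009617805480957031
arc time is : 0.0019981861114501953
global time is : 0.002961874008178711
ret time is : 0.0030410289764404297
arc time is : 0.10997724533081055 <-----
global time is : 0.11302065849304199
ret time is : 0.1104578971862793 <-----
arc time is : 0.0020885467529296875
global time is : 0.1125497817993164
"""


def parse_timings(log: str) -> Dict[str, List[float]]:
    """Collect the timings in seconds for each stage, in log order."""
    timings: Dict[str, List[float]] = {"ret": [], "arc": [], "global": []}
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            timings[match.group(1)].append(float(match.group(2)))
    return timings


def find_spikes(values: List[float], baseline: float,
                factor: float = 10.0) -> List[Tuple[int, float]]:
    """Return (iteration index, value) pairs slower than factor * baseline."""
    return [(i, v) for i, v in enumerate(values) if v > factor * baseline]


timings = parse_timings(SAMPLE_LOG)
for stage in ("ret", "arc"):
    baseline = min(timings[stage])  # assumption: fastest run is "normal"
    print(stage, find_spikes(timings[stage], baseline))
```

In the slow logs above the delay of roughly 0.11 s appears in either `ret` or `arc` but not in both within one iteration, which this kind of per-stage listing makes easy to check over a long run.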

I tried raising the GPU clock frequency to the A100 maximum (1410 MHz), but the behavior is the same as at the default (765 MHz).
In real-time processing this happens after 26-28 iterations.
It would be great if you could help fix this. Thanks in advance!
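
One way to check whether the clock setting actually holds while the loop runs is to sample the GPU state with `nvidia-smi` alongside the benchmark. The sketch below assumes `nvidia-smi` is on the PATH and that the listed query fields are available on the installed driver (run `nvidia-smi --help-query-gpu` to confirm; newer drivers may use different names for the throttle field). `build_command`, `parse_rows`, and `sample_gpu_state` are hypothetical helper names.

```python
import csv
import io
import subprocess
from typing import Dict, List, Sequence

# Assumed query fields; verify them with `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = (
    "timestamp",
    "clocks.sm",
    "clocks.mem",
    "temperature.gpu",
    "power.draw",
    "clocks_throttle_reasons.active",
)


def build_command(fields: Sequence[str] = QUERY_FIELDS,
                  gpu_index: int = 0) -> List[str]:
    """Build an nvidia-smi query that prints one CSV row for one GPU."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_rows(csv_text: str,
               fields: Sequence[str] = QUERY_FIELDS) -> List[Dict[str, str]]:
    """Turn the CSV output into one dict per row, keyed by field name."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        dict(zip(fields, (cell.strip() for cell in row)))
        for row in reader if row
    ]


def sample_gpu_state(gpu_index: int = 0) -> List[Dict[str, str]]:
    """Run the query once; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(build_command(gpu_index=gpu_index),
                            capture_output=True, text=True, check=True)
    return parse_rows(result.stdout)
```

Calling `sample_gpu_state()` once per iteration, or from a separate process, and logging it next to the per-model timings would show whether the SM clock or the throttle-reason bitmask changes around the iterations where latency jumps.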

willdone1337 added the "question" label (Further information is requested) on Jul 7, 2023
gs-olive (Collaborator) commented Jul 7, 2023

Thank you for the question. @bowang007 - this may be related to your recent performance work. Could you take a look?


github-actions bot commented Oct 6, 2023

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
