
🐛 [Bug] Detectron2 with TensorRT is slower than vanilla Detectron2 #2098

Closed
fumin opened this issue Jul 11, 2023 · 8 comments
Labels: bug (Something isn't working), No Activity

Comments

@fumin

fumin commented Jul 11, 2023

Description

Detectron2 with TensorRT is slower than vanilla Detectron2 out of the box

Environment

TensorRT Version: 8.6.1.6-1+cuda11.8

NVIDIA GPU: NVIDIA GeForce RTX 3060

NVIDIA Driver Version: 530.41.03

CUDA Version: 11.8

CUDNN Version: 8.9.2

Operating System:

Python Version (if applicable): 3.8.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.0.1+cu118

Baremetal or Container (if so, version):

Relevant Files

track.zip

Model link:

Steps To Reproduce

Unzip the file above, which results in a single reproducible Python script.
Run it with python track.py.
By default it does not use TensorRT, and it will print:

topunion@topunion-MS-7C96:~/a/count$ CUDA_MODULE_LOADING=LAZY python track.py 
2023-07-09 19:28:23.694 /usr/local/detectron2/detectron2/checkpoint/detection_checkpoint.py:38 [DetectionCheckpointer] Loading from model_best.pth ...
2023-07-09 19:28:23.694 /home/topunion/.local/lib/python3.8/site-packages/fvcore/common/checkpoint.py:150 [Checkpointer] Loading from model_best.pth ...
/home/topunion/.local/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2023-07-09 19:28:26.304 track.py:579 fps 9.435345
2023-07-09 19:28:28.331 track.py:579 fps 10.313857
2023-07-09 19:28:30.330 track.py:579 fps 10.481426
2023-07-09 19:28:32.308 track.py:579 fps 10.584270
2023-07-09 19:28:34.293 track.py:579 fps 10.557841
2023-07-09 19:28:36.281 track.py:579 fps 10.538757
2023-07-09 19:28:38.255 track.py:579 fps 10.603043
2023-07-09 19:28:40.212 track.py:579 fps 10.709509

Now comment out line 504 in track.py; this makes the script compile the model with TensorRT and run the compiled TensorRT model.
It will print something like:

topunion@topunion-MS-7C96:~/a/count$ CUDA_MODULE_LOADING=LAZY python track.py 
2023-07-09 19:39:21.380 /usr/local/detectron2/detectron2/checkpoint/detection_checkpoint.py:38 [DetectionCheckpointer] Loading from model_best.pth ...
2023-07-09 19:39:21.380 /home/topunion/.local/lib/python3.8/site-packages/fvcore/common/checkpoint.py:150 [Checkpointer] Loading from model_best.pth ...
/home/topunion/.local/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2023-07-09 19:39:26.612 track.py:579 fps 4.691342
2023-07-09 19:39:28.875 track.py:579 fps 9.199055
2023-07-09 19:39:31.101 track.py:579 fps 9.364761
2023-07-09 19:39:33.329 track.py:579 fps 9.342092
2023-07-09 19:39:35.569 track.py:579 fps 9.312849
2023-07-09 19:39:37.828 track.py:579 fps 9.233296

As you can see, whereas vanilla PyTorch runs at 10.5 fps, TensorRT runs at only 9.3 fps!
How can compiled TensorRT be slower than barebones PyTorch?!

Commands or scripts:

Have you tried the latest release?:

Yes, mine is the latest version.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

Yes, as far as I know, torch_tensorrt utilizes ONNX in an intermediary phase, so it's possible to extract the ONNX model from my script above.
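
For reference, a minimal sketch of the generic export-and-check workflow the template question asks about (the resnet50 placeholder and shapes below stand in for the actual model, which would need Detectron2's own export tooling):

import torch
import torchvision

# Placeholder model; the real Detectron2 model needs its own export path.
model = torchvision.models.resnet50(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# Export to ONNX, then benchmark with: polygraphy run model.onnx --onnxrt
torch.onnx.export(model, example, "model.onnx", opset_version=17)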

P.S. This was previously filed as NVIDIA/TensorRT#3116, but folks over there suggested reporting this to Torch-TensorRT, too.

@fumin fumin added the bug Something isn't working label Jul 11, 2023
@narendasan
Collaborator

> Yes, as far as I know, torch_tensorrt utilizes ONNX in an intermediary phase, so it's possible to extract the ONNX model from my script above.

FYI, Torch-TensorRT does not use ONNX.

As to the speed of PyTorch vs TensorRT, it would be good to know how much of the model is actually being converted. Printing the debug logs will tell you (you should also be able to print the compiled graph, det2.graph, and it will show what is left in PyTorch and what is in TRT).
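
For example, a minimal sketch (assuming det2 is the module returned by torch_tensorrt.compile):

# The compiled module is a TorchScript module; its graph shows which
# segments stayed in PyTorch and which were replaced by TensorRT engine calls.
print(det2.graph)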

@fumin
Author

fumin commented Jul 12, 2023

I set Python's logger level to DEBUG and tried to print the compiled model, but the only output I see is

2023-07-12 20:05:35.409 track_tensorrt.py:531 RecursiveScriptModule(original_name=Detectron2_trt)

@narendasan can you be more specific (ideally in code) on how to print debug logs?

@narendasan
Collaborator

import torch_tensorrt
...
# Compiling inside the debug logging context dumps partitioning and
# conversion details (which segments go to TRT and which stay in Torch).
with torch_tensorrt.logging.debug():
    trt_module = torch_tensorrt.compile(my_module, ...)
results = trt_module(input_tensors)

@fumin
Author

fumin commented Jul 13, 2023

@narendasan Thanks for the detailed instructions, I was able to dump the logs, which are attached below:
trt.log.tar.gz

The logs are huge, and I can see TensorRT doing a ton of work.
However, given the sheer size, I am a bit lost as to where to start debugging the performance problem.
Do you have any suggested next steps? Do you see anything interesting in the logs?

@narendasan
Collaborator

Yeah, so the reason I asked for the logs is that the first thing to look at regarding performance with Torch-TensorRT is how much the graph is getting cut up. Looking here, there are upwards of 130 "blocks", or graph breaks. The more switches between PyTorch and TensorRT, the worse the performance. The graph gets cut up based on what Torch-TensorRT's converter library supports, so implementing support for key ops typically improves performance (see the sketch after the graph dump below).

The next thing I see is that there are some ops TRT just won't be able to run, e.g. requires_grad, and this causes some dependent ops to stay in Torch even where we do have support.

@bowang007 Any idea if we can get more of the torch ops in this graph to run in trt? In particular graphs like this:

INFO: [Torch-TensorRT] - Block segment:Segment Block @0:
    Target: Torch

    Graph: graph(%1 : Tensor):
  %self.backbone.bottom_up.stages.2.4.conv2.norm.weight : Float(256, strides=[1], requires_grad=0, device=cuda:0) = prim::Constant[value=<Tensor>]()
  %self.backbone.bottom_up.stages.2.4.conv2.norm.bias : Float(256, strides=[1], requires_grad=0, device=cuda:0) = prim::Constant[value=<Tensor>]()
  %self.backbone.bottom_up.stages.2.4.conv2.norm.running_mean : Float(256, strides=[1], requires_grad=0, device=cuda:0) = prim::Constant[value=<Tensor>]()
  %self.backbone.bottom_up.stages.2.4.conv2.norm.running_var : Float(256, strides=[1], requires_grad=0, device=cuda:0) = prim::Constant[value=<Tensor>]()
  %6 : bool = prim::Constant[value=0]()
  %7 : float = prim::Constant[value=0.10000000000000001]()
  %self.backbone.bottom_up.stages.3.2.conv3.norm.eps.27 : float = prim::Constant[value=1.0000000000000001e-05]()
  %9 : bool = prim::Constant[value=1]() # /usr/local/detectron2/detectron2/modeling/backbone/fpn.py:168:15
  %0 : Tensor = aten::batch_norm(%1, %self.backbone.bottom_up.stages.2.4.conv2.norm.weight, %self.backbone.bottom_up.stages.2.4.conv2.norm.bias, %self.backbone.bottom_up.stages.2.4.conv2.norm.running_mean, %self.backbone.bottom_up.stages.2.4.conv2.norm.running_var, %6, %7, %self.backbone.bottom_up.stages.3.2.conv3.norm.eps.27, %9) # /home/topunion/.local/lib/python3.8/site-packages/torch/nn/functional.py:2450:11
  return (%0)
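
For illustration, a minimal sketch of the partitioning-related settings torch_tensorrt.compile accepts in the TorchScript path (the resnet18 placeholder, input shape, and values below are only examples, not tuned for this model):

import torch
import torchvision
import torch_tensorrt

# Placeholder model standing in for the Detectron2 module.
model = torchvision.models.resnet18(weights=None).eval().cuda()

# Placeholder input shape; use the real preprocessed resolution.
inputs = [torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)]

trt_module = torch_tensorrt.compile(
    model,
    inputs=inputs,
    enabled_precisions={torch.float32},
    # Only turn segments with at least this many supported ops into a TRT
    # engine, which avoids a long chain of tiny engines with costly
    # Torch<->TRT handoffs.
    min_block_size=5,
    # Set to True to fail at compile time on unsupported ops instead of
    # silently falling back to PyTorch (useful for finding missing converters).
    require_full_compilation=False,
)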

@fumin
Author

fumin commented Jul 18, 2023

@narendasan thanks for your great analysis! This makes a lot of things clearer!

Regarding requires_grad, I see that in the logs all 3021 instances have requires_grad=0, and 0 means the gradient is actually not needed, right? In fact, on line 54 of my script I explicitly set with torch.no_grad():, so I'm wondering why gradients would be needed at all.

In addition, in the graph you quoted, it seems you are pointing at the backbone, and the backbone, as far as I know, should be a pretty vanilla, no-frills ResNet:

https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/resnet.py#L362

Given that ResNet is pretty old stuff, I'm surprised the module isn't compiled into one big single graph, unless the detectron2 folks are doing something wrong or weird?
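
For what it's worth, a minimal sketch of the inference-mode setup being described (the tiny Sequential model below is a placeholder for the module in track.py; these are standard PyTorch calls and may not by themselves change the partitioning):

import torch

# Placeholder model standing in for the Detectron2 module in track.py.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))

model.eval()                      # inference behavior for batch norm / dropout
for p in model.parameters():
    p.requires_grad_(False)       # matches the requires_grad=0 seen in the log

# Sanity check: no parameter should report requires_grad=True.
assert not any(p.requires_grad for p in model.parameters())

with torch.no_grad():             # same guard as line 54 of track.py
    _ = model(torch.randn(1, 3, 32, 32))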

@bowang007
Collaborator

bowang007 commented Jul 19, 2023

Hi @fumin, thanks for sharing the log.
According to the log, there are a lot of conditionals in the model. Currently, in the TorchScript path, we convert the supported operations inside each conditional block into a TensorRT engine. Since there are not many operations in each conditional, this results in a lot of TensorRT engines that each only run a few operations in TensorRT.

So, basically, after compilation most of this model still runs in Torch, and I think that's the reason why you observed a slowdown.
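
To illustrate the effect with a toy example (not the Detectron2 model): scripting keeps a conditional as a prim::If node, which forces a graph break around it, whereas tracing with a representative input bakes in the branch that was taken.

import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        if x.shape[0] > 1:            # shape-dependent conditional
            return torch.relu(x)
        return torch.sigmoid(x)

m = Toy().eval()
x = torch.randn(2, 8)

scripted = torch.jit.script(m)        # keeps the branch as prim::If
traced = torch.jit.trace(m, x)        # specializes to the branch taken for x

print(scripted.graph)
print(traced.graph)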

@fumin fumin changed the title from "🐛 [Bug] Encountered bug when using Torch-TensorRT" to "🐛 [Bug] Detectron2 with TensorRT is slower than vanilla Detectron2" on Aug 17, 2023

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
