
🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck; bug caused by elimination exception #1560

Closed
bjaeger1 opened this issue Dec 19, 2022 · 31 comments
Labels: bug (Something isn't working), No Activity

Comments

@bjaeger1

bjaeger1 commented Dec 19, 2022

Bug Description

After calling
auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
the process gets stuck in what appears to be an infinite loop. I can also observe that the GPU load drops back to 0% after about 1 s.

According to issue #1409, this should already have been fixed.
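For reference, a minimal sketch of the call site that hangs; it mirrors the full run.cpp shown later in this thread, and the model path, input shape and precision are illustrative placeholders rather than the exact settings of the original model:

#include "torch/script.h"
#include "torch_tensorrt/torch_tensorrt.h"

// Illustrative sketch only: model path, input shape and precision are placeholders.
int main() {
  torch::jit::Module module = torch::jit::load("model_scripted.pth");
  module.to(torch::kCUDA);
  module.eval();

  auto input = torch_tensorrt::Input(std::vector<int64_t>{1, 3, 224, 224}, torch::kHalf);
  auto compile_settings = torch_tensorrt::torchscript::CompileSpec({input});
  compile_settings.enabled_precisions = {torch::kHalf};

  // Execution never returns from this call; the backtrace below points into
  // torch::jit::EliminateExceptions during lowering.
  auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
  return 0;
}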

Error message

1 __memmove_avx_unaligned 0x7fff79289cc1
2 std::vector<torch::jit::Use>::_M_erase(__gnu_cxx::__normal_iterator<torch::jit::Use *, std::vector<torch::jit::Use>>) 0x7fffab48412f
3 torch::jit::Value::replaceFirstUseWith(torch::jit::Value *) 0x7fffab46ff5d
4 torch::jit::Value::replaceAllUsesWith(torch::jit::Value *) 0x7fffab46ffcb
5 torch::jit::EliminateExceptions(torch::jit::Block *) 0x7fffab63c3c9
6 torch::jit::EliminateExceptions(std::shared_ptr<torch::jit::Graph>&) 0x7fffab63c999
7 torch_tensorrt::core::lowering::LowerGraph(std::shared_ptr<torch::jit::Graph>&, std::vector<c10::IValue>&, torch_tensorrt::core::lowering::LowerInfo) 0x7fffd7426b0d
8 torch_tensorrt::core::lowering::Lower(torch::jit::Module const&, std::string, torch_tensorrt::core::lowering::LowerInfo const&) 0x7fffd742a181
9 torch_tensorrt::core::CompileGraph(torch::jit::Module const&, torch_tensorrt::core::CompileSpec) 0x7fffd732b5a8
10 torch_tensorrt::torchscript::compile(torch::jit::Module const&, torch_tensorrt::torchscript::CompileSpec) 0x7fffd7313a04
11 ModelLoader::optimizeWithTensorRT modelloader.cpp 266 0x5ad43c
12 InferenceDisplay::<lambda()>::<lambda()>::operator() inferencedisplay.cpp 1330 0x58c996
13 std::_Function_handler<void(), InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>::<lambda()>>::_M_invoke(const std::_Any_data &) std_function.h 316 0x58c996
14 std::function<void ()>::operator()() const std_function.h 706 0x5cbcca
15 errorwrapper::loading(std::function<void ()>) errorwrapper.cpp 11 0x5cbcca
16 InferenceDisplay::<lambda()>::operator() inferencedisplay.cpp 1333 0x58e127
17 QtPrivate::FunctorCall<QtPrivate::IndexesList<>, QtPrivate::List<>, void, InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>>::call qobjectdefs_impl.h 146 0x58e127
18 QtPrivate::Functor<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0>::call<QtPrivate::List<>, void> qobjectdefs_impl.h 256 0x58e127
19 QtPrivate::QFunctorSlotObject<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0, QtPrivate::List<>, void>::impl(int, QtPrivate::QSlotObjectBase *, QObject *, void * *, bool *) qobjectdefs_impl.h 439 0x58e127
20 QMetaObject::activate(QObject *, int, int, void * *) 0x7fff7a163f8f
...

Expected behavior

Successful Torch-TensorRT optimization of a TorchScript model.

Environment

  • Torch-TensorRT Version: v1.3.0
  • PyTorch Version : 1.13.0
  • OS: Linux
  • PyTorch : libtorch 1.13+cu117
  • CUDA version: 11.7
  • cudnn version: 8.5.0.96
  • TensorRT version: 8.5.2.2
@bjaeger1 bjaeger1 added the bug Something isn't working label Dec 19, 2022
@peri044
Collaborator

peri044 commented Dec 21, 2022

Can you provide us a reproducer script with the model for us to investigate? Also, are you seeing this with the latest release (1.3) and with master too?

@bjaeger1
Author

bjaeger1 commented Dec 22, 2022

So far I have tested it only with the latest release, torch-tensorrt 1.3.0.
With some older libraries the TensorRT optimization ran fine (CUDA 11.3, cudnn 8.3.2.44, libtorch-1.10.2+cu113, TensorRT-8.4.0.6 and the torch-tensorrt version for libtorch-1.10.2).

I have built a minimal example with a pretrained model provided by PyTorch to reproduce the issue (you only need to update the paths in the .cpp file):
tensorrt_api.zip

You might need to set:
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:/usr/local/cudnn/lib:/usr/local/libtorch/lib:/usr/local/cuda/lib64

@bjaeger1
Author

Hi @peri044,
could you reproduce the issue with the provided files?

@bobby-chiu

Torch-TensorRT works for me when I follow this tutorial (https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/). But I got the same issue with the same traceback for my models. Any ideas to fix this bug? @bjaeger1 @peri044

I use pip to install the related packages as follows:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113 ~/downloads/torch_tensorrt-1.2.0-cp38-cp38-linux_x86_64.whl nvidia-pyindex nvidia-tensorrt==8.4.3.1

Environment

  • Python Version : 3.8.16
  • OS: Ubuntu 20.04.5 LTS
  • Torch-TensorRT Version: v1.2.0
  • PyTorch Version : 1.12.1+cu113
  • cudnn version: 8.3.2
  • TensorRT version: 8.5.3.1

@bjaeger1
Author

@bobby-chiu
I provided @peri044 the files to reproduce the issue, but I did not get any feedback. My current workaround is to just downgrade libtorch to 1.10.2+cu113 and use the working torch-tensorrt for that version.
other libs:
CUDA: 11.3
CUDNN: 8.3.2.44
tensorrt 8.4.0.6 (from nvidia)

@gcuendet
Contributor

gcuendet commented Apr 20, 2023

I suspect this might be related to issue #1823

@gs-olive
Collaborator

Hi @bjaeger1 - could you test your model with PR #1859? We recently identified an issue in torch::jit::EliminateExceptions, and fixed it in that PR, which is pending review. This may resolve the issue with your model as well.
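As a quick sanity check for whether a given model even exercises that pass, one can count the prim::RaiseException nodes in the scripted graph, since these are the nodes torch::jit::EliminateExceptions rewrites. A rough sketch, assuming a scripted module with a forward method; the model path is a placeholder:

#include <iostream>

#include "torch/csrc/jit/ir/ir.h"
#include "torch/script.h"

// Recursively count prim::RaiseException nodes in a block and its sub-blocks.
// A graph without such nodes should be unaffected by the EliminateExceptions pass.
static int count_raise_exception(torch::jit::Block* block) {
  static const auto raise_kind = c10::Symbol::fromQualString("prim::RaiseException");
  int count = 0;
  for (torch::jit::Node* node : block->nodes()) {
    if (node->kind() == raise_kind) {
      ++count;
    }
    for (torch::jit::Block* sub : node->blocks()) {
      count += count_raise_exception(sub);
    }
  }
  return count;
}

int main() {
  auto module = torch::jit::load("model_scripted.pth");  // placeholder path
  auto graph = module.get_method("forward").graph();
  std::cout << "prim::RaiseException nodes: " << count_raise_exception(graph->block()) << std::endl;
  return 0;
}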

@bjaeger1
Author

bjaeger1 commented Jul 6, 2023

Hi @gs-olive - I just tried out PR #1859.
However, I was not able to build torch-tensorrt successfully.

After building the docker image with:

DOCKER_BUILDKIT=1 docker build --build-arg BASE=23.06 --build-arg CUDNN_VERSION=8.9 --build-arg TENSORRT_VERSION=8.6 --build-arg PYTHON_VERSION=3.10 -f docker/Dockerfile -t torch_tensorrt:latest .

and running the container:

nvidia-docker run --gpus all -it --shm-size=8gb --env="DISPLAY" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --name=torch_tensorrt --ipc=host --net=host torch_tensorrt:latest

it fails when checking if torch-tensorrt was compiled successfully:

bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config use_precompiled_torchtrt --config pre_cxx11_abi

ERROR:
2023/07/06 13:45:54 Downloading https://releases.bazel.build/5.2.0/release/bazel-5.2.0-linux-x86_64...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
ERROR: /opt/torch_tensorrt/WORKSPACE:57:21: fetching new_local_repository rule //external:libtorch_pre_cxx11_abi: java.io.IOException: The repository's path is "/opt/python3/site-packages/torch/" (absolute: "/opt/python3/site-packages/torch") but this directory does not exist.
ERROR: /opt/torch_tensorrt/tests/core/conversion/converters/BUILD:10:15: //tests/core/conversion/converters:test_activation depends on @libtorch_pre_cxx11_abi//:libtorch in repository @libtorch_pre_cxx11_abi which failed to fetch. no such package '@libtorch_pre_cxx11_abi//': The repository's path is "/opt/python3/site-packages/torch/" (absolute: "/opt/python3/site-packages/torch") but this directory does not exist.
ERROR: Analysis of target '//tests/core/conversion/converters:test_activation' failed; build aborted: Analysis failed
INFO: Elapsed time: 6.027s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (35 packages loaded, 128 targets configured)
FAILED: Build did NOT complete successfully (35 packages loaded, 128 targets configured)
Fetching @local_config_cc; Restarting.
Fetching @googletest; Cloning 703bd9caab50b139428cea1aaff9974ebee5742e of https://github.com/google/googletest

Output when building the docker image:
=> [internal] load build definition from Dockerfile
=> => transferring dockerfile: 5.17kB
=> [internal] load .dockerignore
=> => transferring context: 1.00kB
=> [internal] load metadata for docker.io/nvidia/cuda:11.8.0-devel-ubuntu22.04
=> [internal] load build context
=> => transferring context: 58.40MB
=> [base 1/21] FROM docker.io/nvidia/cuda:11.8.0-devel-ubuntu22.04
=> [base 2/21] RUN test -n "8.6" || (echo "No tensorrt version specified, please use --build-arg TENSORRT_VERSION=x.y to specify a version." && exit 1)
=> [base 3/21] RUN test -n "8.9" || (echo "No cudnn version specified, please use --build-arg CUDNN_VERSION=x.y to specify a version." && exit 1)
=> [base 4/21] RUN apt-get update
=> [base 5/21] RUN apt install -y build-essential manpages-dev wget zlib1g software-properties-common git libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget ca-certificates curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev
=> [base 6/21] RUN wget -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer && chmod 755 pyenv-installer && bash pyenv-installer && eval "$(pyenv init -)"
=> [base 7/21] RUN pyenv install -v 3.10
=> [base 8/21] RUN pyenv global 3.10
=> [base 9/21] RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
=> [base 10/21] RUN mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
=> [base 11/21] RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/7fa2af80.pub
=> [base 12/21] RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 536F8F1DE80F6A35
=> [base 13/21] RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
=> [base 14/21] RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
=> [base 15/21] RUN apt-get update
=> [base 16/21] RUN apt-get install -y libcudnn8=8.9* libcudnn8-dev=8.9* 108.6s
=> [base 17/21] RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
=> [base 18/21] RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
=> [base 19/21] RUN apt-get update
=> [base 20/21] RUN apt-get install -y libnvinfer8=8.6.* libnvinfer-plugin8=8.6.* libnvinfer-dev=8.6.* libnvinfer-plugin-dev=8.6.* libnvonnxparsers8=8.6.* libnvonnxparsers-dev=8.6.* libnvparsers8=8.6.* libnvparsers-dev=8.6.*
=> [base 21/21] RUN wget -q https://github.com/bazelbuild/bazelisk/releases/download/v1.16.0/bazelisk-linux-amd64 -O /usr/bin/bazel && chmod a+x /usr/bin/bazel
=> [torch-tensorrt-builder-base 1/3] RUN apt-get install -y python3-setuptools
=> [torch-tensorrt 1/7] COPY . /opt/torch_tensorrt => [torch-tensorrt-builder-base 2/3] RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
=> [torch-tensorrt-builder-base 3/3] RUN apt-get update && apt-get install -y --no-install-recommends locales ninja-build && rm -rf /var/lib/apt/lists/* && locale-gen en_US.UTF-8
=> [torch-tensorrt-builder 1/6] COPY . /workspace/torch_tensorrt/src
=> [torch-tensorrt-builder 2/6] WORKDIR /workspace/torch_tensorrt/src => [torch-tensorrt-builder 3/6] RUN cp ./docker/WORKSPACE.docker WORKSPACE
=> [torch-tensorrt-builder 4/6] RUN mkdir -p "/opt/python3/" && ln -s "pyenv which python | xargs dirname | xargs dirname/lib/python3.10/site-packages" "/opt/python3/"
=> [torch-tensorrt-builder 5/6] RUN CUDA_BASE_IMG_VERSION_INTERMEDIATE=echo ${BASE_IMG#*:} && CUDA_BASE_IMG_VERSION=echo ${CUDA_BASE_IMG_VERSION_INTERMEDIATE%%-*} && CUDA_MAJOR_MINOR_VERSION=echo ${CUDA_BASE_IMG_VERSION%.*} && rm -fr /usr/local/cuda && ln 0.5s
=> [torch-tensorrt-builder 6/6] RUN bash ./docker/dist-build.sh
=> [torch-tensorrt 2/7] COPY --from=torch-tensorrt-builder /workspace/torch_tensorrt/src/py/dist/ .
=> [torch-tensorrt 3/7] RUN cp /opt/torch_tensorrt/docker/WORKSPACE.docker /opt/torch_tensorrt/WORKSPACE
=> [torch-tensorrt 4/7] RUN pip install -r /opt/torch_tensorrt/py/requirements.txt
=> [torch-tensorrt 5/7] RUN pip install tensorrt==8.6.*
=> [torch-tensorrt 6/7] RUN pip install .whl && rm -fr /workspace/torch_tensorrt/py/dist/ *.whl
=> [torch-tensorrt 7/7] WORKDIR /opt/torch_tensorrt
=> exporting to image
=> => exporting layers
=> => writing image sha256:19df2980268d6c1d5692986c9d743acba55615d3e6cd59101537bf47d72b2d38
=> => naming to docker.io/library/torch_tensorrt:latest

@gs-olive
Collaborator

gs-olive commented Jul 6, 2023

Hi @bjaeger1 - I just rebased #1859 onto main, since a few dependencies were incorrect in the existing version. Additionally, regarding the docker build, we recommend using the default (BASE_IMG=nvidia/cuda:12.1.1-devel-ubuntu22.04) as the base image; the following works on my machine (when building from the #1859 PR branch directly):

DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.9 --build-arg TENSORRT_VERSION=8.6 --build-arg PYTHON_VERSION=3.10 -f docker/Dockerfile -t torch_tensorrt:latest .

Could you try the build again with a fresh pull of that branch?

@bjaeger1
Author

bjaeger1 commented Jul 7, 2023

Hi @gs-olive , thanks for the quick answer!
I just made a fresh pull of your eliminate_exceptions_removal branch and built a new image with the default base image.

However, the command:
bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config use_precompiled_torchtrt --config pre_cxx11_abi
still fails with the same error message. It can't find /opt/python3/site-packages/torch/ (because there is no such directory).

But the docker image build process states:
=> CACHED [torch-tensorrt-builder 4/6] RUN mkdir -p "/opt/python3/" && ln -s "pyenv which python | xargs dirname | xargs dirname/lib/python3.10/site-packages" "/opt/python3/"

@gs-olive
Collaborator

gs-olive commented Jul 7, 2023

Hi @bjaeger1 - thanks for the follow-up. I was able to reproduce the issue and I addressed the problem in #2085, which adds the necessary /opt/python3 symlink in the final container. Could you add that code to the Dockerfile at the specified location, rebuild the container, and run:

bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary  --config pre_cxx11_abi

(I had to remove --config use_precompiled_torchtrt, since recompilation is needed here to use the C++ APIs).

@bjaeger1
Author

bjaeger1 commented Jul 10, 2023

Hi @gs-olive , after adding the mentioned lines for the symlink to the Dockerfile, I was finally able to successfully build torch_tensorrt. Thanks!
PS: One unit test failed:
FAIL: //tests/core/conversion/converters:test_activation (see /root/.cache/bazel/bazel_root/272136ab8790bfb0c01be73c7cbef828/execroot/Torch-TensorRT/bazel-out/k8-opt/testlogs/tests/core/conversion/converters/test_activation/test.log)

I currently have difficulties building my minimal example (linking torch_tensorrt against libtorch fails with undefined references). I'll come back once I have fixed that issue.

@bjaeger1
Author

@gs-olive, I would assume that the linking error occurs because the torch_tensorrt I built uses CUDA 12.1 (BASE_IMG=nvidia/cuda:12.1.1-devel-ubuntu22.04), but the libtorch version I downloaded from the official website is built with CUDA 11.8.

Trying to build torch_tensorrt with BASE_IMG=nvidia/cuda:11.8.0-devel-ubuntu22.04 is somehow not possible; it fails with:

36 225.8 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
36 225.8 running bdist_wheel
36 225.8 2023/07/10 13:28:45 Downloading https://releases.bazel.build/5.2.0/release/bazel-5.2.0-linux-x86_64...
36 226.6 Extracting Bazel installation...
36 233.4 Starting local Bazel server and connecting to it...
36 234.6 Loading: 
36 234.6 Loading: 0 packages loaded
36 235.7 Loading: 0 packages loaded
36 235.7     currently loading: 
36 235.8 Analyzing: target //:libtorchtrt (1 packages loaded, 0 targets configured)
36 236.6 ERROR: /workspace/torch_tensorrt/src/WORKSPACE:35:21: fetching new_local_repository rule //external:cuda: java.io.IOException: The repository's path is "/usr/local/cuda/" (absolute: "/usr/local/cuda") but this directory does not exist.
36 236.7 ERROR: /root/.cache/bazel/_bazel_root/67e780960d5a9319501f4e5a06fc77dc/external/tensorrt/BUILD.bazel:108:11: @tensorrt//:nvinfer depends on @cuda//:cudart in repository @cuda which failed to fetch. no such package '@cuda//': The repository's path is "/usr/local/cuda/" (absolute: "/usr/local/cuda") but this directory does not exist.
36 236.7 ERROR: Analysis of target '//:libtorchtrt' failed; build aborted: 
36 236.7 INFO: Elapsed time: 10.139s
36 236.7 INFO: 0 processes.
36 236.7 FAILED: Build did NOT complete successfully (67 packages loaded, 1423 targets configured)
36 236.7 FAILED: Build did NOT complete successfully (67 packages loaded, 1423 targets configured)
36 236.7 building libtorchtrt
------
executor failed running [/bin/sh -c bash ./docker/dist-build.sh]: exit code: 1

@bjaeger1 bjaeger1 changed the title from "🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck bug caused by elimination exception" to "🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck; bug caused by elimination exception" on Jul 10, 2023
@gs-olive
Collaborator

Hi @bjaeger1 - thanks for the follow-up. It seems likely that the mismatched libtorch/CUDA versions are contributing to this issue. We very recently updated the stack on main to CUDA 12.1 from 11.8. If you prefer to use CUDA 11.8, I would recommend applying commit f05290a, which rolls back to CUDA 11.8 everywhere in the code, except in the CI. For a branch containing these changes applied to main, see cuda_118_rollback. You can clone the branch and run the following to build the container, and run any necessary tests.

DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.8 --build-arg TENSORRT_VERSION=8.6 -f docker/Dockerfile -t torch_tensorrt:latest .

@bjaeger1
Author

bjaeger1 commented Jul 11, 2023

Hi @gs-olive, I prefer to use CUDA 11.8 and wait until the PyTorch binaries are released with CUDA 12.1, rather than building from source.

I pulled the cuda_118_rollback branch and built the image. The torch_tensorrt build fails with:

Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Analyzed target //tests/core/conversion/converters:test_activation (75 packages loaded, 9001 targets configured).
INFO: Found 1 test target...
ERROR: /root/.cache/bazel/_bazel_root/272136ab8790bfb0c01be73c7cbef828/external/cuda/BUILD.bazel:12:11: SolibSymlink _solib_k8/_U@cuda_S_S_Ccudart___Ulib64/libcudart.so failed: missing input file 'external/cuda/lib64/libcudart.so', owner: '@cuda//:lib64/libcudart.so'
ERROR: /root/.cache/bazel/_bazel_root/272136ab8790bfb0c01be73c7cbef828/external/cuda/BUILD.bazel:12:11: SolibSymlink _solib_k8/_U@cuda_S_S_Ccudart___Ulib64/libcudart.so failed: 1 input file(s) do not exist
Target //tests/core/conversion/converters:test_activation failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /root/.cache/bazel/_bazel_root/272136ab8790bfb0c01be73c7cbef828/external/cuda/BUILD.bazel:12:11 SolibSymlink _solib_k8/_U@cuda_S_S_Ccudart___Ulib64/libcudart.so failed: 1 input file(s) do not exist
INFO: Elapsed time: 11.506s, Critical Path: 0.04s
INFO: 30 processes: 30 internal.
FAILED: Build did NOT complete successfully
//tests/core/conversion/converters:test_activation              FAILED TO BUILD

In /root/.cache/bazel/_bazel_root/272136ab8790bfb0c01be73c7cbef828/external/cuda a lot of *.so files are missing. (I compared the files from the docker container built with CUDA 12.1 to the one built with CUDA 11.8)

Again, thanks a lot for your effort!

PS: I also tried to add the changes from commit f05290a to the eliminate_exceptions_removal branch together with BASE_IMG=nvidia/cuda:11.8.0-devel-ubuntu22.04, but also without success.

@gs-olive
Collaborator

Just to check - is this failure occurring during the docker build command, or once the container starts? I am able to build the container successfully on my machine, which is why I ask.

@bjaeger1
Author

bjaeger1 commented Jul 12, 2023

The docker build command works fine. In both cases described in my previous comment, the problem arises inside the running docker container, either running:
bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi

or:
/docker/dist-build.sh

@bjaeger1
Author

Update: I just installed the libtorch nightly binary, which comes with CUDA 12.1.

I built a torch_tensorrt docker image with CUDA 12.1 and then the torch_tensorrt library.
When compiling my minimal example, linking fails with 3 undefined references to glibc:

~/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api> ./build.sh
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-11 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-11 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TensorRT headers at /usr/local/TensorRT/include
-- Found TensorRT libs at /usr/local/TensorRT/lib/libnvinfer.so;/usr/local/TensorRT/lib/libnvinfer_plugin.so
-- Found TENSORRT: /usr/local/TensorRT/include
-- Found CUDA: /usr/local/cuda (found version "12.1")
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.1.105")
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is b51b459d
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s): 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: /usr/local/libtorch/lib/libtorch.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jrb/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api/build
[2/2] Linking CXX executable TensorRT-app
FAILED: TensorRT-app
: && /usr/bin/g++-11 -D_GLIBCXX_USE_CXX11_ABI=0 -g -rdynamic CMakeFiles/TensorRT-app.dir/TensorRT-app.cpp.o -o TensorRT-app -L/usr/local/torch_tensorrt/lib -Wl,-rpath,/usr/local/cudnn/lib:/usr/local/TensorRT/lib:/usr/local/torch_tensorrt/lib:/usr/local/libtorch/lib:/usr/local/cuda/lib64 /usr/local/cudnn/lib/libcudnn.so.8 /usr/local/TensorRT/lib/libnvinfer.so /usr/local/TensorRT/lib/libnvinfer_plugin.so -ltorchtrt /usr/local/libtorch/lib/libtorch.so /usr/local/libtorch/lib/libc10.so /usr/local/libtorch/lib/libkineto.a -lcuda /usr/local/cuda/lib64/libnvrtc.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/local/libtorch/lib/libc10_cuda.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cuda.so" -Wl,--as-needed /usr/local/libtorch/lib/libc10_cuda.so /usr/local/libtorch/lib/libc10.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublasLt.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch.so" -Wl,--as-needed && :

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to lstat@GLIBC_2.33

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to __libc_single_threaded@GLIBC_2.32

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to stat@GLIBC_2.33

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

@gs-olive
Collaborator

That makes sense, thank you for the details. I was able to reproduce the error, and it seems to be caused by an issue in the WORKSPACE.docker file when linking cuda. I've addressed the issue in the cuda_118_rollback branch, and I am now able to build the container on that branch:

DOCKER_BUILDKIT=1 docker build --build-arg CUDNN_VERSION=8.8 --build-arg TENSORRT_VERSION=8.6 -f docker/Dockerfile -t torch_tensorrt:latest .

The following command succeeds from within the container, on my machine:

bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary  --config pre_cxx11_abi

Please let me know if the latest updates work for you as well.

@bjaeger1
Author

bjaeger1 commented Jul 13, 2023

Hi, with the latest update I was able to build the image and run the container. You only changed
path = "/usr/local/cuda/" to path = "/usr/local/cuda-11.8/" in WORKSPACE.docker, right?

The commands:
bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi and /docker/dist-build.sh finally work, thanks!

When compiling my example there are a lot of undefined references:

[2/2] Linking CXX executable TensorRT-app
FAILED: TensorRT-app
: && /usr/bin/g++-11 -D_GLIBCXX_USE_CXX11_ABI=0 -g -rdynamic CMakeFiles/TensorRT-app.dir/TensorRT-app.cpp.o -o TensorRT-app -L/usr/local/torch_tensorrt/lib -Wl,-rpath,/usr/local/cudnn/lib:/usr/local/TensorRT/lib:/usr/local/torch_tensorrt/lib:/usr/local/libtorch/lib:/usr/local/cuda/lib64 /usr/local/cudnn/lib/libcudnn.so.8 /usr/local/TensorRT/lib/libnvinfer.so /usr/local/TensorRT/lib/libnvinfer_plugin.so -ltorchtrt /usr/local/libtorch/lib/libtorch.so /usr/local/libtorch/lib/libc10.so /usr/local/libtorch/lib/libkineto.a -lcuda /usr/local/cuda/lib64/libnvrtc.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/local/libtorch/lib/libc10_cuda.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cuda.so" -Wl,--as-needed /usr/local/libtorch/lib/libc10_cuda.so /usr/local/libtorch/lib/libc10.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublasLt.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch.so" -Wl,--as-needed && :

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::cuda::SetDevice(int)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::SymInt::promote_to_negative()

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::detail::ListImpl::ListImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >, c10::Type::SingletonOrSharedTypePtr<c10::Type>)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to lstat@GLIBC_2.33

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to __libc_single_threaded@GLIBC_2.32

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to at::_ops::randint_low::call(c10::SymInt, c10::SymInt, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::cuda::ExchangeDevice(int)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::cuda::MaybeSetDevice(int)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::cuda::GetDevice(int*)

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::SymInt::toSymNode() const

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to stat@GLIBC_2.33

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to c10::TensorImpl::throw_data_ptr_access_error() const

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

When comparing these undefined references to the ones from my previous comment (where I used the nightly libtorch version with CUDA 12.1 and also built torch_tensorrt with CUDA 12.1), the same 3 glibc references are missing, plus additional c10 errors.

I guess there is still a (CUDA?) version mismatch somewhere in torch_tensorrt.
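One way to narrow such a mismatch down is to print the CUDA driver and runtime versions that the executable actually sees. A small standalone sketch using the CUDA runtime API (not part of the minimal example above):

#include <cstdio>

#include <cuda_runtime_api.h>

// Print the CUDA driver and runtime versions reported by the linked CUDA runtime,
// to help spot a mismatch between the toolkits used for libtorch, torch_tensorrt
// and the local installation.
int main() {
  int driver_version = 0;
  int runtime_version = 0;
  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);
  std::printf("CUDA driver : %d.%d\n", driver_version / 1000, (driver_version % 1000) / 10);
  std::printf("CUDA runtime: %d.%d\n", runtime_version / 1000, (runtime_version % 1000) / 10);
  return 0;
}

Linking this against the same libcudart.so that the example uses (e.g. /usr/local/cuda/lib64/libcudart.so) shows which runtime actually gets resolved.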

@gs-olive
Collaborator

gs-olive commented Jul 13, 2023

Yes, changing the path in the WORKSPACE.docker was the update that fixed the initial issue.

After discussing the issue with @peri044: could you try the following from within the container prior to building your example? The issue may be with the use of pre_cxx11_abi naming in the container.

# Make directory for modeling files
mkdir modeling
cd modeling

# Make compilation/Torch-TRT build file
touch run.cpp
# See below for demo BUILD file
touch BUILD

# Build using Bazel
cd ..
bazel build modeling:my_custom_model --config pre_cxx11_abi

Demo BUILD file - adapted from cpp/bin/torchtrtc/BUILD

load("@rules_pkg//:pkg.bzl", "pkg_tar")

config_setting(
    name = "use_pre_cxx11_abi",
    values = {
        "define": "abi=pre_cxx11_abi",
    },
)

cc_binary(
    name = "my_custom_model",
    srcs = [
            "run.cpp"
    ],
    linkopts = [
        "-ldl",
    ],
    deps = [
        "//third_party/args",
        "//cpp:torch_tensorrt",
    ] + select({
        ":use_pre_cxx11_abi": [
            "@libtorch_pre_cxx11_abi//:libtorch",
            "@libtorch_pre_cxx11_abi//:caffe2",
        ],
        "//conditions:default": [
            "@libtorch//:libtorch",
            "@libtorch//:caffe2",
        ],
    }),
)

@bjaeger1
Author

I tried your suggestion but the build failed:

root@pc:/opt/torch_tensorrt# bazel build modeling:my_custom_model --config pre_cxx11_abi
INFO: Analyzed target //modeling:my_custom_model (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /opt/torch_tensorrt/modeling/BUILD:10:10: Linking modeling/my_custom_model failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc @bazel-out/k8-fastbuild/bin/modeling/my_custom_model-2.params

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o:function _start: error: undefined reference to 'main'
collect2: error: ld returned 1 exit status
Target //modeling:my_custom_model failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.802s, Critical Path: 0.40s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully

@gs-olive
Collaborator

Thanks for testing that out. I was able to reproduce that message, but only when the run.cpp file is empty. If I populate the run.cpp file (only a main function is really required) as follows, for instance:

Demo run.cpp

#include "torch/csrc/autograd/grad_mode.h"
#include "torch/csrc/jit/runtime/graph_executor.h"
#include "torch/script.h"
#include "torch_tensorrt/logging.h"
#include "torch_tensorrt/torch_tensorrt.h"

int main(int argc, char** argv) {
  // Compile, infer, ...
  printf("Output ...\n");
  return 0;
}

Then I run the following:

cd /opt/torch_tensorrt
bazel build modeling:my_custom_model --config pre_cxx11_abi
./bazel-bin/modeling/my_custom_model

The above succeeds on my instance of the container.

@bjaeger1
Author

bjaeger1 commented Jul 17, 2023

Hi @gs-olive, on my side your example now also works fine!

Analogous to your example, I made another folder in /opt/torch_tensorrt/ containing a BUILD file (same as yours) and a run.cpp file.
My run.cpp:

#include "torch/csrc/autograd/grad_mode.h"
#include "torch/csrc/jit/runtime/graph_executor.h"
#include "torch/script.h"
#include "torch_tensorrt/logging.h"
#include "torch_tensorrt/torch_tensorrt.h"

#include <iostream>
#include <string>

int main()
{
  //Paths
  std::string path = "/opt/torch_tensorrt/minimal_ex/model_scripted.pth";
  std::string path_save = "test.trtt";

  // load the model
  torch::jit::Module module;
  try
  {
    // deserialize the ScriptModule from a file using torch::jit::load().
    module = torch::jit::load(path);
  }
  catch(const std::runtime_error& re)
  {
    std::cerr << "Runtime error: " << re.what() << std::endl;
    return 0;
  }
  catch(const std::exception& ex)
  {
    std::cerr << "Error occurred: " << ex.what() << std::endl;
    return 0;
  }
  catch(...)
  {
    std::cerr << "Unknown failure occurred. Possible memory corruption" << std::endl;
    return 0;    
  }
  std::cout << "Scripted Model successfully loaded!\n";

  module.to(torch::kCUDA); //run on GPU
  module.eval();

  // compile using Torch-TensorRT
  // example input
  auto example_input = torch_tensorrt::Input(std::vector<int64_t>{1, 3, 224, 224}, torch::kHalf);
  auto compile_settings = torch_tensorrt::torchscript::CompileSpec({example_input});
  // fp16 execution
  compile_settings.enabled_precisions = {torch::kHalf};

  // compile module to tensorrt
  std::cout << "Its compiling......\n";
  auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);

  // TensorRT conversion successful
  std::cout << "Conversion from Scripted Model to TensorRT Model was successful!\n";
  std::cout << "TensorRT Model is saved at: " << path_save << "\n";

  // save TensorRT model for later
  trt_mod.save(path_save);
}

Then I run the following

cd /opt/torch_tensorrt
bazel build minimal_ex:my_custom_model --config pre_cxx11_abi
./bazel-bin/minimal_ex/my_custom_model

The example also builds successfully, but when running the executable the torch-tensorrt compile function again gets stuck in an infinite loop...
The output I get is:

...
DEBUG: [Torch-TensorRT - Debug Build] - Registering evaluator for prim::TupleIndex
DEBUG: [Torch-TensorRT - Debug Build] - Registering evaluator for prim::TupleUnpack
DEBUG: [Torch-TensorRT - Debug Build] - Registering evaluator for prim::unchecked_cast
DEBUG: [Torch-TensorRT - Debug Build] - Registering evaluator for prim::Uninitialized
DEBUG: [Torch-TensorRT - Debug Build] - Registering evaluator for prim::RaiseException
Scripted Model successfully loaded!
Its compiling......
DEBUG: [Torch-TensorRT - Debug Build] - Torch-TensorRT Version: 1.5.0
Using TensorRT Version: 8.6.1.6
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

DEBUG: [Torch-TensorRT - Debug Build] - Settings requested for Lowering:
    torch_executed_modules: [
    ]
-->>>>> gets stuck here!

@gs-olive
Collaborator

Thanks for the update, and good to hear it gets through more of the compilation! I just rebased the cuda_118_rollback branch onto main, which now includes the fix in #1859, which is intended to address this halting issue. Could you try pulling the branch and building the Docker container from scratch again, then running your example on the new container?

@bjaeger1
Author

bjaeger1 commented Jul 18, 2023

I have really good news! The torch-tensorrt compilation of my minimal example finally works inside the container! The issue of the infinite loop is solved! Great, thanks a lot!

From which folder inside the running container am I supposed to download the built torch-tensorrt files? I want to be able to compile my minimal example also outside the docker container.
For example I tried:
docker cp <container-Id>:/opt/torch_tensorrt/py/torch_tensorrt /path/to/local/folder
or
docker cp <container-Id>:/opt/torch_tensorrt/py/dist/torch_tensorrt-1.5.0.dev0+65d509c9-cp310-cp310-linux_x86_64.whl /path/to/local/folder

However, when compiling the example on my system outside the container (with CMake) there is a linking error.

Locally I have symlinks to the respective libraries:

/usr/local/cuda -> /cuda-11.8
/usr/local/cudnn -> cudnn-linux-x86_64-8.8.1.3_cuda11-archive/
/usr/local/libtorch-> /libtorch-2.0.1+cu118-precxx11ABI/
/usr/local/TensorRT -> /TensorRT-8.6.1.6/
/usr/local/torch_tensorrt -> /torch-tensorrt-1.5.0-dev/py/torch_tensorrt

@gs-olive
Collaborator

Great to hear it works on the example now! After discussing with @narendasan, there are a few options for using the example outside the Docker container. One is to copy the torch_tensorrt directory to your local machine as you did, and then modify this line in the WORKSPACE to point to the path of that local folder:

TensorRT/WORKSPACE

Lines 34 to 38 in e884820

# External dependency for torch_tensorrt if you already have precompiled binaries.
local_repository(
name = "torch_tensorrt",
path = "/opt/conda/lib/python3.8/site-packages/torch_tensorrt",
)

Then, you can rebuild the bazel target similarly to what you did from the Docker container.

Alternatively, you could rebuild the library on your local machine and add a new target for the run.cpp file. See here for a sample/tutorial.

@bjaeger1
Author

bjaeger1 commented Jul 21, 2023

Short update: applying only our change is not enough. Bazel somehow cannot fetch the libraries:

new_local_repository(
    name = "libtorch",
    path = "/usr/local/torch_tensorrt-1.5.0-dev2/opt/python3/site-packages/torch/",
    build_file = "third_party/libtorch/BUILD"
)

new_local_repository(
    name = "libtorch_pre_cxx11_abi",
    path = "/usr/local/torch_tensorrt-1.5.0-dev2/opt/python3/site-packages/torch/",
    build_file = "third_party/libtorch/BUILD"
)

I therefore added instead:

http_archive(
    name = "libtorch",
    build_file = "@//third_party/libtorch:BUILD",
    sha256 = "b0e5b7fd8935d317dd91ca39dbb1f35e0a9a261550eae77249d947a908c3493e",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.0.dev20230703%2Bcu118.zip"],
)

http_archive(
    name = "libtorch_pre_cxx11_abi",
    build_file = "@//third_party/libtorch:BUILD",
    sha256 = "49ec0e4dbf332058bffdd011e83c597d35c1531e465c02cc2cda10959a6b5063",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/nightly/cu118/libtorch-shared-with-deps-2.1.0.dev20230703%2Bcu118.zip"],
)

Also, the local cudnn & torch_tensorrt were not found, so I changed the paths from path="/usr" to path="/usr/local/cudnn" and path="/usr/local/torch_tensorrt" (which makes sense).

The problem then is the following:

ERROR: /root/.cache/bazel/_bazel_root/58fc92e22fb39b158c690f9065b2b6f2/external/tensorrt/BUILD.bazel:29:11: @tensorrt//:nvinfer_headers: missing input file 'external/tensorrt/include/x86_64-linux-gnu/NvUtils.h', owner: '@tensorrt//:include/x86_64-linux-gnu/NvUtils.h'
Target //minimal_ex:my_custom_model failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /root/.cache/bazel/_bazel_root/58fc92e22fb39b158c690f9065b2b6f2/external/tensorrt/BUILD.bazel:97:10 1 input file(s) do not exist
INFO: Elapsed time: 0.387s, Critical Path: 0.01s
INFO: 0 processes.
FAILED: Build did NOT complete successfully

which looks like it is related to issue #45, although I am not sure which lines to modify in third_party/tensorrt/BUILD.

PS: in WORKSPACE I also had to remove the "-" in the workspace name ("Torch-TensorRT" --> "TorchTensorRT")

@narendasan
Collaborator

Short update: applying only our change is not enough. Bazel somehow cannot fetch the libraries:

new_local_repository(
   name = "libtorch",
   path = "/usr/local/torch_tensorrt-1.5.0-dev2/opt/python3/site-packages/torch/",
   build_file = "third_party/libtorch/BUILD"
)

new_local_repository(
   name = "libtorch_pre_cxx11_abi",
   path = "/usr/local/torch_tensorrt-1.5.0-dev2/opt/python3/site-packages/torch/",
   build_file = "third_party/libtorch/BUILD"
)

I therefore added instead:

http_archive(
   name = "libtorch",
   build_file = "@//third_party/libtorch:BUILD",
   sha256 = "b0e5b7fd8935d317dd91ca39dbb1f35e0a9a261550eae77249d947a908c3493e",
   strip_prefix = "libtorch",
   urls = ["https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-2.1.0.dev20230703%2Bcu118.zip"],
)

http_archive(
   name = "libtorch_pre_cxx11_abi",
   build_file = "@//third_party/libtorch:BUILD",
   sha256 = "49ec0e4dbf332058bffdd011e83c597d35c1531e465c02cc2cda10959a6b5063",
   strip_prefix = "libtorch",
   urls = ["https://download.pytorch.org/libtorch/nightly/cu118/libtorch-shared-with-deps-2.1.0.dev20230703%2Bcu118.zip"],
)

This is fine as long as 2.1.0.dev20230703+cu118 is the version you used to build torch-tensorrt in the container or you are building from source (make sure to set the cuda version to 11.8 for that dependency in the workspace)

Also, the local cudnn & torch_tensorrt were not found, so I changed the paths from path="/usr" to path="/usr/local/cudnn" and path="/usr/local/torch_tensorrt" (which makes sense).

The problem then is the following:

ERROR: /root/.cache/bazel/_bazel_root/58fc92e22fb39b158c690f9065b2b6f2/external/tensorrt/BUILD.bazel:29:11: @tensorrt//:nvinfer_headers: missing input file 'external/tensorrt/include/x86_64-linux-gnu/NvUtils.h', owner: '@tensorrt//:include/x86_64-linux-gnu/NvUtils.h'
Target //minimal_ex:my_custom_model failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /root/.cache/bazel/_bazel_root/58fc92e22fb39b158c690f9065b2b6f2/external/tensorrt/BUILD.bazel:97:10 1 input file(s) do not exist
INFO: Elapsed time: 0.387s, Critical Path: 0.01s
INFO: 0 processes.
FAILED: Build did NOT complete successfully

which looks like it is related to this issue: https://github.com/pytorch/TensorRT/issues/45 although I am not sure which lines to modify in third_party/tensorrt/BUILD

PS: in WORKSPACE I also had to remove the "-" in the workspace name ("Torch-TensorRT" --> "TorchTensorRT")

For building with bazel, the easiest (i.e. least error-prone) way to include the cudnn and tensorrt dependencies is to use the tarballs, without unpacking them, as inputs to http_archive.

On my systems, after downloading the correct builds from developer.nvidia.com, my workspace looks like this (give or take a few changes related to version upgrades):

workspace(name = "Torch-TensorRT")

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "rules_python",
    sha256 = "863ba0fa944319f7e3d695711427d9ad80ba92c6edd0b7c7443b84e904689539",
    strip_prefix = "rules_python-0.22.0",
    url = "https://github.com/bazelbuild/rules_python/releases/download/0.22.0/rules_python-0.22.0.tar.gz",
)

load("@rules_python//python:repositories.bzl", "py_repositories")

py_repositories()

http_archive(
    name = "rules_pkg",
    sha256 = "8f9ee2dc10c1ae514ee599a8b42ed99fa262b757058f65ad3c384289ff70c4b8",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/rules_pkg/releases/download/0.9.1/rules_pkg-0.9.1.tar.gz",
        "https://github.com/bazelbuild/rules_pkg/releases/download/0.9.1/rules_pkg-0.9.1.tar.gz",
    ],
)

load("@rules_pkg//:deps.bzl", "rules_pkg_dependencies")

rules_pkg_dependencies()

http_archive(
    name = "googletest",
    sha256 = "755f9a39bc7205f5a0c428e920ddad092c33c8a1b46997def3f1d4a82aded6e1",
    strip_prefix = "googletest-5ab508a01f9eb089207ee87fd547d290da39d015",
    urls = ["https://github.com/google/googletest/archive/5ab508a01f9eb089207ee87fd547d290da39d015.zip"],
)

# External dependency for torch_tensorrt if you already have precompiled binaries.
local_repository(
    name = "torch_tensorrt",
    path = "/opt/conda/lib/python3.8/site-packages/torch_tensorrt",
)

# CUDA should be installed on the system locally
new_local_repository(
    name = "cuda",
    build_file = "@//third_party/cuda:BUILD",
    path = "/usr/local/cuda-12.1/",
)

#############################################################################################################
# Tarballs and fetched dependencies (default - use in cases when building from precompiled bin and tarballs)
#############################################################################################################

http_archive(
    name = "libtorch",
    build_file = "@//third_party/libtorch:BUILD",
    sha256 = "1ae8366aaf7af7f68f142ba644fe26c837c6fa8347ec6bd9ce605ac60e7f7e5e",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-cxx11-abi-shared-with-deps-2.1.0.dev20230703%2Bcu121.zip"],
)

http_archive(
    name = "libtorch_pre_cxx11_abi",
    build_file = "@//third_party/libtorch:BUILD",
    sha256 = "9add4832f4da9223866d85810820b816ab3319d5a227066101eeb6cbb76adb4b",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-shared-with-deps-2.1.0.dev20230703%2Bcu121.zip"],
)

http_archive(
    name = "cudnn",
    build_file = "@//third_party/cudnn/archive:BUILD",
    sha256 = "79d77a769c7e7175abc7b5c2ed5c494148c0618a864138722c887f95c623777c",
    strip_prefix = "cudnn-linux-x86_64-8.8.1.3_cuda12-archive",
    urls = [
        "file:///<ABSOLUTE PATH TO DOWNLOAD ON SYSTEM>/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz",
        "https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz",
    ],
)

http_archive(
    name = "tensorrt",
    build_file = "@//third_party/tensorrt/archive:BUILD",
    sha256 = "0f8157a5fc5329943b338b893591373350afa90ca81239cdadd7580cd1eba254",
    strip_prefix = "TensorRT-8.6.1.6",
    urls = [
        "file:///<ABSOLUTE PATH TO DOWNLOAD ON SYSTEM>/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz",
        "https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/8.6.1/tars/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz",
    ],
)

#########################################################################
# Development Dependencies (optional - comment out on aarch64)
#########################################################################

load("@rules_python//python:pip.bzl", "pip_parse")

pip_parse(
    name = "devtools_deps",
    requirements = "//:requirements-dev.txt",
)

load("@devtools_deps//:requirements.bzl", "install_deps")

install_deps()

The new_local_repository sources can be used, but they are written by default to prefer the file layout that a conventional install of the TensorRT and cuDNN debian packages creates. We have configurations for CentOS as well, but those are enabled via a build flag. If you just unpacked the tarballs somewhere, the archive:BUILD files will be correct, as opposed to the local:BUILD ones.

@bjaeger1
Author

bjaeger1 commented Jul 28, 2023

Hi, the reason why my libtorch new_local_repository entries were not found was that I was using symlinks.

For the cudnn & tensorrt issue I now use, as you suggested, the tar files, which work fine!

However, when building the example, it's reported that bazel has some issues with libtorch(?):
(The libtorch version I use inside the container is definitely the same as the one I am using outside the container.)

ERROR: /home/dev/torch_tensorrt-1.5.0-dev2/opt/torch_tensorrt/modeling/BUILD:10:10: Linking of rule '//modeling:my_custom_model' failed (Exit 1) gcc failed: error executing command /usr/bin/gcc @bazel-out/k8-fastbuild/bin/modeling/my_custom_model-2.params

Use --sandbox_debug to see verbose messages from the sandbox
/usr/bin/ld: warning: libcudart-d0da41ae.so.11.0, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libnvToolsExt-847d78f2.so.1, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libcublas-3b81d170.so.11, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libcublasLt-b6d14a74.so.11, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libgomp-a34b3233.so.1, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libnvrtc-672ee683.so.11.2, needed by bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_dynamic_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcCreateProgram@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_dynamic_next@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcDestroyProgram@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_ordered_static_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_critical_end@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcVersion@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetPTXSize@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_get_num_threads@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetProgramLog@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_sections_end_nowait@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetProgramLogSize@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_ordered_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_critical_name_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_parallel@GOMP_4.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetLoweredName@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_set_num_threads@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_end@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcAddNameExpression@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_get_num_procs@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_single_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_ordered_dynamic_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_end_nowait@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_ordered_dynamic_next@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetErrorString@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_barrier@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so: undefined reference to `nvtxRangePop@libnvToolsExt.so.1'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_sections_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_in_parallel@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_ordered_end@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetCUBINSize@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_critical_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_get_thread_num@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcCompileProgram@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_parallel_start@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_sections_next@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_critical_name_end@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetCUBIN@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `omp_get_max_threads@OMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ccaffe2___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libcaffe2_nvrtc.so: undefined reference to `nvrtcGetPTX@libnvrtc.so.11.2'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so: undefined reference to `nvtxRangePushA@libnvToolsExt.so.1'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cuda.so: undefined reference to `nvtxMarkA@libnvToolsExt.so.1'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_parallel_end@GOMP_1.0'
/usr/bin/ld: bazel-out/k8-fastbuild/bin/_solib_k8/_U@libtorch_Upre_Ucxx11_Uabi_S_S_Ctorch___Uexternal_Slibtorch_Upre_Ucxx11_Uabi_Slib/libtorch_cpu.so: undefined reference to `GOMP_loop_ordered_static_next@GOMP_1.0'

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
