Change run_exports from major.minor to major for CUDA>=10.1 #35

Open
isuruf opened this issue Dec 19, 2019 · 17 comments

@isuruf
Member

isuruf commented Dec 19, 2019

See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-general-new-features

@jakirkham
Member

I'm not sure that is what that means.

Inside the cudatoolkit package, there are things like libcudart.so.10.2. If a library links against that (as cupy does; sorry, Azure changed its UI, so you will need to scroll), then it will be broken.

cc @kkraus14 @mike-wendt

@jjhelmus

The relevant text from the release notes is:

Also in this release the soname of the libraries has been modified to not include the minor toolkit version number. For example, the cuFFT library soname has changed from libcufft.so.10.1 to libcufft.so.10. This is done to facilitate any future library updates that do not include API breaking changes without the need to relink.

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.
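
For anyone who wants to check this locally, here is a minimal sketch (it assumes readelf from binutils is available and that the path points at your local copy of the library) that prints a library's SONAME:

import subprocess

# Hypothetical path; adjust to wherever your cudatoolkit libraries live.
lib = "/usr/local/cuda/lib64/libcufft.so"

# readelf -d dumps the dynamic section; the SONAME entry shows whether the
# library advertises a major-only name (libcufft.so.10) or major.minor.
result = subprocess.run(["readelf", "-d", lib], capture_output=True, text=True, check=True)
for line in result.stdout.splitlines():
    if "SONAME" in line:
        print(line.strip())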

@leofang
Member

leofang commented Dec 20, 2019

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.

This is my experience too. Please don't do this until we can confirm NVIDIA has stabilized its versioning scheme. Think about the nuisance of the 10.1 Update 0/1/2 releases not long ago...

@isuruf
Member Author

isuruf commented Dec 20, 2019

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.

I don't understand. Can you explain?

@jjhelmus

If I install PyTorch and TensorFlow built with cudatoolkit 10.0, then remove cudatoolkit 10.0 and install 10.1, both fail to run test scripts:

# python gpu_test.py
2019-12-20 18:13:14.543411: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-20 18:13:14.571836: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.572530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.845
pciBusID: 0000:01:00.0
2019-12-20 18:13:14.572616: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572666: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572699: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572730: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572761: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572797: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.574955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-20 18:13:14.574967: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-12-20 18:13:14.575226: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-12-20 18:13:14.597185: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-12-20 18:13:14.599534: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5598694d25c0 executing computations on platform Host. Devices:
2019-12-20 18:13:14.599599: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-12-20 18:13:14.599790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-20 18:13:14.599836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
2019-12-20 18:13:14.662175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.662776: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559869535230 executing computations on platform CUDA. Devices:
2019-12-20 18:13:14.662791: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
Traceback (most recent call last):
  File "gpu_test.py", line 5, in <module>
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.
# conda activate pytorch
(pytorch) [root@chi9 io]# python pytorch_test.py
Traceback (most recent call last):
  File "pytorch_test.py", line 3, in <module>
    import torch
  File "/opt/conda/envs/pytorch/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

Both of these packages are trying to dlopen the major.minor libraries. Perhaps it is possible for these projects to switch to using the major-only libraries, but this is not how they are currently set up.
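
A quick way to reproduce the same failure outside of TensorFlow/PyTorch is a small dlopen check; this is only a sketch, and the two library names below are just examples of a major.minor name versus a major-only name:

import ctypes

# Try both a fully versioned name (the kind TF/PyTorch dlopen) and a
# major-only name; whichever the active cudatoolkit does not ship raises OSError.
for name in ("libcudart.so.10.0", "libcufft.so.10"):
    try:
        ctypes.CDLL(name)
        print("loaded", name)
    except OSError as exc:
        print("failed:", exc)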

@isuruf
Member Author

isuruf commented Dec 20, 2019

That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.

@jjhelmus

Unfortunately, I have neither the packages nor a machine configured to test 10.1 vs 10.2 at the moment.

@jakirkham
Member

That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.

@isuruf, as noted above, it doesn't. libcudart includes the major and minor version in its SONAME.

@isuruf
Member Author

isuruf commented Dec 20, 2019

Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.

@jjhelmus

Examining the runtime Docker images from Docker Hub, it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so, and libnvrtc.so.10.2) use major.minor.

These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.
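
As a rough illustration of that grouping, here is a sketch that buckets the shipped libraries by whether their filename suffix is major-only or major.minor (the environment layout is an assumption, and symlinks can land the same library in both buckets):

import glob
import os
import re
import sys

# Scan the active env's lib directory and split lib*.so.* files into the
# two proposed groups based on their version suffix.
major_only, major_minor = [], []
for path in glob.glob(os.path.join(sys.prefix, "lib", "lib*.so.*")):
    name = os.path.basename(path)
    match = re.search(r"\.so\.([0-9.]+)$", name)
    if not match:
        continue
    parts = match.group(1).split(".")
    (major_only if len(parts) == 1 else major_minor).append(name)

print("major-only:", sorted(major_only))
print("major.minor:", sorted(major_minor))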

@jakirkham
Member

Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.

That's an interesting idea. Could be reasonable. Have not personally explored this.

@kkraus14 @mike-wendt, do you have any thoughts on this idea?

@kkraus14
Contributor

I'm not opposed to the idea of turning cudatoolkit into a metapackage and breaking it up. What would be the proposed split of packages?

@jakirkham
Member

IIUC it would be split along the lines of whether libraries include the CUDA minor version (like .1 or .2) in their SONAME or not. Though I suppose it could be more granular than that. Does this sound correct to you, @isuruf, or did you have something else in mind?

@leofang
Member

leofang commented Jan 9, 2020

Sorry for a stupid question: if we split cudatoolkit, what would happen when we check the runtime versions via cudaRuntimeGetVersion and the individual libraries' APIs? Correctly detecting versions at runtime is important, at least for CuPy, afaik.

@jakirkham
Member

I think cudaRuntimeGetVersion comes from the CUDA Runtime API (libcudart). So that would still be tracking the patch version.
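
For reference, a minimal sketch of that query via ctypes (it assumes a libcudart.so name that the loader can resolve):

import ctypes

# cudaRuntimeGetVersion fills in an int encoded as 1000*major + 10*minor,
# e.g. 10020 for CUDA 10.2; it reports the version of the libcudart that was
# actually loaded, not of the other toolkit libraries.
libcudart = ctypes.CDLL("libcudart.so")
version = ctypes.c_int()
status = libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
if status != 0:
    raise RuntimeError("cudaRuntimeGetVersion returned error %d" % status)
print(version.value // 1000, (version.value % 1000) // 10)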

@leofang
Member

leofang commented Jan 15, 2020

Thanks @jakirkham. So it sounds like with the split we could have, say, a 10.2 runtime coexist with a 10.1 cuFFT or cuRAND.

Sorry I wasn't paying attention to @jjhelmus's original comment:

it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so, and libnvrtc.so.10.2) use major.minor.

These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.

So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SOs' names have major.minor, this is a bit worrying...

@isuruf
Member Author

isuruf commented Jan 18, 2020

So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SOs' names have major.minor, this is a bit worrying

Please read @jjhelmus's comment carefully. NVRTC (and CUDART) would be in the group of packages that is pinned to major.minor, and the others would be in the group of packages that are pinned to major only.
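
To make the two groups concrete, here is a sketch of how the resulting constraints would behave; it uses conda's VersionSpec purely for illustration, and the exact pin strings are assumptions, not the final run_exports:

from conda.models.version import VersionSpec

# Hypothetical pins: CUDART/NVRTC stay pinned to major.minor, the other
# libraries only to the major version.
major_minor_pin = VersionSpec(">=10.1,<10.2.0a0")  # e.g. libcudart, libnvrtc
major_pin = VersionSpec(">=10.1,<11.0a0")          # e.g. libcufft, libcublas

for candidate in ("10.1", "10.2"):
    print(candidate,
          "satisfies major.minor pin:", major_minor_pin.match(candidate),
          "satisfies major pin:", major_pin.match(candidate))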
