Change run_exports from major.minor to major for CUDA>=10.1 #35

Open
isuruf opened this issue Dec 19, 2019 · 17 comments

@isuruf
Member

isuruf commented Dec 19, 2019

See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-general-new-features

@jakirkham
Member

I'm not sure that is what that means.

Inside the cudatoolkit package, there are things like libcudart.so.10.2. If a library links against that (as cupy does; sorry, Azure changed its UI, so you will need to scroll), then it will be broken.

cc @kkraus14 @mike-wendt

@jjhelmus

The relevant text from the release notes is:

Also in this release the soname of the libraries has been modified to not include the minor toolkit version number. For example, the cuFFT library soname has changed from libcufft.so.10.1 to libcufft.so.10. This is done to facilitate any future library updates that do not include API breaking changes without the need to relink.

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.
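
For anyone who wants to check this locally, here is a minimal sketch (it assumes readelf from binutils is available and that the path points at your local copy of the library) that prints a library's SONAME:

import subprocess

# Hypothetical path; adjust to wherever your cudatoolkit libraries live.
lib = "/usr/local/cuda/lib64/libcufft.so"

# readelf -d dumps the dynamic section; the SONAME entry shows whether the
# library advertises a major-only name (libcufft.so.10) or major.minor.
result = subprocess.run(["readelf", "-d", lib], capture_output=True, text=True, check=True)
for line in result.stdout.splitlines():
    if "SONAME" in line:
        print(line.strip())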

@leofang
Member

leofang commented Dec 20, 2019

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.

This is my experience too. Please don't do this until we can confirm NVIDIA has stabilized its versioning scheme. Think about the nuisance of the 10.1 Update 0/1/2 releases not long ago...

@isuruf
Member Author

isuruf commented Dec 20, 2019

My experience is that although the soname only includes the major version, relinking is still needed when switching between minor versions.

I don't understand. Can you explain?

@jjhelmus

If I install PyTorch and TensorFlow built with cudatoolkit 10.0, then remove cudatoolkit 10.0 and install 10.1, both fail to run test scripts:

# python gpu_test.py
2019-12-20 18:13:14.543411: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-20 18:13:14.571836: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.572530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.845
pciBusID: 0000:01:00.0
2019-12-20 18:13:14.572616: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572666: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572699: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572730: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572761: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.572797: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2019-12-20 18:13:14.574955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-20 18:13:14.574967: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-12-20 18:13:14.575226: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-12-20 18:13:14.597185: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-12-20 18:13:14.599534: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5598694d25c0 executing computations on platform Host. Devices:
2019-12-20 18:13:14.599599: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-12-20 18:13:14.599790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-20 18:13:14.599836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
2019-12-20 18:13:14.662175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-20 18:13:14.662776: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559869535230 executing computations on platform CUDA. Devices:
2019-12-20 18:13:14.662791: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
Traceback (most recent call last):
  File "gpu_test.py", line 5, in <module>
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/opt/conda/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.
# conda activate pytorch
(pytorch) [root@chi9 io]# python pytorch_test.py
Traceback (most recent call last):
  File "pytorch_test.py", line 3, in <module>
    import torch
  File "/opt/conda/envs/pytorch/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

Both of these packages are trying to dlopen the major.minor libraries. Perhaps it is possible for these projects to switch to using the major-only libraries, but this is not how they are currently set up.
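
A quick way to reproduce the same failure outside of TensorFlow/PyTorch is a small dlopen check; this is only a sketch, and the two library names below are just examples of a major.minor name versus a major-only name:

import ctypes

# Try both a fully versioned name (the kind TF/PyTorch dlopen) and a
# major-only name; whichever the active cudatoolkit does not ship raises OSError.
for name in ("libcudart.so.10.0", "libcufft.so.10"):
    try:
        ctypes.CDLL(name)
        print("loaded", name)
    except OSError as exc:
        print("failed:", exc)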

@isuruf
Member Author

isuruf commented Dec 20, 2019

That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.

@jjhelmus

Unfortunately, I have neither the packages nor a machine configured to test 10.1 vs 10.2 at the moment.

@jakirkham
Member

That doesn't work. This only works from 10.1 onwards, as the link mentions. Try doing the same with 10.1 and 10.2.

@isuruf, as noted above, it doesn't. libcudart includes the major and minor version in its SONAME.

@isuruf
Member Author

isuruf commented Dec 20, 2019

Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.

@jjhelmus

Examining the runtime Docker images from Docker Hub, it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so, and libnvrtc.so.10.2) use major.minor.

These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.
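
As a rough illustration of that grouping, here is a sketch that buckets the shipped libraries by whether their filename suffix is major-only or major.minor (the environment layout is an assumption, and symlinks can land the same library in both buckets):

import glob
import os
import re
import sys

# Scan the active env's lib directory and split lib*.so.* files into the
# two proposed groups based on their version suffix.
major_only, major_minor = [], []
for path in glob.glob(os.path.join(sys.prefix, "lib", "lib*.so.*")):
    name = os.path.basename(path)
    match = re.search(r"\.so\.([0-9.]+)$", name)
    if not match:
        continue
    parts = match.group(1).split(".")
    (major_only if len(parts) == 1 else major_minor).append(name)

print("major-only:", sorted(major_only))
print("major.minor:", sorted(major_minor))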

@jakirkham
Member

Ah, then we should split cudatoolkit into two packages so that CUDA packages built with 10.1 will get the benefits of 10.2 where applicable.

That's an interesting idea. Could be reasonable. Have not personally explored this.

@kkraus14 @mike-wendt, do you have any thoughts on this idea?

@kkraus14
Contributor

I'm not opposed to the idea of turning cudatoolkit into a metapackage and breaking it up. What would be the proposed split of packages?

@jakirkham
Member

IIUC it would be split along the lines of whether libraries include the CUDA minor version (like .1 or .2) in their SONAME or not. Though I suppose it could be more granular than that. Does this sound correct to you, @isuruf, or did you have something else in mind?

@leofang
Member

leofang commented Jan 9, 2020

Sorry for a stupid question: if we split cudatoolkit, what would happen when we check the runtime versions via cudaRuntimeGetVersion and the individual libraries' APIs? Correctly detecting versions at runtime is important, at least for CuPy, afaik.

@jakirkham
Member

I think cudaRuntimeGetVersion comes from the CUDA Runtime API (libcudart). So that would still be tracking the patch version.
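
For reference, a minimal sketch of that query via ctypes (it assumes a libcudart.so name that the loader can resolve):

import ctypes

# cudaRuntimeGetVersion fills in an int encoded as 1000*major + 10*minor,
# e.g. 10020 for CUDA 10.2; it reports the version of the libcudart that was
# actually loaded, not of the other toolkit libraries.
libcudart = ctypes.CDLL("libcudart.so")
version = ctypes.c_int()
status = libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
if status != 0:
    raise RuntimeError("cudaRuntimeGetVersion returned error %d" % status)
print(version.value // 1000, (version.value % 1000) // 10)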

@leofang
Member

leofang commented Jan 15, 2020

Thanks @jakirkham. So it sounds like with the split we could have, say, a 10.2 runtime coexist with a 10.1 cuFFT or cuRAND.

Sorry I wasn't paying attention to @jjhelmus's original comment:

it appears that most of the libraries use a major-only SONAME, but three (libcudart.so, libnvrtc-builtins.so, and libnvrtc.so.10.2) use major.minor.

These two groups could be made into two different conda packages so that the compatible libraries can be installed into a 10.1 environment. The existing cudatoolkit packages will likely need to have a run_constrained entry added to avoid clobbering.

So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SOs' names have major.minor, this is a bit worrying...

@isuruf
Member Author

isuruf commented Jan 18, 2020

So would this work for applications depending on NVRTC, built with 10.1, and running with 10.2? I don't see any guarantee of API/ABI compatibility mentioned in NVRTC's documentation, so if its SOs' names have major.minor, this is a bit worrying

Please read @jjhelmus's comment carefully. NVRTC (and CUDART) would be in the group of packages that is pinned to major.minor, and the others would be in the group of packages that are pinned to major only.
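
To make the two groups concrete, here is a sketch of how the resulting constraints would behave; it uses conda's VersionSpec purely for illustration, and the exact pin strings are assumptions, not the final run_exports:

from conda.models.version import VersionSpec

# Hypothetical pins: CUDART/NVRTC stay pinned to major.minor, the other
# libraries only to the major version.
major_minor_pin = VersionSpec(">=10.1,<10.2.0a0")  # e.g. libcudart, libnvrtc
major_pin = VersionSpec(">=10.1,<11.0a0")          # e.g. libcufft, libcublas

for candidate in ("10.1", "10.2"):
    print(candidate,
          "satisfies major.minor pin:", major_minor_pin.match(candidate),
          "satisfies major pin:", major_pin.match(candidate))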
