[BUG] pynvml.smi.DeviceQuery() errors when run in the Intro01 demo notebook due to bad device brand (10) returned #338

Closed
Riebart opened this issue Jul 19, 2021 · 11 comments · Fixed by #339
Labels: bug, enhancement, on deck


Riebart commented Jul 19, 2021

Describe the bug
When using rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8 on either an RTX 3080 Mobile or an A100 MIG partition, running the intro_01 notebook fails because the nvmlDeviceGetBrand() call invoked from the pynvml library returns an unknown device brand (10).

[Screenshot: error raised by the pynvml.smi.DeviceQuery() cell]

Steps/Code to reproduce bug

  • docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8
  • Open the intro_01 notebook.
  • Run the cell that attempts to query the device via pynvml.smi (a sketch of that query follows below).
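
For reference, a minimal sketch of the kind of pynvml.smi query that cell runs (the exact cell contents may differ; the calls below are the documented pynvml.smi entry points):

from pynvml.smi import nvidia_smi

# A full DeviceQuery() walks every device attribute, including the brand
# returned by nvmlDeviceGetBrand(); on the RTX 3080 Mobile and the A100 MIG
# slice that brand value (10) was not recognized, which raised the error.
nvsmi = nvidia_smi.getInstance()
print(nvsmi.DeviceQuery())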

Expected behavior
The .DeviceQuery() succeeds without error.

Environment details (please complete the following information):

  • Environment location: Docker (rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8) using nvidia-docker2 installed from official repo as runtime.
  • Method of RAPIDS libraries install: Docker, as per rapids.ai Get Started page.
@taureandyernv (Contributor) commented:

@Riebart, thanks for this issue. Can you share the output of nvidia-smi for each of the GPUs?

pynvml is an external library, so it may be good to send the details of this issue to gpuopenanalytics, who own pynvml: https://github.com/gpuopenanalytics/pynvml


Riebart commented Jul 19, 2021

This problem is already reported in two issues on pynvml: here and here.

Output of nvidia-smi on the A100:

# nvidia-smi
Mon Jul 19 20:02:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:0B:00.0 Off |                  Off |
| N/A   30C    P0    34W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0   11   0   0  |    244MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Output on the RTX 3080 Mobile: I'll update this comment later to include it.

@taureandyernv (Contributor) commented:

Awesome, I'll track this. As this is a pynvml issue, would you be able to remove that cell from your workflow, or would you prefer that we comment it out or replace it with the standard !nvidia-smi? For the intro notebooks, this cell is mostly decorative.


Riebart commented Jul 19, 2021

For our use case (training and hands-on workshops), we can comment out or remove that cell; we're already doing other automated transformations on the notebooks to change sample counts to match the size of the MIG slices we're using (since we usually don't have 16 GB of VRAM per participant).

It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

@rjzamora commented:

> It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

Sorry for the late response here, but I'd like to offer a word of warning that the pynvml.smi module is probably not something you should be using. The NVML bindings are effectively kept up to date by the official NVML team, but the smi bindings are not. There is currently only one person who maintains pynvml.smi, and not on a regular basis. Therefore, my personal vote is actually to deprecate the module from this repository. You should be able to use NVML directly to get the information/metrics you need anyway. If enough community members want to maintain an smi Python API, then it may make more sense to do this in a separate project.
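
For example, the information those notebook cells need can be pulled from the NVML bindings directly. A minimal sketch (assuming a non-MIG device at index 0; all calls are the standard pynvml NVML wrappers):

import pynvml

pynvml.nvmlInit()
# Physical GPU 0; a MIG slice needs a MIG device handle instead (see later in the thread).
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(name, "total:", mem.total, "free:", mem.free, "used:", mem.used)
pynvml.nvmlShutdown()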


Riebart commented Jul 27, 2021

@rjzamora That makes sense to me.

It's important to note that the dask-cuda portions of Intro01 are also broken with MIG partitions, again because of pynvml-related issues. There's an open PR for that waiting for review, but it seems like MIG is breaking a lot of downstream projects due to what amount to namespace and scoping changes.

@taureandyernv (Contributor) commented:

Thanks @rjzamora for the information. I'll refactor and update the affected notebooks. When I get the solution PR'd, I'll reply back and close this issue. Thanks again @Riebart!

@taureandyernv (Contributor) commented:

@Riebart, can you test my PR for the Intro notebooks (#339) on your GPU? I don't have either GPU to test with.


Riebart commented Aug 12, 2021

> @Riebart, can you test my PR for the Intro notebooks (#339) on your GPU? I don't have either GPU to test with.

@taureandyernv Still no joy, but a different error this time. This is related to issues we've observed in other areas, such as dask-cuda (ref and what I believe to be the related PR).

import pynvml
pynvml.nvmlInit()

gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
print("your GPU has", gpu_mem, "GB")

---------------------------------------------------------------------------

NVMLError_NoPermission                    Traceback (most recent call last)
<ipython-input-5-2daaad25a9ae> in <module>
      2 pynvml.nvmlInit()
      3 
----> 4 gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
      5 print("your GPU has", gpu_mem, "GB")

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
   1982     fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
   1983     ret = fn(handle, byref(c_memory))
-> 1984     _nvmlCheckReturn(ret)
   1985     return c_memory
   1986 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_NoPermission: Insufficient Permissions


taureandyernv commented Aug 12, 2021

Aww man. Okay, I'll check that out on Monday, unless it's P0. Does this NVMLError_NoPermission occur with the Ampere Mobile GPU or the MIG partitions? Can you try it with the Mobile if you haven't? I'll check with @pentschev about some of the subtleties of that PR that I might be missing.

Can you send me your environment details?

@pentschev commented:

Just to clarify, rapidsai/dask-cuda#674 has been merged and MIG devices should now be supported by Dask-CUDA. However, it's still not the most user-friendly interface: the only way to enable MIG devices at this time is to specify each MIG instance in CUDA_VISIBLE_DEVICES via its UUID, similar to what's shown in the GPU Instances doc. The UUIDs can be queried with nvidia-smi -L.

With MIG, you can't use pynvml.nvmlDeviceGetHandleByIndex(0); this is the cause of the NVMLError_NoPermission. The easiest way is to get the handle by its UUID instead, with pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-...")). From that handle, you can then query things like memory, just as you would have done with a physical GPU device handle. You can also get the handle by specifying the device and MIG instance indices with nvmlDeviceGetMigDeviceHandleByIndex(device=0, index=0) (get the handle for MIG instance index=0 from physical GPU device=0).
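
For anyone following along, a minimal sketch of that MIG-aware lookup (the "MIG-GPU-..." UUID is a placeholder for one reported by nvidia-smi -L; note that the parent argument to nvmlDeviceGetMigDeviceHandleByIndex is passed as a device handle here):

import pynvml

pynvml.nvmlInit()

# Option 1: look the MIG instance up by its UUID (substitute a real UUID
# reported by `nvidia-smi -L`):
#   mig_handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-..."))

# Option 2: MIG instance index 0 on physical GPU 0.
parent = pynvml.nvmlDeviceGetHandleByIndex(0)
mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, 0)

# With a MIG handle, the memory query from the notebook works as before.
mem = pynvml.nvmlDeviceGetMemoryInfo(mig_handle)
print("your MIG slice has", round(mem.total / 1024**3), "GB")
pynvml.nvmlShutdown()

The same UUIDs are what you would pass to CUDA_VISIBLE_DEVICES for the Dask-CUDA workflow described above.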
