[BUG] pynvml.smi.DeviceQuery() errors when run in the Intro01 demo notebook due to bad device brand (10) returned #338

Closed
Riebart opened this issue Jul 19, 2021 · 11 comments · Fixed by #339
Labels: bug, enhancement, on deck


Riebart commented Jul 19, 2021

Describe the bug
When using rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8 on either an RTX 3080 Mobile or an A100 MIG partition, running the intro_01 notebook fails because the nvmlDeviceGetBrand() call invoked from the pynvml library returns an unknown device brand (10).

[Screenshot: error raised by the pynvml.smi.DeviceQuery() cell]

Steps/Code to reproduce bug

  • docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8
  • Open the intro_01 notebook.
  • Run the cell that attempts to query the device via pynvml.smi (a sketch of that query follows below).
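
For reference, a minimal sketch of the kind of pynvml.smi query that cell runs (the exact cell contents may differ; the calls below are the documented pynvml.smi entry points):

from pynvml.smi import nvidia_smi

# A full DeviceQuery() walks every device attribute, including the brand
# returned by nvmlDeviceGetBrand(); on the RTX 3080 Mobile and the A100 MIG
# slice that brand value (10) was not recognized, which raised the error.
nvsmi = nvidia_smi.getInstance()
print(nvsmi.DeviceQuery())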

Expected behavior
The .DeviceQuery() succeeds without error.

Environment details (please complete the following information):

  • Environment location: Docker (rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8) using nvidia-docker2 installed from official repo as runtime.
  • Method of RAPIDS libraries install: Docker, as per rapids.ai Get Started page.
@taureandyernv (Contributor) commented:

@Riebart, thanks for this issue. Can you share the output of nvidia-smi for each of the GPUs?

pynvml is an external library, so it may be good to send the details of this issue to gpuopenanalytics, who own pynvml: https://github.com/gpuopenanalytics/pynvml


Riebart commented Jul 19, 2021

This problem is already reported in two issues on pynvml: here and here.

Output of nvidia-smi on the A100:

# nvidia-smi
Mon Jul 19 20:02:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:0B:00.0 Off |                  Off |
| N/A   30C    P0    34W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0   11   0   0  |    244MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Output on the RTX 3080 Mobile: I'll update this comment later to include it.

@taureandyernv (Contributor) commented:

Awesome, I'll track this. As this is a pynvml issue, would you be able to remove that cell from your workflow, or would you prefer that we comment it out or replace it with the standard !nvidia-smi? For the intro notebooks, this cell is mostly decorative.


Riebart commented Jul 19, 2021

For our use case (training and hands-on workshops), we can comment out or remove that cell; we're already doing other automated transformations on the notebooks to change sample counts to match the size of the MIG slices we're using (since we usually don't have 16 GB of VRAM per participant).

It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

@rjzamora commented:

> It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

Sorry for the late response here, but I'd like to offer a word of warning that the pynvml.smi module is probably not something you should be using. The NVML bindings are effectively kept up to date by the official NVML team, but the smi bindings are not. There is currently only one person who maintains pynvml.smi, and not on a regular basis. Therefore, my personal vote is actually to deprecate the module from this repository. You should be able to use NVML directly to get the information/metrics you need anyway. If enough community members want to maintain an smi Python API, then it may make more sense to do this in a separate project.
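
For example, the information those notebook cells need can be pulled from the NVML bindings directly. A minimal sketch (assuming a non-MIG device at index 0; all calls are the standard pynvml NVML wrappers):

import pynvml

pynvml.nvmlInit()
# Physical GPU 0; a MIG slice needs a MIG device handle instead (see later in the thread).
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(name, "total:", mem.total, "free:", mem.free, "used:", mem.used)
pynvml.nvmlShutdown()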


Riebart commented Jul 27, 2021

@rjzamora That makes sense to me.

It's important to note that the dask-cuda portions of Intro01 are also broken with MIG partitions, again because of pynvml-related issues. There's an open PR for that waiting for review, but it seems like MIG is breaking a lot of downstream projects due to what amount to namespace and scoping changes.

@taureandyernv (Contributor) commented:

Thanks @rjzamora for the information. I'll refactor and update the affected notebooks. When I get the solution PR'd, I'll reply back and close this issue. Thanks again @Riebart!

@taureandyernv (Contributor) commented:

@Riebart, can you test my PR for the Intro notebooks (#339) on your GPU? I don't have either GPU to test with.


Riebart commented Aug 12, 2021

> @Riebart, can you test my PR for the Intro notebooks (#339) on your GPU? I don't have either GPU to test with.

@taureandyernv Still no joy, but a different error this time. This is related to issues we've observed in other areas, such as dask-cuda (ref and what I believe to be the related PR).

import pynvml
pynvml.nvmlInit()

gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
print("your GPU has", gpu_mem, "GB")

---------------------------------------------------------------------------

NVMLError_NoPermission                    Traceback (most recent call last)
<ipython-input-5-2daaad25a9ae> in <module>
      2 pynvml.nvmlInit()
      3 
----> 4 gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
      5 print("your GPU has", gpu_mem, "GB")

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
   1982     fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
   1983     ret = fn(handle, byref(c_memory))
-> 1984     _nvmlCheckReturn(ret)
   1985     return c_memory
   1986 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_NoPermission: Insufficient Permissions


taureandyernv commented Aug 12, 2021

Aww man. Okay, I'll check that out on Monday, unless it's P0. Does this NVMLError_NoPermission occur with the Ampere Mobile GPU or the MIG partitions? Can you try it with the Mobile if you haven't? I'll check with @pentschev about some of the subtleties of that PR that I might be missing.

Can you send me your environment details?

@pentschev commented:

Just to clarify, rapidsai/dask-cuda#674 has been merged and MIG devices should now be supported by Dask-CUDA. However, it's still not the most user-friendly interface: the only way to enable MIG devices at this time is to specify each MIG instance in CUDA_VISIBLE_DEVICES via its UUID, similar to what's shown in the GPU Instances doc. The UUIDs can be queried with nvidia-smi -L.

With MIG, you can't use pynvml.nvmlDeviceGetHandleByIndex(0); this is the cause of the NVMLError_NoPermission. The easiest way is to get the handle by its UUID instead, with pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-...")). From that handle, you can then query things like memory, just as you would have done with a physical GPU device handle. You can also get the handle by specifying the device and MIG instance indices with nvmlDeviceGetMigDeviceHandleByIndex(device=0, index=0) (get the handle for MIG instance index=0 from physical GPU device=0).
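
For anyone following along, a minimal sketch of that MIG-aware lookup (the "MIG-GPU-..." UUID is a placeholder for one reported by nvidia-smi -L; note that the parent argument to nvmlDeviceGetMigDeviceHandleByIndex is passed as a device handle here):

import pynvml

pynvml.nvmlInit()

# Option 1: look the MIG instance up by its UUID (substitute a real UUID
# reported by `nvidia-smi -L`):
#   mig_handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-..."))

# Option 2: MIG instance index 0 on physical GPU 0.
parent = pynvml.nvmlDeviceGetHandleByIndex(0)
mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, 0)

# With a MIG handle, the memory query from the notebook works as before.
mem = pynvml.nvmlDeviceGetMemoryInfo(mig_handle)
print("your MIG slice has", round(mem.total / 1024**3), "GB")
pynvml.nvmlShutdown()

The same UUIDs are what you would pass to CUDA_VISIBLE_DEVICES for the Dask-CUDA workflow described above.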
