adding support for CUDA_VISIBLE_DEVICES which is currently ignored #28
Thanks for raising an issue @stas00! I understand the confusion/frustration regarding `CUDA_VISIBLE_DEVICES` being ignored. From a user perspective, I like your suggestion to add an optional argument to something like `nvmlInit`. If the new API seems too messy, perhaps it would be better to add the optional kwarg to just the `nvmlDeviceGetHandleByIndex` call.
Both of your suggestions sound good to me, @rjzamora. I'm having a hard time deciding which one I prefer. I think the latter, since it doesn't require an intermediary variable, which may then lead to confusion in the code: it becomes easy to mix up which of the two ids to use for other non-pynvml functionality (if one doesn't stack calls). So my preference is the per-call kwarg on `nvmlDeviceGetHandleByIndex`.

If it's helpful, please feel free to re-use the little remapper I wrote :) And thank you so much for making pynvml much more than just a set of bindings!
I might suggest changing `cuda_device=True` to `use_cuda_visible_device=True` to be explicit. Thoughts?
Be careful with this automagic remapping - this only works if the following is set:

```python
import os

# https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```

OR:

```bash
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,1"
```

If this environment variable is not set, CUDA defaults to `CUDA_DEVICE_ORDER="FASTEST_FIRST"`, which means that your fastest GPU is assigned index zero. So if your fastest GPU sits in the device 1 (2nd) slot of your machine, it will actually be at index zero and the code above would fail.

You probably need to check both:

```python
if "CUDA_VISIBLE_DEVICES" in os.environ:
    if "CUDA_DEVICE_ORDER" not in os.environ or os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST":
        # tell the user the remapping won't work in this configuration
        raise ValueError('''We can't remap if you are using os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST"''')
```

See this article for more details: https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
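As a sanity check, it can help to print each NVML index alongside its PCI bus id and compare that with what `nvidia-smi` shows and with the order CUDA picks under each `CUDA_DEVICE_ORDER` setting. A minimal sketch, assuming pynvml is installed and at least one GPU is present:

```python
import pynvml

pynvml.nvmlInit()
# print the PCI bus id for each NVML index, to compare against the ordering
# that CUDA / pytorch ends up using on this machine
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetPciInfo(handle).busId)
pynvml.nvmlShutdown()
```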
Was either of these implemented? Alternatively, is there a way to get the NVML-compatible index given a `torch.device`?
I understand that this is a python binding to nvml, which ignores `CUDA_VISIBLE_DEVICES`, but perhaps this feature could be respected in pynvml? Otherwise we end up with inconsistent behavior between pytorch (tf?) and pynvml.

For example, on this setup I have card 0 (24GB) and card 1 (8GB). If I run pynvml with `CUDA_VISIBLE_DEVICES="1"` and ask for the handle at index 0, I get the output for card 0, even though I was expecting output for card 1. The expected output is the one I get if I explicitly pass the system ID to nvml. So in the first case I get the wrong card - I get card 0, rather than card 1, indexed as 0th.
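A minimal sketch of the two queries being described, assuming the two-card setup above (device names and memory sizes will differ per machine):

```python
import os
import pynvml

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only card 1 should be "visible" to CUDA

pynvml.nvmlInit()

# pynvml ignores CUDA_VISIBLE_DEVICES: index 0 is still system card 0 (the 24GB card)
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle), pynvml.nvmlDeviceGetMemoryInfo(handle).total)

# the expected card (what CUDA calls device 0) requires the system id to be passed explicitly
handle = pynvml.nvmlDeviceGetHandleByIndex(1)
print(pynvml.nvmlDeviceGetName(handle), pynvml.nvmlDeviceGetMemoryInfo(handle).total)

pynvml.nvmlShutdown()
```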
The conflict with pytorch happens when I call `id = torch.cuda.current_device()`, which returns `0` with `CUDA_VISIBLE_DEVICES="1"`. I hope my explanation of where I have a problem is clear.
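For illustration, a minimal sketch of that pytorch behavior (assuming CUDA and at least two GPUs are available):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before CUDA is initialized

import torch

# pytorch numbers devices relative to CUDA_VISIBLE_DEVICES, so the only
# visible card (system id 1) is reported as device 0
print(torch.cuda.current_device())  # -> 0
```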
pynvml could respect `CUDA_VISIBLE_DEVICES` if the latter is set. Of course, if this is attempted, we can't just change the normal behavior as it'd break people's code. Perhaps, if `pynvml.nvmlInit(respect_cuda_visible_devices=True)` is passed, then it could magically remap the `id` arg to `nvmlDeviceGetHandleByIndex` to the corresponding id in `CUDA_VISIBLE_DEVICES`. So in the first example above, `nvmlDeviceGetHandleByIndex(0)` would actually call it for `id=1`, as it's the 0th relative to `CUDA_VISIBLE_DEVICES="1"`. The `nvmlDeviceGetHandleByIndex()` arg would then become an index with respect to `CUDA_VISIBLE_DEVICES`, e.g. `CUDA_VISIBLE_DEVICES="1,0"` would reverse the ids.

Thank you!
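A tiny sketch of the proposed index semantics; `remap_index` is a made-up name for illustration, not an existing pynvml function:

```python
import os

def remap_index(idx, visible=None):
    """Map an index given relative to CUDA_VISIBLE_DEVICES to the system-wide NVML id."""
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return idx  # nothing to remap
    return int(visible.split(",")[idx])

assert remap_index(0, "1") == 1      # 0th visible device is system id 1
assert remap_index(0, "1,0") == 1    # "1,0" reverses the ids
assert remap_index(1, "1,0") == 0
```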
Meanwhile I added a workaround to my software. If someone needs it as a helper wrapper, you can find it here:

https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33
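Roughly, such a wrapper can look like the sketch below, which combines the remap with the `CUDA_DEVICE_ORDER` check discussed earlier (an illustration, not the exact code in the linked file; the helper name is hypothetical):

```python
import os
import pynvml

def nvml_handle_for_visible_index(idx):
    """Hypothetical helper: return the NVML handle for `idx` interpreted
    relative to CUDA_VISIBLE_DEVICES (requires CUDA_DEVICE_ORDER=PCI_BUS_ID)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        if os.environ.get("CUDA_DEVICE_ORDER") != "PCI_BUS_ID":
            raise ValueError("set CUDA_DEVICE_ORDER=PCI_BUS_ID, otherwise the remap may be wrong")
        idx = int(visible.split(",")[idx])
    return pynvml.nvmlDeviceGetHandleByIndex(idx)

pynvml.nvmlInit()
handle = nvml_handle_for_visible_index(0)
print(pynvml.nvmlDeviceGetMemoryInfo(handle).free)
pynvml.nvmlShutdown()
```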