adding support for CUDA_VISIBLE_DEVICES which is currently ignored #28
Thanks for raising an issue @stas00! I understand the confusion/frustration regarding `CUDA_VISIBLE_DEVICES` being ignored. From a user perspective, I like your suggestion to add an optional argument to something like `nvmlInit`. If the new API seems too messy, perhaps it would be better to add the optional kwarg to just the `nvmlDeviceGetHandleByIndex` call.
Both of your suggestions sound good to me, @rjzamora. I'm having a hard time deciding which one I prefer. I think the latter, since it doesn't require an intermediary variable, which may then lead to confusion in the code: it becomes easy to mix up which of the two ids to use for other non-pynvml functionality (if one doesn't stack calls). So my preference is the per-call kwarg on `nvmlDeviceGetHandleByIndex`.

If it's helpful, please feel free to re-use the little remapper I wrote :) And thank you so much for making pynvml much more than just a set of bindings!
I might suggest changing `cuda_device=True` to `use_cuda_visible_device=True` to be explicit. Thoughts?
Be careful with this automagic remapping - this only works if the following is set:

```python
import os

# https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```

OR:

```bash
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,1"
```

If this environment variable is not set, CUDA defaults to `CUDA_DEVICE_ORDER="FASTEST_FIRST"`, which means that your fastest GPU is assigned index zero. So if your fastest GPU sits in the device 1 (2nd) slot of your machine, it will actually be at index zero and the code above would fail.

You probably need to check both:

```python
if "CUDA_VISIBLE_DEVICES" in os.environ:
    if "CUDA_DEVICE_ORDER" not in os.environ or os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST":
        # tell the user the remapping won't work in this configuration
        raise ValueError('''We can't remap if you are using os.environ["CUDA_DEVICE_ORDER"] == "FASTEST_FIRST"''')
```

See this article for more details: https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
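As a sanity check, it can help to print each NVML index alongside its PCI bus id and compare that with what `nvidia-smi` shows and with the order CUDA picks under each `CUDA_DEVICE_ORDER` setting. A minimal sketch, assuming pynvml is installed and at least one GPU is present:

```python
import pynvml

pynvml.nvmlInit()
# print the PCI bus id for each NVML index, to compare against the ordering
# that CUDA / pytorch ends up using on this machine
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetPciInfo(handle).busId)
pynvml.nvmlShutdown()
```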
Was either of these implemented? Alternatively, is there a way to get the NVML-compatible index given a `torch.device`?
I understand that this is a python binding to nvml, which ignores `CUDA_VISIBLE_DEVICES`, but perhaps this feature could be respected in pynvml? Otherwise we end up with inconsistent behavior between pytorch (tf?) and pynvml.

For example, on this setup I have card 0 (24GB) and card 1 (8GB). If I run pynvml with `CUDA_VISIBLE_DEVICES="1"` and ask for the handle at index 0, I get the output for card 0, even though I was expecting output for card 1. The expected output is the one I get if I explicitly pass the system ID to nvml. So in the first case I get the wrong card - I get card 0, rather than card 1, indexed as 0th.
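A minimal sketch of the two queries being described, assuming the two-card setup above (device names and memory sizes will differ per machine):

```python
import os
import pynvml

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only card 1 should be "visible" to CUDA

pynvml.nvmlInit()

# pynvml ignores CUDA_VISIBLE_DEVICES: index 0 is still system card 0 (the 24GB card)
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle), pynvml.nvmlDeviceGetMemoryInfo(handle).total)

# the expected card (what CUDA calls device 0) requires the system id to be passed explicitly
handle = pynvml.nvmlDeviceGetHandleByIndex(1)
print(pynvml.nvmlDeviceGetName(handle), pynvml.nvmlDeviceGetMemoryInfo(handle).total)

pynvml.nvmlShutdown()
```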
The conflict with pytorch happens when I call `id = torch.cuda.current_device()`, which returns `0` with `CUDA_VISIBLE_DEVICES="1"`. I hope my explanation of where I have a problem is clear.
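For illustration, a minimal sketch of that pytorch behavior (assuming CUDA and at least two GPUs are available):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before CUDA is initialized

import torch

# pytorch numbers devices relative to CUDA_VISIBLE_DEVICES, so the only
# visible card (system id 1) is reported as device 0
print(torch.cuda.current_device())  # -> 0
```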
pynvml could respect `CUDA_VISIBLE_DEVICES` if the latter is set. Of course, if this is attempted, we can't just change the normal behavior as it'd break people's code. Perhaps, if `pynvml.nvmlInit(respect_cuda_visible_devices=True)` is passed, then it could magically remap the `id` arg to `nvmlDeviceGetHandleByIndex` to the corresponding id in `CUDA_VISIBLE_DEVICES`. So in the first example above, `nvmlDeviceGetHandleByIndex(0)` would actually call it for `id=1`, as it's the 0th relative to `CUDA_VISIBLE_DEVICES="1"`. The `nvmlDeviceGetHandleByIndex()` arg would then become an index with respect to `CUDA_VISIBLE_DEVICES`, e.g. `CUDA_VISIBLE_DEVICES="1,0"` would reverse the ids.

Thank you!
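A tiny sketch of the proposed index semantics; `remap_index` is a made-up name for illustration, not an existing pynvml function:

```python
import os

def remap_index(idx, visible=None):
    """Map an index given relative to CUDA_VISIBLE_DEVICES to the system-wide NVML id."""
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return idx  # nothing to remap
    return int(visible.split(",")[idx])

assert remap_index(0, "1") == 1      # 0th visible device is system id 1
assert remap_index(0, "1,0") == 1    # "1,0" reverses the ids
assert remap_index(1, "1,0") == 0
```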
Meanwhile I added a workaround to my software. If someone needs it as a helper wrapper, you can find it here:

https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33
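Roughly, such a wrapper can look like the sketch below, which combines the remap with the `CUDA_DEVICE_ORDER` check discussed earlier (an illustration, not the exact code in the linked file; the helper name is hypothetical):

```python
import os
import pynvml

def nvml_handle_for_visible_index(idx):
    """Hypothetical helper: return the NVML handle for `idx` interpreted
    relative to CUDA_VISIBLE_DEVICES (requires CUDA_DEVICE_ORDER=PCI_BUS_ID)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        if os.environ.get("CUDA_DEVICE_ORDER") != "PCI_BUS_ID":
            raise ValueError("set CUDA_DEVICE_ORDER=PCI_BUS_ID, otherwise the remap may be wrong")
        idx = int(visible.split(",")[idx])
    return pynvml.nvmlDeviceGetHandleByIndex(idx)

pynvml.nvmlInit()
handle = nvml_handle_for_visible_index(0)
print(pynvml.nvmlDeviceGetMemoryInfo(handle).free)
pynvml.nvmlShutdown()
```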