Fix multi gpu map example #6415
Conversation
The documentation is not available anymore as the PR was closed or merged.
docs/source/process.mdx
Outdated
>>> def gpu_computation(example, rank):
...     os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % torch.cuda.device_count())
...     torch.cuda.set_device(rank % torch.cuda.device_count())
...     # Your big GPU call goes here
I still would like to see a concrete example here instead of "your big GPU call goes here", because I tried using an NLLB model with 2 GPUs to translate sentences of the datacomp dataset in parallel and it was unclear to me how I had to do it. Should I use nn.DataParallel? Should I use .to("cuda:0") and .to("cuda:1")?
I remember that the rank was always set to 0, so all work was done on the first GPU
> I remember that the rank was always set to 0, so all work was done on the first GPU
This happens only if you set num_proc=1, but for multiprocessing you get multiple ranks (one per process)
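For illustration, here is a minimal sketch of that behaviour (the toy dataset is made up, not from this PR): with with_rank=True each worker process receives its own rank, so ranks other than 0 only appear once num_proc > 1.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c", "d"]})

def tag_rank(example, rank):
    # with num_proc=1 rank is always 0; with num_proc=2 the workers get ranks 0 and 1
    example["rank"] = rank
    return example

ds = ds.map(tag_rank, with_rank=True, num_proc=2)
print(ds["rank"])  # e.g. [0, 0, 1, 1]
```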
I added the model in the example - no need to use nn.DataParallel. You just need to send the model to every GPU.
Feel free to test that the code works as expected for you!
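To make that concrete, here is a rough sketch of the idea inside the mapped function (the Linear layer is only a stand-in for a real model, and this assumes one worker per GPU): each worker selects the device matching its rank and runs the model there, without nn.DataParallel.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the real model

def gpu_computation(example, rank):
    # one worker per GPU: rank 0 -> cuda:0, rank 1 -> cuda:1, ...
    device = f"cuda:{rank % torch.cuda.device_count()}"
    model.to(device)  # each worker process moves its own copy of the model
    x = torch.randn(1, 4, device=device)  # stand-in for the real inputs
    example["score"] = model(x).sum().item()
    return example
```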
docs/source/process.mdx
Outdated
>>> for i in range(torch.cuda.device_count()):  # send model to every GPU
...     model.to(torch.cuda.device(i))
This gives me the following error:
Traceback (most recent call last):
File "/home/niels/python_projects/datacomp/datasets_multi_gpu.py", line 14, in <module>
model.to(torch.cuda.device(i))
File "/home/niels/anaconda3/envs/datacomp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 968, in to
device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, **kwargs)
TypeError: to() received an invalid combination of arguments - got (device), but expected one of:
* (torch.device device, torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
* (torch.dtype dtype, bool non_blocking, bool copy, *, torch.memory_format memory_format)
* (Tensor tensor, bool non_blocking, bool copy, *, torch.memory_format memory_format)
I used this instead:
for i in range(torch.cuda.device_count()):  # send model to every GPU
    model.to(f"cuda:{i}")
fixed it, thanks
Merging this one, but lmk if you have more comments for subsequent improvements @NielsRogge
This is a little hard to follow — where is the documentation currently? I am trying to follow from snippets, here is what I have based on your convo in this thread:
but I'm getting device errors (data is on device 3, but it thinks the model is on device 0, despite setting it). Is this correct? What version of Torch are you using for this?
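For context, the class of error being described can be reproduced with a toy module on a machine with at least two GPUs (this is only an illustration, not the poster's code): the model and its inputs have to live on the same device.

```python
import torch

model = torch.nn.Linear(4, 2).to("cuda:0")
x = torch.randn(1, 4).to("cuda:1")  # input on a different GPU than the model

try:
    model(x)
except RuntimeError as e:
    print(e)  # "Expected all tensors to be on the same device, ..."
```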
Anyway, this didn't work for me:
but substituting it for:
(btw, versions: )
Yeah, for me this issue isn't resolved yet; we need a better code example.
Hi @alex2awesome, could you open a PR with your suggestion to improve this code snippet?
I'm happy to when I get it fully working, but I feel like there are some fundamentals that I'm not fully understanding...

I've set it up twice now, for 2 GPU-processing pipelines. In one pipeline, my memory usage is fine, it delivers me a huge speedup, and everything is great. In the second pipeline, I keep getting OOM errors.

There is a discussion here: pytorch/pytorch#44156 about CUDA memory leaks in multiprocessing setups, and I haven't had the time to fully read the source code to see whether that applies here.

So, I haven't tested enough to see what the issue is. If I feel comfortable over the next several days to generate a slimmed-down example that will generalize to real-world cases such as those I'm working with now, then I will contribute it.
@lhoestq do you know how the multiprocessing in .map() is implemented? If so, there are lots of points around memory usage, here:

EDIT: ahh I see it is using python's native multiprocessing library: https://github.com/huggingface/datasets/blob/2.15.0/src/datasets/arrow_dataset.py#L3172-L3189
After some more research and playing around, I can't pinpoint the source of my CUDA memory leak, nor can I determine with confidence what works and what doesn't in this setup. I'm not really an expert on multiprocessing in general, but my gut is that the current set-up isn't ideal for multiprocessing, and I'm not sure I would recommend that users do this. Kinda unfortunate, because I don't see any great tools for distributed inference out there, and in theory this could be one. Are either of you more experienced in this?
Not sure about your GPU's OOM :/ Still, I opened a PR with your suggestion here: #6550
I still only get rank 0... Here is my code: https://pastebin.com/c6du8jaM from this ^ I just import one function:
And here is the traceback:
Also this code from your docs is not valid (source: https://huggingface.co/docs/datasets/main/en/process#multiprocessing):
For me, this sends the model only to the second GPU.
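A short sketch of why that happens (using a toy nn.Linear in place of the real model): nn.Module.to() moves the module's parameters in place rather than keeping one copy per device, so after the loop the single set of weights lives only on the last GPU. To actually use every GPU, each map() worker has to move its own copy of the model to its own device, as in the example further below.

```python
import torch

model = torch.nn.Linear(4, 2)
for i in range(torch.cuda.device_count()):  # this does NOT create one copy per GPU
    model.to(f"cuda:{i}")                   # .to() moves the same parameters each time

print(next(model.parameters()).device)      # e.g. cuda:1 on a 2-GPU machine
```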
Could you please provide a working example of multi-GPU mapping? Not just an example in the docs, but a real working example, starting from all the imports and loading the datasets and models.
@alex2awesome I have the same issue with CUDA OOM. It should not be happening, since 2 different GPUs should be handling different loads. But in fact something wrong is happening.
I haven't experimented much with the multi-GPU code documentation. Can you try using the code example at #6550 instead? It would be super helpful if you could confirm that it works on your side. Though if you have some fix/improvement ideas, feel free to open a PR!
@lhoestq the mapping does not start at all in this case. Here is the updated code: https://pastebin.com/Kn9aGfZr
@lhoestq with this code: https://pastebin.com/muDm78kp
Also, when trying to download my dataset, there were no issues from one machine, but from another:
I can't download my dataset at all...
Hmm this is not good, do you know a way to make it work? Basically
I can confirm that PR #6550 works. All GPUs are at full throttle. You have to manually move the model to all GPUs.
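For anyone landing here later, here is a self-contained sketch of that pattern, modelled on the example in #6550 (the checkpoint, dataset, and column names are placeholders, not from this thread): the model is moved to the worker's GPU inside the mapped function, the spawn start method is used, and everything runs under an if __name__ == "__main__" guard.

```python
from multiprocess import set_start_method  # `datasets` uses multiprocess, not multiprocessing
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def gpu_computation(batch, rank):
    # Each worker pins itself to one GPU based on its rank.
    device = f"cuda:{rank % torch.cuda.device_count()}"
    model.to(device)
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(dim=-1)
    batch["pred"] = preds.cpu().tolist()
    return batch

if __name__ == "__main__":
    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    dataset = load_dataset("imdb", split="test")  # placeholder dataset with a "text" column

    set_start_method("spawn")  # CUDA does not work with the default fork start method

    dataset = dataset.map(
        gpu_computation,
        batched=True,
        batch_size=32,
        with_rank=True,
        num_proc=torch.cuda.device_count(),  # one process per GPU
    )
```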
I wrote a blog post with a complete example by compiling information from several PRs and issues here. Hope it can help. Let me know how it works.
torch.cuda.set_device instead of CUDA_VISIBLE_DEVICES
if __name__ == "__main__"
fix #6186