Feature request: add code example of multi-GPU processing #6186
Comments
That'd be a great idea! @mariosasko or @lhoestq, would it be possible to fix the code snippet, or do you have another suggested way of doing this?
Indeed. Not sure about the imbalanced GPU usage though, but maybe you can try using `nn.DataParallel`. In this case you wouldn't need a multiprocessed map, no? Since nn.DataParallel would take care of parallelism.
Adding this Tweet for reference: https://twitter.com/jxmnop/status/1716834517909119019.
I think the issue is that we set `CUDA_VISIBLE_DEVICES` after `torch` has already been imported. We should use the rank to place the model on the right device instead.
Yes. But how can one load a model onto 2 GPUs simultaneously without something like accelerate?
In case someone else runs into this issue: I wrote a blog post with a complete working example, compiled from several PRs and issues, here. This issue cost me a few hours; I hope the post can save you that time until the official documentation gets fixed.
Thanks! I updated the docs in #6550
hey @forrestbao, I was struggling with the same issue for weeks too, so I checked out your blog. Great work! My question: on a multi-GPU setup where each GPU has over 40 GB of VRAM, after initializing the translation model, which takes barely 1-2 GB of VRAM, the rest sits idle. How could I create multiple instances of the same model on each GPU to maximize FLOPs?
You can use a single instance per GPU and increase the batch size until the VRAM is full.
@lhoestq I tried that, but noticed that beyond a certain batch_size, a larger batch_size makes the overall process much slower than a smaller one.
Hi @lhoestq, could you help with my two questions:
which seems to be caused by the Python version. I am using Python 3.10.2.
Hi!
It's good practice when doing multiprocessing in Python. Depending on the multiprocessing method and your Python version, Python could re-run the code in your main.py in subprocesses that you don't want re-run (e.g. recursively spawning processes and failing). Some multiprocessing methods don't re-run main.py, though, and that appears to be your case ;)
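As a stdlib-only illustration of that point (a minimal sketch, not code from this thread): with the "spawn" start method every worker re-imports the main module, so any unguarded top-level code would run again in each subprocess, and the `__main__` guard prevents recursive spawning.

```python
# With the "spawn" start method, each worker re-imports this module, so any
# unguarded top-level code would run again in every subprocess. Putting the
# pool creation under the __main__ guard prevents recursive spawning.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        results = pool.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
```

With "fork" (the Linux default) the module is not re-imported, which is why an unguarded script can appear to work there and then fail elsewhere.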
Yes,
Thanks @lhoestq for the explanation. Is it okay if we use
Not sure whether
I'm running the code example of multi-GPU processing on a Linux 8x A100 instance. The entire Python script's run time is 30 seconds faster if I add one line to set the torch number of threads immediately after the
FWIW: my instance has these versions.
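For anyone trying the same tweak: the one-liner being discussed is presumably `torch.set_num_threads`. A hedged sketch (the value 1 is only a common starting point when many map workers run in parallel, not a recommendation from this thread):

```python
# Capping PyTorch's intra-op CPU threads. With num_proc map workers, letting
# every worker use all cores oversubscribes the CPU; one thread per worker is
# a common starting point. Call this before the workers are spawned.
import torch

torch.set_num_threads(1)
print(torch.get_num_threads())  # 1
```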
@lhoestq Thanks for the updated GPU multiprocessing documentation! When I tried to add
Do you have any thoughts?
Hmm, first time I see this, and it's even more surprising given there is no generator in
Feature request
It would be great to add a code example of how to do multi-GPU processing with 🤗 Datasets in the documentation. cc @stevhliu
Currently the docs have a small section on this saying "your big GPU call goes here"; however, it didn't work for me out of the box.
Let's say you have a PyTorch model that can do translation, and you have multiple GPUs. In that case, you'd like to duplicate the model on each GPU, each processing (translating) a chunk of the data in parallel.
Here's how I tried to do that:
I've personally tried running this script on a machine with 2 A100 GPUs.
Error 1
Running the code snippet above from the terminal (python script.py) resulted in the following error:
Error 2
Then, based on this Stack Overflow answer, I put the
set_start_method("spawn")
call in a try/except block. This resulted in the following error:

So then I put the last line under an
if __name__ == '__main__':
block. Then the code snippet seemed to work, but it appeared to leverage only a single GPU (based on monitoring nvidia-smi).

Both GPUs should have roughly equal usage, but I've always noticed that the last GPU has way more usage than the other ones. This made me think that
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % torch.cuda.device_count())
might not work inside a Python script, especially if done after importing PyTorch?

Motivation
Would be great to clarify how to do multi-GPU data processing.
Your contribution
If my code snippet can be fixed, I can contribute it to the docs :)