How to run veros with multi-GPU #417
Comments
That usage is correct, but you also need to tell each process which GPU to use. By default every process picks the first GPU, so in your case both processes are running on the same device. You can use something like
import os
from mpi4py import MPI
os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI.COMM_WORLD.Get_rank())
near the top of your setup file (before importing
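Using the global rank as the device index works when all processes share one node and there are at least as many GPUs as ranks. A minimal sketch of a slightly more general mapping follows. It assumes ranks are placed on nodes in contiguous blocks, so that the rank modulo the GPU count per node gives the index within the node; the helper names `visible_device_for_rank` and `pin_process_to_gpu` are hypothetical and not part of Veros.

```python
import os


def visible_device_for_rank(rank: int, gpus_per_node: int) -> str:
    """Return the CUDA_VISIBLE_DEVICES value for a given MPI rank.

    Assumption: ranks are assigned to nodes in contiguous blocks, so
    rank % gpus_per_node is the rank's index within its node.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return str(rank % gpus_per_node)


def pin_process_to_gpu(rank: int, gpus_per_node: int, env=None) -> str:
    """Write the device index into the environment and return it.

    `env` defaults to os.environ; passing a plain dict is handy for testing.
    """
    if env is None:
        env = os.environ
    device = visible_device_for_rank(rank, gpus_per_node)
    env["CUDA_VISIBLE_DEVICES"] = device
    return device
```

In a setup file this would be called as `pin_process_to_gpu(MPI.COMM_WORLD.Get_rank(), gpus_per_node=4)` right after importing `mpi4py`, before any library that initializes CUDA is imported. On a single node with no more ranks than GPUs it gives the same result as the snippet above.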
I have been trying to run Veros on multiple GPUs (4 GPUs) with the modification suggested above, but it stops at a random iteration and reports 'solution diverged'. Also, the GPU memory usage is very low. Could you please suggest a solution? The log file is attached below.
Looks like your solution diverged. Multi-GPU runs use a different linear solver than other configurations by default, so you might see divergence for runs that are stable in other settings. Please revise your time steps and/or solver settings. You can also use different PETSc settings, as in this setup: https://github.com/dionhaefner/veros-01deg/blob/4f096b11206fecfac003047a234fcf25f92291a0/global_01deg/global_01deg.py#L14-L16
Note that multi-GPU runs for very high-resolution setups are not well explored, so some manual tweaking is to be expected. Unless the simulation doesn't fit on a single A100, I would stay away from multi-GPU runs (it probably won't even be faster than single-GPU in this case).
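One common way to pass solver settings to PETSc is the `PETSC_OPTIONS` environment variable, which PETSc reads when it initializes. The sketch below only shows how such a string could be assembled and set from a setup file. The helper names are hypothetical, the option values are illustrative and not taken from the linked setup, and whether they help a particular run would need to be checked experimentally.

```python
import os


def format_petsc_options(options: dict) -> str:
    """Build a PETSc options string from a dict.

    A value of None produces a bare flag, for example "-ksp_monitor".
    """
    parts = []
    for name, value in options.items():
        parts.append(f"-{name}")
        if value is not None:
            parts.append(str(value))
    return " ".join(parts)


def apply_petsc_options(options: dict, env=None) -> str:
    """Store the options string in PETSC_OPTIONS and return it.

    Must run before PETSc is initialized, otherwise it has no effect.
    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    if env is None:
        env = os.environ
    value = format_petsc_options(options)
    env["PETSC_OPTIONS"] = value
    return value


# Illustrative values only (assumption): a Krylov solver type, a
# preconditioner type, and a residual monitor flag.
EXAMPLE_OPTIONS = {"ksp_type": "bcgs", "pc_type": "gamg", "ksp_monitor": None}
```

Calling `apply_petsc_options(EXAMPLE_OPTIONS)` near the top of the setup file would set the variable for the current process; each MPI rank runs the setup file, so every rank gets the same options.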
GPU memory usage is not low: I see 4 GPUs using ~60 GB of memory each, as expected.
Sorry, my mistake: memory usage is high, but GPU utilization is very low. The highest is 20%.
Yes, that is another indicator that your GPUs aren't fully utilized, so you should probably not run on multiple GPUs in the first place.
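To tell memory usage and compute utilization apart, both can be read from `nvidia-smi` in CSV form. The following is a rough sketch, assuming `nvidia-smi` is on the path; the function names and the 50% threshold are arbitrary choices for illustration, not a recommendation from the Veros maintainers.

```python
import subprocess

# Query per-GPU index, compute utilization (%) and used memory (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse lines like "0, 20, 61440" into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "utilization_pct": int(util),
                "memory_used_mib": int(mem),
            }
        )
    return stats


def query_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def underutilized(stats: list, threshold_pct: int = 50) -> list:
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [s["index"] for s in stats if s["utilization_pct"] < threshold_pct]
```

Running `underutilized(query_gpu_stats())` while the model is stepping gives a quick list of devices that are mostly idle. A single sample can be misleading, so it is worth repeating the query a few times during a run.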
Hi!
I am trying to run veros with multi-GPU. It works when I run acc_benchmark.py, but when I try to run global_flexible.py with the command
mpirun -np 2 veros run global_flexible/global_flexible.py -n 1 2 --force-overwrite -b jax --device gpu
it seems that only one GPU is working. Could you please tell me what I should do?
Thanks in advance!