How to run veros with multi-GPU #417

Open
HuangLianghong opened this issue Jan 3, 2023 · 5 comments

@HuangLianghong

Hi!
I am trying to run Veros on multiple GPUs. It works when I run acc_benchmark.py, but when I run global_flexible.py with the command mpirun -np 2 veros run global_flexible/global_flexible.py -n 1 2 --force-overwrite -b jax --device gpu, it seems that only one GPU is working.
image
Could you please tell me what I should do?
Thanks in advance!

@dionhaefner
Collaborator

That usage is correct but you also need to instruct each process about which GPU to use (they default to the first one, so both processes are using the same device in your case).

You can use something like

import os
from mpi4py import MPI

# Expose a different GPU to each MPI process: rank 0 sees GPU 0, rank 1 sees GPU 1, ...
os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI.COMM_WORLD.Get_rank())

near the top of your setup file (before importing veros.core or JAX).
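
Note that the snippet above uses the global MPI rank, which matches a GPU index only when all processes run on a single node. For multi-node jobs, a minimal sketch might look like the following. It assumes the launcher exports a node-local rank (Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK, Slurm sets SLURM_LOCALID); the helper name pick_gpu is hypothetical and not part of Veros.

```python
import os

# Environment variables that common launchers set to the node-local rank.
# Assumption: your launcher is Open MPI's mpirun or Slurm's srun.
LOCAL_RANK_VARS = ("OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID")


def pick_gpu(env, default=0):
    """Return the GPU index for this process as a string (hypothetical helper)."""
    for name in LOCAL_RANK_VARS:
        if name in env:
            return str(int(env[name]))
    # No launcher variable found: fall back to the first GPU.
    return str(default)


# As above, this must run before importing veros.core or JAX.
os.environ["CUDA_VISIBLE_DEVICES"] = pick_gpu(os.environ)
```

If the scheduler already sets CUDA_VISIBLE_DEVICES per task, overriding it like this may not be what you want, so check your cluster configuration first.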

@Sougata18

Screenshot_veros

I have been trying to run Veros on multiple GPUs (4 GPUs) with the modification suggested above, but it randomly stops at an iteration and says 'solution diverged'. Also, the GPU memory usage is very low. Could you please suggest a solution?

The log file is attached below.
log.txt

@dionhaefner
Collaborator

dionhaefner commented Apr 12, 2024

> But it randomly stops at an iteration and says 'solution diverged'.

Looks like your solution diverged. By default, multi-GPU runs use a different linear solver than other configurations, so you might see divergence for runs that are stable in other settings. Please revise your time steps and/or solver settings.

You can also use different PETSc settings like this: https://github.com/dionhaefner/veros-01deg/blob/4f096b11206fecfac003047a234fcf25f92291a0/global_01deg/global_01deg.py#L14-L16

Note that multi-GPU runs for very high-res setups are not well explored so some manual tweaking is to be expected.

Unless the simulation doesn't fit on a single A100, I would stay away from multi-GPU runs (it probably won't even be faster than single-GPU in this case).

> Also, the GPU memory usage is very low.

It is not; I see 4 GPUs using ~60 GB of memory each, as expected.

@Sougata18

Sorry, my mistake. Memory usage is high, but GPU utilization is very low; the highest is 20%.

@dionhaefner
Collaborator

Yes, that is another indicator that your GPUs aren't fully utilized, so you probably shouldn't run on multiple GPUs in the first place.
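
To tell memory usage and compute utilization apart, something like the sketch below may help. It parses the CSV output of nvidia-smi's --query-gpu option. The function names and the 50% threshold are illustrative assumptions, not part of Veros. Keep in mind that JAX preallocates a large share of GPU memory at startup by default, so high memory usage alone does not show that the GPU is busy.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse rows like '0, 20, 61440, 81920' into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "util_percent": int(util),
                "mem_used_mib": int(used),
                "mem_total_mib": int(total),
            }
        )
    return stats


def underused(stats, threshold=50):
    """Return indices of GPUs below the (arbitrary) utilization threshold."""
    return [s["index"] for s in stats if s["util_percent"] < threshold]


def query_gpus():
    """Run nvidia-smi and return parsed stats (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling underused(query_gpus()) while the model is running lists the GPUs whose utilization is below the threshold, regardless of how much memory they hold.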
