rocm-smi --showpids reports the number of GPUs used by the process.
However, the presentation makes it easy to assume that it shows which GPUs are used.
Users of our application get confused by this and think that all the processes run on the same GPU:
    $ rocm-smi --showpids
    ========================= ROCm System Management Interface =========================
    ================================== KFD Processes ===================================
    get_compute_process_info_by_pid, Not supported on the given system
    get_compute_process_info_by_pid, Not supported on the given system
    get_compute_process_info_by_pid, Not supported on the given system
    get_compute_process_info_by_pid, Not supported on the given system
    KFD process information
    PID     PROCESS NAME   GPU(s)   VRAM USED   SDMA USED   CU OCCUPANCY
    55573   gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
    55571   gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
    55574   gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
    55572   gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
    ====================================================================================
    =============================== End of ROCm SMI Log ================================
Compare this with how nvidia-smi reports similar information, listing one row per GPU index for each process:
    $ nvidia-smi
    [......]
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A    211667      C   gmx                               320MiB |
    |    1   N/A  N/A    211667      C   gmx                               148MiB |
    +-----------------------------------------------------------------------------+
It would be better if the rocm-smi --showpids output made it clearer that it reports the number of GPUs used, not their indices.
The help output is also unclear about the differences between the two options:
    --showpids            Show current running KFD PIDs
    --showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]
                          Show GPUs used by specified KFD PIDs (all if no arg given)
With an old kernel, when rocm-smi cannot get the information, the output is even more confusing: instead of N/A, it reports 0, which can be read either as GPU #0 or as no GPUs being used. Neither reading is correct!
    $ rocm-smi --showpids
    ======================= ROCm System Management Interface =======================
    =============================== KFD Processes ================================
    Not supported on the given system
    Not supported on the given system
    Not supported on the given system
    Not supported on the given system
    KFD process information:
    PID      PROCESS NAME   GPU(s)   VRAM USED   SDMA USED   CU OCCUPANCY
    129835   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
    129836   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
    129834   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
    129837   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
    ==============================================================================
    ============================ End of ROCm SMI Log =============================
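To make the request concrete, here is a rough sketch of the kind of presentation I have in mind. It is not taken from the rocm-smi source: the function names, the `GPU COUNT` header, and the row layout are all hypothetical, and the only point is that the column is labelled as a count and that an unavailable value is shown as N/A rather than 0.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical row type for illustration only:
# (pid, process name, number of GPUs used, or None if it could not be queried)
Row = Tuple[int, str, Optional[int]]


def format_gpu_count(count: Optional[int]) -> str:
    """Render a GPU count, using N/A when the count is unavailable."""
    return "N/A" if count is None else str(count)


def format_pid_table(rows: Sequence[Row]) -> str:
    """Build a plain-text table whose header says 'GPU COUNT', not 'GPU(s)'."""
    lines = [f"{'PID':<8}{'PROCESS NAME':<14}{'GPU COUNT':<10}"]
    for pid, name, count in rows:
        lines.append(f"{pid:<8}{name:<14}{format_gpu_count(count):<10}")
    # Strip trailing padding so the output has no dangling spaces.
    return "\n".join(line.rstrip() for line in lines)


if __name__ == "__main__":
    # One process with a known count, one where the kernel gave no answer.
    print(format_pid_table([(55573, "gmx_mpi", 1), (129835, "gmx_mpi", None)]))
```

With a header like this, a value of 1 can no longer be mistaken for GPU #1, and a missing value is visibly different from a real count of zero.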
Operating System
SLES 15
GPU
MI250X
ROCm Component
rocm_smi_lib