
[Feature]: Better headers in --showpids #166

Open
al42and opened this issue May 2, 2024 · 0 comments

Suggestion Description

rocm-smi --showpids reports the number of GPUs used by the process.
However, the presentation makes it easy to assume that it shows which GPUs are used.

Users of our application are confused by this and think that all the processes run on the same GPU:

$ rocm-smi --showpids


========================= ROCm System Management Interface =========================
================================== KFD Processes ===================================
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
KFD process information
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55574   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55572   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
====================================================================================
=============================== End of ROCm SMI Log ================================

Compare this with how nvidia-smi reports similar information:

$ nvidia-smi 
[......]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211667      C   gmx                               320MiB |
|    1   N/A  N/A    211667      C   gmx                               148MiB |
+-----------------------------------------------------------------------------+

It would be better if the rocm-smi --showpids output made it clearer that it reports the number of GPUs used, not their indices.

The help output is also unclear about the differences between the two options:

  --showpids                                                       Show current running KFD PIDs
  --showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]                    Show GPUs used by specified KFD PIDs (all if no arg
                                                                   given)

With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither is correct!

$ rocm-smi  --showpids


======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
Not supported on the given system
Not supported on the given system
Not supported on the given system
Not supported on the given system
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
129835  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129836  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129834  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129837  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
================================================================================
============================= End of ROCm SMI Log ==============================
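As a rough illustration of the request, here is a minimal Python sketch of the kind of output that would avoid both problems. It is not the actual rocm-smi code: the function name `format_process_table`, the row layout, and the `NUM GPUS` header are all hypothetical, chosen only to show a header that names a count and an N/A placeholder for an unknown value.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical row type: (pid, process name, number of GPUs, or None if unknown).
ProcessRow = Tuple[int, str, Optional[int]]


def format_process_table(rows: Sequence[ProcessRow]) -> str:
    """Render a process table whose header names a count, not GPU indices."""
    header = f"{'PID':<8}{'PROCESS NAME':<16}{'NUM GPUS':<10}"
    lines = [header.rstrip()]
    for pid, name, num_gpus in rows:
        # Show N/A rather than 0 when the count could not be queried,
        # so it cannot be mistaken for "GPU #0" or "no GPUs in use".
        count = "N/A" if num_gpus is None else str(num_gpus)
        lines.append(f"{pid:<8}{name:<16}{count:<10}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_process_table([(55573, "gmx_mpi", 1), (129835, "gmx_mpi", None)]))
```

Any header wording that makes the count explicit would work equally well; the point is only that the column should not read like a list of GPU indices.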

Operating System

SLES 15

GPU

MI250X

ROCm Component

rocm_smi_lib
