[Issue]: Degraded IB performance #1377

Open
adk9 opened this issue Oct 14, 2024 · 0 comments
Problem Description

I'm running all_reduce_perf across two IB-enabled nodes and seeing much lower inter-node bandwidth over IB than expected.

Both nodes run kernel 6.5.0-1025-azure. On further debugging, the issue appears to stem from GDRDMA (GPUDirect RDMA) not being enabled.
I'm using the latest rccl develop (a680e32), which already includes commit 105ff16 (PR #1328) that is supposed to enable GDR on the HWE 6.5 kernel.

When looking at ncclIbGdrSupport() in src/transport/net_ib.cc, I see:

    if (strncmp("Hyper-V UEFI Release", strValue, 20) == 0) {
      int roMode = ncclParamIbPciRelaxedOrdering();
      NCCLCHECK(ncclTopoGetStrFromSys("/proc/sys/kernel", "numa_balancing", strValue));
      if (strcmp(strValue, "1") == 0 && roMode == 0)
        moduleLoaded = 0;
    } else if (moduleLoaded == 0) {
      // Check for `ib_register_peer_memory_client` symbol in `/proc/kallsyms`
      // if your system uses native OS ib_peer module
      char buf[256];
      FILE *fp = NULL;

In my case, the else-if block is never reached: I'm on Hyper-V UEFI Release v4.1, so the first branch is always taken and the `/proc/kallsyms` check is skipped.
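To make the control flow concrete, here is a minimal, simplified model of the branch structure quoted above. It is only a sketch: `GdrInputs`, `isHyperV`, `moduleLoadedAsQuoted` and `moduleLoadedWithFallback` are hypothetical names, and I'm assuming the else-if branch sets `moduleLoaded` when the symbol is found in `/proc/kallsyms`. The second function shows one possible restructuring, not a confirmed fix.

```cpp
#include <string>

// Hypothetical inputs standing in for the values ncclIbGdrSupport() reads.
struct GdrInputs {
  std::string firmwareString;  // the string compared against "Hyper-V UEFI Release"
  bool moduleLoaded;           // result of the earlier peer-memory module detection
  bool numaBalancing;          // /proc/sys/kernel/numa_balancing reads "1"
  int roMode;                  // value of ncclParamIbPciRelaxedOrdering()
  bool kallsymsHasSymbol;      // ib_register_peer_memory_client found in /proc/kallsyms
};

// Same prefix test as strncmp("Hyper-V UEFI Release", strValue, 20) == 0.
bool isHyperV(const std::string& s) {
  return s.compare(0, 20, "Hyper-V UEFI Release") == 0;
}

// Mirrors the quoted structure: on Hyper-V the kallsyms fallback is unreachable.
bool moduleLoadedAsQuoted(const GdrInputs& in) {
  bool loaded = in.moduleLoaded;
  if (isHyperV(in.firmwareString)) {
    if (in.numaBalancing && in.roMode == 0) loaded = false;
  } else if (!loaded) {
    loaded = in.kallsymsHasSymbol;  // assumed effect of the kallsyms scan
  }
  return loaded;
}

// One possible alternative (an assumption, not the maintainers' fix):
// run the kallsyms fallback first, then apply the Hyper-V specific veto.
bool moduleLoadedWithFallback(const GdrInputs& in) {
  bool loaded = in.moduleLoaded;
  if (!loaded) loaded = in.kallsymsHasSymbol;
  if (isHyperV(in.firmwareString) && in.numaBalancing && in.roMode == 0) loaded = false;
  return loaded;
}
```

With `firmwareString` set to "Hyper-V UEFI Release v4.1", `moduleLoaded` false and the symbol present, the first function reports the module as not loaded while the second reports it as loaded (unless NUMA balancing is on with relaxed ordering off).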

This looks like a bug, because ib_register_peer_memory_client is available on this kernel:

$ cat /proc/kallsyms  | grep ib_register_peer_memory_client
0000000000000000 b pfn_ib_register_peer_memory_client   [amdgpu]
0000000000000000 r __kstrtab_ib_register_peer_memory_client     [ib_uverbs]
0000000000000000 r __kstrtabns_ib_register_peer_memory_client   [ib_uverbs]
0000000000000000 r __ksymtab_ib_register_peer_memory_client     [ib_uverbs]
0000000000000000 r __crc_ib_register_peer_memory_client [ib_uverbs]
0000000000000000 r __export_symbol_ib_register_peer_memory_client       [ib_uverbs]
0000000000000000 t __pfx_ib_register_peer_memory_client [ib_uverbs]
0000000000000000 T ib_register_peer_memory_client       [ib_uverbs]
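For reference, the same check can be done programmatically with an exact match on the symbol name, so that entries such as `pfn_ib_register_peer_memory_client` or `__ksymtab_ib_register_peer_memory_client` don't count as hits. This is a rough sketch, not the code in net_ib.cc; `kallsymsExportsSymbol` is a hypothetical helper, and it assumes the usual `address type name [module]` line layout shown in the output above.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns true if some line of the kallsyms-style stream has `symbol`
// as its third whitespace-separated field (the symbol name).
bool kallsymsExportsSymbol(std::istream& kallsyms, const std::string& symbol) {
  std::string line;
  while (std::getline(kallsyms, line)) {
    std::istringstream fields(line);
    std::string address, type, name;
    // Any trailing "[module]" field is ignored.
    if (fields >> address >> type >> name && name == symbol) return true;
  }
  return false;
}
```

To run it against the live system, pass a `std::ifstream` opened on `/proc/kallsyms`; on the nodes above I'd expect it to match the `T ib_register_peer_memory_client [ib_uverbs]` line.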

Operating System

Ubuntu 22.04.5 LTS

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
