1 GiB headroom might be too small #220

Open · 0 comments
Namnamseo commented May 31, 2024

Environment

This is on a node with 8 A100-SXM4-80GB GPUs, connected to 6 NVSwitches.
I am not familiar with the topology in detail, but I can tell that:

  • each GPU has 12 NVLink connections (25 GB/s each), per nvidia-smi nvlink -s;
  • each NVSwitch has 16 NVLink connections, 2 per GPU (per /var/log/fabricmanager.log);
  • every pair of GPUs has a working NVLink-based connection (all NV12 in nvidia-smi topo -m).

NCCL version: 2.21.5
nccl-tests version: v2.13.9 (latest)

Issue

While testing near the upper boundary of the memory limit, we noticed what I suppose is a bug in nccl-tests.

Memory allocations

From what I see, the test initializes in two steps (sketched below):

  • allocate buffers in advance, sized at "max bytes";
  • call ncclCommInitAll() and start running the test.

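For context, here is a minimal standalone sketch of that pattern. This is my own reconstruction, not the actual common.cu code; the GPU count, buffer names, and the CUDACHECK()/NCCLCHECK() macros (the usual error-checking macros from the NCCL examples) are assumptions.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  printf("CUDA error %s at %s:%d\n", cudaGetErrorString(e), __FILE__, __LINE__); exit(1); } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  printf("NCCL error %s at %s:%d\n", ncclGetErrorString(r), __FILE__, __LINE__); exit(1); } } while (0)

int main() {
  const int nGpus = 8;                     // one rank per local GPU in this sketch
  const size_t maxBytes = 27971332778ull;  // the computed cap (see below)

  // Step 1: user buffers are allocated first, at maxBytes each.
  void* sendbuff[nGpus];
  void* recvbuff[nGpus];
  for (int i = 0; i < nGpus; i++) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaMalloc(&sendbuff[i], maxBytes));
    CUDACHECK(cudaMalloc(&recvbuff[i], maxBytes));  // plus an "expected" buffer when datacheck is on
  }

  // Step 2: only now are the communicators created; NCCL's own internal
  // buffers have to fit into whatever device memory is left over.
  ncclComm_t comms[nGpus];
  NCCLCHECK(ncclCommInitAll(comms, nGpus, NULL));
  return 0;
}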
This "max bytes" is determined with a couple of factors:

  • collective operation we're using (each operation has its own way to calculate memory footprint: AllReduceGetCollByteCount(), ReduceScatterGetCollByteCount(), ...)
  • total device memory (cudaDeviceProp::totalGlobalMem)
  • CLI argument

Specifically, the test caps the max bytes with: (total device memory - 1GiB) / 3.
(assuming default settings where datacheck is enabled (-c=1))

size_t memMaxBytes = (maxMem - (1<<30)) / (datacheck ? 3 : 2);

This results in a limit of exactly 27971332778 bytes (about 26.05 GiB) on my machine.
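As a sanity check (my own arithmetic, not code from the test), working that formula backwards gives a plausible totalGlobalMem for this card:

/* memMaxBytes = (maxMem - 1 GiB) / 3  =>  maxMem ~= 3 * memMaxBytes + 1 GiB (integer division loses at most 2 bytes) */
size_t memMaxBytes = 27971332778ull;             /* the cap computed on this machine (~26.05 GiB) */
size_t maxMem = 3 * memMaxBytes + (1ull << 30);  /* = 84987740158 bytes ~= 79.15 GiB, a bit under the card's 80 GiB */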

Problem

However, we are seeing errors with high values of the max_bytes (-e) parameter.
For example, on this A100 with 80 GiB of memory, passing -e 27766464617 (about 25.86 GiB) crashes the all_reduce_perf test during the call to ncclCommInitAll().
The breaking point is somewhere around that value; -e 27066464617 (about 25.21 GiB) does not crash.

(Note that all_reduce_perf has one of the highest GPU memory footprints (link: all_reduce.cu). Some other tests divide the payload across ranks, so the same parameter works fine for ./reduce_scatter_perf, for example.)

NCCL trace shows it's an OOM:

$ NCCL_DEBUG=WARN NCCL_DEBUG_SUBSYS=ALL ./all_reduce_perf -b 27766464617 -e 27766464617 -w 0 -n 1 -t 4
# nThread 4 nGpus 1 minBytes 27766464617 maxBytes 27766464617 step: 1048576(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 1206071 on <node name> device  0 [0x00] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 1206071 on <node name> device  1 [0x00] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 1206071 on <node name> device  2 [0x00] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 1206071 on <node name> device  3 [0x00] NVIDIA A100-SXM4-80GB
NCCL version 2.21.5+cuda12.5
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 27766464616    6941616154     float     sum      -1
<node name>:1206071:1206148 [1] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. <node name> pid 1206071: Test failure common.cu:377
 .. <node name> pid 1206071: Test failure common.cu:413
 .. <node name> pid 1206071: Test failure common.cu:603
 .. <node name> pid 1206071: Test failure all_reduce.cu:90
 .. <node name> pid 1206071: Test failure common.cu:615

<node name>:1206071:1206147 [2] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. <node name> pid 1206071: Test failure common.cu:377
 .. <node name> pid 1206071: Test failure common.cu:413
 .. <node name> pid 1206071: Test failure common.cu:603
 .. <node name> pid 1206071: Test failure all_reduce.cu:90
 .. <node name> pid 1206071: Test failure common.cu:615

With a smaller NCCL_BUFFSIZE, such as 65536 (64 KiB instead of the default 4194304 = 4 MiB), the test actually passes.
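For reference, the passing run looks like this (the same invocation as the failing one above, just with NCCL's buffer size lowered via the standard NCCL_BUFFSIZE environment variable):

NCCL_BUFFSIZE=65536 ./all_reduce_perf -b 27766464617 -e 27766464617 -w 0 -n 1 -t 4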

I am not sure whether NCCL really allocates as many as 256* buffers,
but with alignment, fragmentation, and so on, I would not be surprised.
NVSwitch may also play a role: more links, more buffers.
* 256 = 1 GiB headroom / 4 MiB default buffsize

Changing 1<<30 to 1ull<<31 in that line (maxMem - (1<<30)) immediately solved the problem,
and it now works in every case I have tried, however absurd the max_bytes (-e) value gets.
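For concreteness, this is the change I tested against the line quoted earlier (diff-style; the 2 GiB value is simply what happened to work for me, not a tuned constant):

- size_t memMaxBytes = (maxMem - (1<<30)) / (datacheck ? 3 : 2);
+ size_t memMaxBytes = (maxMem - (1ull<<31)) / (datacheck ? 3 : 2);  // 2 GiB headroom instead of 1 GiB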

I think we either need a larger default headroom (possibly scaled by per-GPU memory size, system RAM, or GPU count), a way to parameterize it, or at least a warning to the user about this.
