The network bandwidth in the alltoall_perf test failed to meet expectations #209

Open
fj1425fj opened this issue Apr 20, 2024 · 4 comments

@fj1425fj

With the ib_write_bw tool, RoCE bond network bandwidth can reach 180+ GB/s per NIC (mlx5_bond_x).
With four nodes the alltoall test results were as expected, but with three nodes the bandwidth was only about half the expected value.
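
For reference, the exact perftest invocation isn't shown in this issue; a typical point-to-point baseline over a bond device looks something like the following (illustration only: -d selects the device, -x the GID index, and <server_ip> is a placeholder for the server's bond0 address):

ib_write_bw -d mlx5_bond_0 -x 3 --report_gbits              # on the server
ib_write_bw -d mlx5_bond_0 -x 3 --report_gbits <server_ip>  # on the client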

Have you encountered this phenomenon before? What could cause it? Looking forward to your reply.

The nccl-tests command and results follow:

mpirun --allow-run-as-root --host xxxx \
  -x UCX_NET_DEVICES=mlx5_bond_0:1 \
  -x UCX_IB_GID_INDEX=3 \
  -x LD_LIBRARY_PATH=/root/nccl-bond/build/lib:$LD_LIBRARY_PATH \
  -x NCCL_SOCKET_IFNAME==bond0 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  -x NCCL_IB_TC=136 \
  -x NCCL_IB_HCA==mlx5_bond_0 \
  -x NCCL_P2P_DISABLE=1 \
  -x NCCL_SHM_DISABLE=1 \
  /home/test/nccl-tests/build/alltoall_perf -b 2M -e 4096M -f 2 -g 2 -n 20

Test results with four nodes (8 ranks):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2682161 on server1 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 2682162 on server1 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 3139299 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 3139300 on  server2 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid  50064 on  server3 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid  50065 on  server3 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 2672680 on server4 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 2672681 on server4 device  2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864       2097152     float    none      -1   2665.2   25.18   22.03      0   2662.0   25.21   22.06    N/A
   134217728       4194304     float    none      -1   5224.1   25.69   22.48      0   5264.2   25.50   22.31    N/A
   268435456       8388608     float    none      -1    10289   26.09   22.83      0    10334   25.97   22.73    N/A
   536870912      16777216     float    none      -1    20513   26.17   22.90      0    20585   26.08   22.82    N/A
  1073741824      33554432     float    none      -1    40882   26.26   22.98      0    41022   26.17   22.90    N/A
  2147483648      67108864     float    none      -1    81711   26.28   23.00      0    81959   26.20   22.93    N/A
  4294967296     134217728     float    none      -1   163115   26.33   23.04      0   163963   26.19   22.92    N/A

Test results with three nodes (6 ranks):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2617867 on server1 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 2617868 on server1 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 3103671 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 3103672 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 2637126 on server3 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 2637127 on server3 device  2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108848       2796202     float    none      -1   6499.2   10.33    8.60      0   6136.2   10.94    9.11    N/A
   134217720       5592405     float    none      -1    14519    9.24    7.70      0    13511    9.93    8.28    N/A
   268435440      11184810     float    none      -1    26193   10.25    8.54      0    23691   11.33    9.44    N/A
   536870904      22369621     float    none      -1    58246    9.22    7.68      0    54668    9.82    8.18    N/A
  1073741808      44739242     float    none      -1   105248   10.20    8.50      0    93663   11.46    9.55    N/A
  2147483640      89478485     float    none      -1   233191    9.21    7.67      0   221382    9.70    8.08    N/A
  4294967280     178956970     float    none      -1   420496   10.21    8.51      0   395454   10.86    9.05    N/A
@sjeaugey
Member

It seems that on server2, rank 3 is using device 0 instead of device 2. I'm not actually sure how that's possible given rank 2 is already using that device, but maybe there is an error in the launch script, so you end up with two ranks using the same NIC?

@fj1425fj
Author

Sorry, I made a mistake while editing. Rank 3 is using device 2.

@sjeaugey
Member

sjeaugey commented Apr 25, 2024

The bad performance might just be a misalignment issue. If you look at the element counts, every other size is aligned to 2 elements and the rest only to 1. Given those are floats, we're aligned to 4 bytes or 8 bytes, but never to the 16 bytes that give good performance.
That's because we divide the total size by the number of ranks, so when you run on a number of ranks that is not a power of two, you should use a start size that's a multiple of the number of ranks, e.g. -b 3M instead of -b 2M.
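
To make the arithmetic concrete, here is a quick bash sketch (illustration only, not part of nccl-tests) reproducing that pattern for the first few sizes in the tables above:

# Per-rank chunk size and byte alignment when the element count is split across ranks.
for nranks in 8 6; do
  echo "--- $nranks ranks ---"
  size=$((64 * 1024 * 1024))               # 64 MiB, the first size in the tables
  for step in 1 2 3 4; do
    elements=$(( size / 4 ))               # float = 4 bytes
    chunk=$(( (elements / nranks) * 4 ))   # bytes each rank sends to each peer
    align=$(( chunk & -chunk ))            # largest power-of-two divisor of the chunk size
    [ "$align" -gt 16 ] && align=16        # only 16-byte alignment matters here
    echo "total $size B -> per-rank chunk $chunk B, ${align}-byte aligned"
    size=$(( size * 2 ))
  done
done

With 8 ranks every chunk comes out 16-byte aligned; with 6 ranks the chunks alternate between 8- and 4-byte alignment, matching the element counts in the three-node table. Starting at -b 3M with -f 2 keeps every per-rank chunk a power-of-two byte count, hence 16-byte aligned.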

@fj1425fj
Author

fj1425fj commented Jul 2, 2024

Thank you for your answer. I tested further: this phenomenon does not occur when each NIC uses an independent IP instead of a bond. Do you know why?
