RoCE bond network bandwidth can reach 180+ GB/s per NIC (mlx5_bond_x) when using the ib_write_bw tool.
When I used four devices, the alltoall test results were as expected, but with three devices the bandwidth was only half of what I expected.
Have you ever encountered this phenomenon?
What are the possible reasons for this phenomenon? Looking forward to your reply.
It seems on server2 rank 3 is not using device 2 but device 0 instead. I'm not actually sure how that's possible given rank 2 is also using the same device, but maybe there is an error in the launch script so you end up with 2 ranks using the same NICs?
The bad performance might just be a misalignment issue. If you look at the number of elements, the sizes alternate between being aligned to 2 elements and being aligned to 1 element. Given those are floats, the buffers are aligned to 8 bytes or 4 bytes, but never to 16 bytes, which is what gives good performance.
That's because we divide the total size by the number of ranks, so when you run on a number of ranks that is not a power of two, you should use a start size that's a multiple of the number of ranks. E.g. -b 3M instead of -b 2M.
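To make the alignment argument concrete, here is a rough Python sketch of the arithmetic. It assumes the per-rank element count is simply the total float count divided by the number of ranks, rounded down; the exact rounding inside nccl-tests may differ between versions. The helper names `per_rank_bytes` and `pow2_alignment` are made up for illustration and are not part of nccl-tests.

```python
ELT_SIZE = 4  # bytes per float, the data type discussed above


def per_rank_bytes(total_bytes, nranks, elt_size=ELT_SIZE):
    """Bytes sent to each peer, assuming count = floor(total_elements / nranks)."""
    elements = total_bytes // elt_size
    return (elements // nranks) * elt_size


def pow2_alignment(nbytes, cap=16):
    """Largest power of two (at most `cap`) that divides nbytes."""
    align = 1
    while align < cap and nbytes % (align * 2) == 0:
        align *= 2
    return align


if __name__ == "__main__":
    MiB = 1 << 20
    for label, start in (("-b 2M", 2 * MiB), ("-b 3M", 3 * MiB)):
        print(f"{label}, 3 ranks, doubling each step (-f 2):")
        size = start
        for _ in range(4):
            chunk = per_rank_bytes(size, 3)
            print(f"  total={size:>9} B  per-rank={chunk:>9} B  "
                  f"alignment={pow2_alignment(chunk)} B")
            size *= 2
```

Under that assumption, starting at 2M with three ranks gives per-rank chunks that alternate between 8-byte and 4-byte alignment, while starting at 3M keeps every chunk 16-byte aligned, which matches the suggestion to use -b 3M.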
The nccl-tests command and results are as follows:
mpirun --allow-run-as-root --host xxxx -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_GID_INDEX=3 -x LD_LIBRARY_PATH=/root/nccl-bond/build/lib:$LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME==bond0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=136 -x NCCL_IB_HCA==mlx5_bond_0 -x NCCL_P2P_DISABLE=1 -x NCCL_SHM_DISABLE=1 /home/test/nccl-tests/build/alltoall_perf -b 2M -e 4096M -f 2 -g 2 -n 20
Test results of four devices:
Test results of three devices: