NCCL_ALGO on multi-node and multi-GPU #215

MajidSalimi · 2024-05-20T13:49:06Z

Hi.

I have been running NCCL_TESTS on a multi-node, multi-GPU environment with NCCL 2.19.3-1 and OpenMPI 4.1.6. Each node has 4 NVIDIA V100 GPUs interconnected with NVLink and PCIe.

How is the NCCL_ALGO chosen by default, and what is the decision logic for choosing the algorithms for inter-node and intra-node communications?
If I specify NCCL_ALGO=Ring and at the same time set OMPI_MCA_coll_tuned_use_dynamic_rules=1 and choose an algorithm with coll_tuned_allreduce_algorithm, how will the final algorithm be chosen? Does it go with the NCCL one or the MCA one? Or is one used for inter-node and the other for intra-node communication?


sjeaugey · 2024-05-21T07:03:04Z

We have an internal model which compares the performance of the different algorithms and (hopefully) chooses the best one.
You're mixing up NCCL and MPI. The OMPI_ setting controls MPI and NCCL does not use MPI (even for inter-node communication). MPI is only used by the NCCL tests to spawn tasks and help with the CPU-CPU synchronization, but it's not required by NCCL, at all.
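
To make the separation concrete, here is a minimal Python sketch that sorts settings by the library that reads them and builds an `mpirun` command line for an NCCL tests binary. The helper names (`split_settings`, `build_mpirun_command`), the process count, and the binary path are hypothetical and only for illustration. Classifying by name prefix is a simplification. The sketch assumes Open MPI's `--mca` option for MCA parameters and `-x` for exporting environment variables to the ranks.

```python
import shlex


def split_settings(settings):
    """Partition settings by which library reads them, using the name prefix."""
    nccl = {k: v for k, v in settings.items() if k.startswith("NCCL_")}
    mpi = {k: v for k, v in settings.items() if k.startswith("OMPI_MCA_")}
    return nccl, mpi


def build_mpirun_command(settings, num_procs, test_binary):
    """Build an mpirun argument list (hypothetical helper, not part of NCCL)."""
    nccl, mpi = split_settings(settings)
    cmd = ["mpirun", "-np", str(num_procs)]
    for name, value in mpi.items():
        # --mca takes the parameter name without the OMPI_MCA_ prefix.
        # These only affect MPI's own collectives, not NCCL's.
        cmd += ["--mca", name[len("OMPI_MCA_"):], str(value)]
    for name, value in nccl.items():
        # -x exports the variable to every rank, where NCCL reads it.
        cmd += ["-x", f"{name}={value}"]
    cmd.append(test_binary)
    return cmd


settings = {
    "NCCL_ALGO": "Ring",   # read by NCCL only
    "NCCL_DEBUG": "INFO",  # read by NCCL only
    "OMPI_MCA_coll_tuned_use_dynamic_rules": "1",  # read by Open MPI only
}

# Assumed layout: 2 nodes x 4 GPUs with one process per GPU; path is illustrative.
command = build_mpirun_command(settings, 8, "./build/all_reduce_perf")
print(shlex.join(command))
```

In this sketch, only the `NCCL_*` variables can influence which algorithm NCCL uses. The `--mca` parameters would matter only for collectives issued through MPI itself.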
