Adding default XLA/GPU env vars for all JAX-based containers #114

Merged
9 commits merged into main from jax-env-vars on Aug 18, 2023

Conversation

terrykong
Contributor

@terrykong terrykong commented Jul 11, 2023

Addresses #105 and #109. The following flags and environment variables are added as defaults; a sketch of how they might appear in the Dockerfile follows the list.

  • --xla_gpu_enable_latency_hiding_scheduler=true: Allows XLA:GPU to move communication collectives to increase overlap with compute kernels.
  • --xla_gpu_enable_async_all_gather=true: Allows XLA:GPU to run All Gather NCCL kernels on a separate CUDA stream to allow overlap with compute kernels.
  • --xla_gpu_enable_async_reduce_scatter=true: Allows XLA:GPU to run Reduce Scatter NCCL kernels on a separate CUDA stream to allow overlap with compute kernels.
  • --xla_gpu_enable_triton_gemm=false: Disallows Triton GeMM kernels; uses cuBLAS GeMM kernels instead. cuBLAS kernels are currently better tuned for GPUs and thus provide better performance.
  • CUDA_DEVICE_MAX_CONNECTIONS=1: Uses a single hardware work queue for GPU work, which lowers the latency of each stream operation. This is safe because XLA already orders its kernel launches.
  • NCCL_IB_SL=1: Sets the InfiniBand Service Level used by NCCL traffic to 1.
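
For reference, a rough sketch of how these defaults might be expressed in the container Dockerfile (the exact ENV layout in the actual commits may differ; the --xla_gpu_* options are consumed by XLA via the XLA_FLAGS environment variable):

```dockerfile
# Sketch only; the ENV statements in the merged Dockerfile may be laid out differently.
# XLA:GPU picks up the --xla_gpu_* options from the XLA_FLAGS environment variable.
ENV XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_async_all_gather=true --xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_triton_gemm=false"

# Single hardware work queue per GPU; safe because XLA already orders its launches.
ENV CUDA_DEVICE_MAX_CONNECTIONS=1

# InfiniBand Service Level used by NCCL traffic.
ENV NCCL_IB_SL=1
```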

@nluehr @ashors1

@terrykong terrykong changed the title from "Adding default env vars for all JAX-based containers that should improve perf" to "Adding default XLA/GPU env vars for all JAX-based containers" on Jul 11, 2023
@terrykong terrykong requested review from yhtang and nluehr July 11, 2023 17:01
@yhtang
Collaborator

yhtang commented Jul 17, 2023

Could someone please edit the PR description to document the rationale behind each flag?

@terrykong
Contributor Author

Could someone please edit the PR description to document the rationale behind each flag?

@nluehr @abhinavgoel95 ?

@abhinavgoel95
Contributor

  1. --xla_gpu_enable_latency_hiding_scheduler=true: Allows XLA:GPU to move communication collectives to increase overlap with compute kernels.
  2. --xla_gpu_enable_async_all_gather=true: Allows XLA:GPU to run All Gather NCCL kernels on a separate CUDA stream to allow overlap with compute kernels.
  3. --xla_gpu_enable_async_reduce_scatter=true: Allows XLA:GPU to run Reduce Scatter NCCL kernels on a separate CUDA stream to allow overlap with compute kernels.
  4. --xla_gpu_enable_triton_gemm=false: Disallows Triton GeMM kernels; uses cuBLAS GeMM kernels instead.

@nluehr
Contributor

nluehr commented Jul 17, 2023

@abhinavgoel95 what about the following?
NCCL_PROTO=LL128
NCCL_AVOID_RECORD_STREAMS=1
NCCL_IB_SL=1

@yhtang
Collaborator

yhtang commented Jul 18, 2023

@terrykong terrykong self-assigned this Jul 18, 2023
@nluehr
Contributor

nluehr commented Jul 19, 2023

@abhinavgoel95 How much speedup do we see from setting NCCL_PROTO=LL128 in practice?
The combiner threshold may come into play here, since for larger collectives I think LL128 could reduce effective bandwidth by ~6%.

@nluehr
Contributor

nluehr commented Jul 20, 2023

From my reading of the NCCL code, the only safe way to enable LL128 is to not specify NCCL_PROTO at all.
LL128 is valid only on Volta, Ampere, and Hopper with NVLink; PCIe links can cause packet re-ordering that results in silent corruption.
Since we expect users to run our models on a variety of clusters, I think we shouldn't define NCCL_PROTO in either the JAX dockerfile or the paxml run scripts. If removing it results in worse performance, we should open a bug with the NCCL team.

@ashors1
Contributor

ashors1 commented Jul 21, 2023

We're also sometimes interested in setting XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 to increase the fraction of GPU memory preallocated to XLA. Is this something we'd want to consider adding to the base container as well?
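
For illustration, a minimal sketch of what that setting would look like (where it is set, container vs. run script, and the exact value are workload-dependent):

```dockerfile
# Sketch only: preallocate 90% of GPU memory to XLA rather than the default fraction.
# Workload-dependent; not necessarily appropriate for the base JAX container.
ENV XLA_PYTHON_CLIENT_MEM_FRACTION=0.9
```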

@nluehr
Contributor

nluehr commented Jul 21, 2023

Whether we want to set XLA_PYTHON_CLIENT_MEM_FRACTION depends on the workload (particularly whether any other GPU libraries are being used outside of XLA).

My opinion is that this would be OK to set in the paxml container, but we shouldn't set it in the general JAX container, since we expect users to extend that one for a variety of workflows.

@yhtang
Collaborator

yhtang commented Jul 25, 2023

NCCL_AVOID_RECORD_STREAMS is NOT documented anywhere. Shall we open an NVBug?

@nluehr
Contributor

nluehr commented Jul 26, 2023

As best I can tell, NCCL_AVOID_RECORD_STREAMS is a feature of PyTorch rather than NCCL. (It seems it has since been renamed to TORCH_NCCL_AVOID_RECORD_STREAMS here.)
So I think it's safe to drop it entirely.

Collaborator

@yhtang yhtang left a comment

Did we figure out a source for the undocumented NCCL_AVOID_RECORD_STREAMS variable?

@terrykong
Contributor Author

@abhinavgoel95

@abhinavgoel95
Contributor

@yhtang @terrykong It is safe to drop NCCL_AVOID_RECORD_STREAMS. We do not need to upstream this; it is a PyTorch-specific feature.

@terrykong
Contributor Author

I removed mention of that env var. Can I re-request your review, @yhtang?

@terrykong terrykong requested a review from yhtang August 17, 2023 20:14
@yhtang yhtang merged commit 11987e0 into main Aug 18, 2023
36 of 41 checks passed
@yhtang yhtang deleted the jax-env-vars branch August 18, 2023 21:29