NVIDIA · yhtang · Aug 18, 2023 · Jul 11, 2023 · Jul 25, 2023 · Jul 25, 2023
diff --git a/.github/container/Dockerfile.jax b/.github/container/Dockerfile.jax
@@ -47,6 +47,10 @@ ARG SRC_PATH_JAX
 ARG SRC_PATH_XLA
 ARG BUILD_DATE
 ENV BUILD_DATE=${BUILD_DATE}
+# The following environment variables tune performance
+ENV XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_async_all_gather=true --xla_gpu_enable_async_reduce_scatter=true --xla_gpu_enable_triton_gemm=false"
+ENV CUDA_DEVICE_MAX_CONNECTIONS=1
+ENV NCCL_IB_SL=1
 
 COPY --from=jax-builder ${SRC_PATH_JAX}-no-git ${SRC_PATH_JAX}
 COPY --from=jax-builder ${SRC_PATH_XLA}-no-git ${SRC_PATH_XLA}
@@ -73,4 +77,4 @@ ARG SRC_PATH_XLA
 ADD build-jax.sh local_cuda_arch test-jax.sh /usr/local/bin/
 
 COPY --from=jax-builder ${SRC_PATH_JAX}/.git ${SRC_PATH_JAX}/.git
-COPY --from=jax-builder ${SRC_PATH_XLA}/.git ${SRC_PATH_XLA}/.git
+COPY --from=jax-builder ${SRC_PATH_XLA}/.git ${SRC_PATH_XLA}/.git
diff --git a/README.md b/README.md
@@ -160,3 +160,19 @@ We currently enable training and evaluation for the following models:
 | [t5(t5x)](./rosetta/rosetta/projects/t5x) | ✔️ | ✔️ | ✔️ |
 
 We will update this table as new models become available, so stay tuned.
+
+## Environment Variables
+
+The [JAX image](https://ghcr.io/nvidia/jax) sets the following XLA flags and environment variables for performance tuning:
+
+| XLA Flags | Value | Explanation |
+| --------- | ----- | ----------- |
+| `--xla_gpu_enable_latency_hiding_scheduler` | `true`  | allows XLA to move communication collectives to increase overlap with compute kernels |
+| `--xla_gpu_enable_async_all_gather` | `true` | allows XLA to run NCCL [AllGather](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html#allgather) kernels on a separate CUDA stream to allow overlap with compute kernels |
+| `--xla_gpu_enable_async_reduce_scatter` | `true` | allows XLA to run NCCL [ReduceScatter](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html#reducescatter) kernels on a separate CUDA stream to allow overlap with compute kernels |
+| `--xla_gpu_enable_triton_gemm` | `false` | use cuBLAS instead of Triton GeMM kernels |
+
+| Environment Variable | Value | Explanation |
+| -------------------- | ----- | ----------- |
+| `CUDA_DEVICE_MAX_CONNECTIONS` | `1` | use a single queue for GPU work to lower latency of stream operations; OK since XLA already orders launches |
+| `NCCL_IB_SL` | `1` | defines the [InfiniBand Service Level](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-sl) |
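
Because the image bakes these values in with `ENV`, a user who wants a different setting for one flag has to override `XLA_FLAGS` as a whole. The sketch below shows one possible way to change a single flag while keeping the other defaults. The helper `merge_xla_flags` and the constant `IMAGE_DEFAULTS` are hypothetical names introduced for illustration and are not part of the image or of JAX; the sketch also assumes the variable is set before JAX initializes its backend, since XLA is expected to read `XLA_FLAGS` at that point.

```python
import os

# Copy of the XLA_FLAGS value set in Dockerfile.jax by this change
# (hypothetical constant, used only as a fallback for illustration).
IMAGE_DEFAULTS = (
    "--xla_gpu_enable_latency_hiding_scheduler=true "
    "--xla_gpu_enable_async_all_gather=true "
    "--xla_gpu_enable_async_reduce_scatter=true "
    "--xla_gpu_enable_triton_gemm=false"
)


def merge_xla_flags(current: str, overrides: dict) -> str:
    """Return a flag string with `overrides` applied on top of `current`.

    Flags are whitespace-separated `--name=value` tokens; a later value
    for the same name replaces the earlier one, and order is preserved.
    """
    flags = {}
    for token in current.split():
        name, _, value = token.partition("=")
        flags[name] = value
    flags.update(overrides)
    # Tokens that had no "=value" part are emitted as bare names.
    return " ".join(f"{k}={v}" if v else k for k, v in flags.items())


# Example: re-enable Triton GEMM kernels, keep everything else as shipped.
merged = merge_xla_flags(
    os.environ.get("XLA_FLAGS", IMAGE_DEFAULTS),
    {"--xla_gpu_enable_triton_gemm": "true"},
)
os.environ["XLA_FLAGS"] = merged  # do this before `import jax`
```

The same result can be obtained without any code by passing a complete `XLA_FLAGS` string to the container at launch; the helper is only convenient when a script needs to adjust one flag and leave the rest untouched.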