Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support of Amazon EFA #179

Merged
merged 12 commits into from
Sep 1, 2023
Merged

Add support of Amazon EFA #179

merged 12 commits into from
Sep 1, 2023

Conversation

DwarKapex
Copy link
Contributor

@DwarKapex DwarKapex commented Aug 25, 2023

Address issue: EFA Support #167

The upstream JAX container contains only MOFED NIC support. The MOFED package
from Mellanox that we use installs *ibverbs* libraries which do not contain
libefa*.so which are required for AWS.

A temporary solution is to provide a script as a part of base container
(/usr/local/bin/install-efa.sh) that AWS user can run inside the container
to handle this issue.
The script does the following:
1. Remove all *ibverbs* and RDMA related libraries
2. Download Amazon EFA installer
3. Install EFA

How to use: in the running container run install-efa.sh script:
root@<container-id> $> install-efa.sh

.github/container/install-efa.sh Outdated Show resolved Hide resolved
.github/container/install-efa.sh Outdated Show resolved Hide resolved
@sbhavani
Copy link
Contributor

Do we have support for all the packages required to run on AWS? Here's a reference for customizing an NGC image: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/ubuntu18.04/Dockerfile-cu11-pt20.10.sagemaker

@DwarKapex
Copy link
Contributor Author

Do we have support for all the packages required to run on AWS? Here's a reference for customizing an NGC image: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/ubuntu18.04/Dockerfile-cu11-pt20.10.sagemaker

Not sure, need to test. NGC image that you suggested uses efa-installer from amazon. That was my initial approach, but I found a conflict between efa-installer and MOFED: both installs libibverbs package. Interesting, that MOFED implementation of libibverbs does not contain libefa*, meaning, that if the installation order is MOFED -> efa-installer, MOFED implementation will be overwritten, in other way around efa-instller -> MOFED, libefa* will be unavailable.

@yhtang what is so important in MOFED? Can we use Amazon implementation instead?

.github/container/install-efa.sh Outdated Show resolved Hide resolved
@sbhavani
Copy link
Contributor

I think we should have a separate image or static Dockerfile example for AWS.

I remember seeing that some of our proprietary MOFED bits conflict with Amazon's so it's been difficult to ship both in NGC.

@nouiz
Copy link
Collaborator

nouiz commented Aug 28, 2023

What is our solution for NGC containers?
Can we do the same?

Can we detect that we are on AWS when we launch the container?
If so, we could print a warning and suggested to run a script to replace MOFED with EFA.
We could also provide a docker that have this already run.

But even with a different container, some users will take the wrong one, so if we can warn, it would be great.

@mjsML
Copy link
Member

mjsML commented Aug 28, 2023

Can we add .network in the image name? i.e pax.IB.py3 and pax.EFA.py3 or something similar?

@nouiz
Copy link
Collaborator

nouiz commented Aug 28, 2023

nitpik, it would be on the jax container, so it will propagate to all containers: {jax,t5x,pax}.IB.py3 and {jax,t5x,pax}.EFA.py3

@mjsML
Copy link
Member

mjsML commented Aug 28, 2023

sgtm

@sbhavani
Copy link
Contributor

I don't think we need to call out IB, it's included by default in NGC containers and doesn't need a separate tag. Could we keep the extra tag just for efa?

@andportnoy
Copy link
Contributor

I think we should have a separate image or static Dockerfile example for AWS.

nitpik, it would be on the jax container, so it will propagate to all containers: {jax,t5x,pax}.IB.py3 and {jax,t5x,pax}.EFA.py3

Use multi-stage builds like we do already for the JAX container?

Then we could build two separate images using the same Dockerfile:

docker build -f Dockerfile.base --target generic --tag base-generic .
docker build -f Dockerfile.base --target aws     --tag base-aws     .

As a workaround the strategy is the following:
1. Build the base image with MOFED support installed.
2. Add an extra script to the image (/usr/local/bin/install-efa.sh)
   to be abel to install Amazon EFA per request.
@DwarKapex
Copy link
Contributor Author

Make changes per today's morning discussions:
As a workaround the strategy is the following:

  1. Build the base image with MOFED support installed.
  2. Add an extra script to the image (/usr/local/bin/install-efa.sh) to be able to install Amazon EFA per request.

@yhtang module cmd is an interesting approach, but IMHO in this case is overkill. How to add a release note on How to use it?

@sbhavani
Copy link
Contributor

Make changes per today's morning discussions: As a workaround the strategy is the following:

  1. Build the base image with MOFED support installed.
  2. Add an extra script to the image (/usr/local/bin/install-efa.sh) to be able to install Amazon EFA per request.

@yhtang module cmd is an interesting approach, but IMHO in this case is overkill. How to add a release note on How to use it?

for the how to use it, we should try to keep the library versions as close to what AWS supports for it's DL containers: https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.13/py3/cu118/Dockerfile.gpu

@DwarKapex
Copy link
Contributor Author

for the how to use it, we should try to keep the library versions as close to what AWS supports for it's DL containers: https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.13/py3/cu118/Dockerfile.gpu

I think, they manually increment the version of Amazon EFA installer. Don't know if it is a good idea, but if it's really matter I can keep in sync the DL containers version and ours, not a big deal (@sbhavani what do you think?). As a reference: they use version 1.24.0, we use latest, which is 1.25.0 as of now.

@yhtang
Copy link
Collaborator

yhtang commented Aug 29, 2023

This is at least good for unblocking people that need EFA, so I'd say let's go ahead and merge it.

In the long run, I still believe a more systematic approach, such as the modulefiles solution is better in terms of unifying the container bases. There are potential drawbacks of letting users run the installer script after pulling the image, including

  • making the container no longer self-contained for archival and bug-repro purposes.
  • requires network connection when running the container (may be mitigated by distributing the downloaded installer package, but bloats image)
  • longer pull time, bigger image.
  • etc...

yhtang
yhtang previously approved these changes Aug 29, 2023
@yhtang
Copy link
Collaborator

yhtang commented Aug 29, 2023

Before merging, @DwarKapex could you please update the PR description so that it documents the approach taken, and that it is a stopgap solution to unblock users, while further work is still needed to harden it?

@sbhavani
Copy link
Contributor

Keeping the versions in sync manually is fine, I don't think it changes often (except for new hardware bring up like p5). Thanks for adding this!

DwarKapex and others added 2 commits August 31, 2023 13:50
Issue addressed #167
    
    The upstream JAX container contains only MOFED NIC support. The MOFED package
    from Mellanox that we use installs *ibverbs* libraries which do not contain
    libefa*.so which are required for AWS.
    
    A temporary solution is to provide a script as a part of base container
    (/usr/local/bin/install-efa.sh) that AWS user can run inside the container
    to handle this issue.
    The script does the following:
    1. Remove all *ibverbs* and RDMA related libraries
    2. Download Amazon EFA installer
    3. Install EFA
@yhtang
Copy link
Collaborator

yhtang commented Aug 31, 2023

/assistant summarize the key takeaways of the discussion using concise bullet points

@github-actions
Copy link

  • The EFA Support EFA Support  #167 issue discusses the lack of support of Amazon EFA in the upstream JAX container that only has MOFED NIC support, a problem for AWS users.
  • The author, DwarKapex, proposes a script solution that includes removing all ibverbs and RDMA libraries, downloading the Amazon EFA installer, and installing EFA after running the container.
  • However, concerns arise over compatibility between MOFED and Amazon EFA as they both install the libibverbs package, leading to potential conflicts.
  • A separate image or Dockerfile for AWS use is suggested. There are also suggestions to create different container tags for AWS and IB.
  • Multi-stage builds are suggested by andportnoy to create separate images using the same Dockerfile.
  • DwarKapex confirms the decision to build the base image with MOFED support installed and including an extra script for Amazon EFA installation.
  • To ensure the best container usability, it's suggested to align library versions closely with what AWS supports for its Deep Learning containers.
  • Maintaining versions manually and keeping in sync with DL containers is agreed.
  • Yhtang endorses proceeding with the solution for unblocking users but argues for a more systematic approach such as using modulefiles for future robust resolution and mentions the drawbacks of the current solution.
  • DwarKapex is requested to update the PR description before merging, specifying that current changes are temporary and further hardening is needed.

@yhtang yhtang merged commit d7c3fb9 into main Sep 1, 2023
41 checks passed
@yhtang yhtang deleted the add-efa branch September 1, 2023 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants