Add support of Amazon EFA #179
Do we have support for all the packages required to run on AWS? Here's a reference for customizing an NGC image: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/ubuntu18.04/Dockerfile-cu11-pt20.10.sagemaker |
Not sure, need to test. The NGC image that you suggested uses efa-installer from Amazon. That was my initial approach, but I found a conflict between efa-installer and MOFED: both install the libibverbs package. Interestingly, the MOFED implementation of libibverbs does not contain libefa*. So if the installation order is MOFED -> efa-installer, the MOFED implementation is overwritten; the other way around (efa-installer -> MOFED), libefa* is unavailable. @yhtang what is so important in MOFED? Can we use the Amazon implementation instead? |
I think we should have a separate image or static Dockerfile example for AWS. I remember seeing that some of our proprietary MOFED bits conflict with Amazon's so it's been difficult to ship both in NGC.
|
What is our solution for NGC containers? Can we detect that we are on AWS when we launch the container? But even with a different container, some users will take the wrong one, so if we can warn, it would be great. |
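One way the launch-time check asked about here could look, as a minimal sketch rather than anything the PR implements: the function names are hypothetical, the DMI vendor file is the usual way to recognize a Nitro-based EC2 instance (it is assumed to be readable inside the container), and the default library directory is an assumption about where libefa lands on an Ubuntu-based image.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the PR.

# Succeeds if the DMI vendor string looks like an EC2 instance.
# $1: vendor file (default: sysfs path, reads "Amazon EC2" on Nitro instances).
is_aws() {
  [ -r "${1:-/sys/devices/virtual/dmi/id/sys_vendor}" ] &&
    grep -qi 'amazon' "${1:-/sys/devices/virtual/dmi/id/sys_vendor}"
}

# Succeeds if a libefa shared object exists in the given directory.
# $1: library directory (default is an assumption for Ubuntu x86_64 images).
has_libefa() {
  for efa_lib in "${1:-/usr/lib/x86_64-linux-gnu}"/libefa*.so*; do
    [ -e "$efa_lib" ] && return 0
  done
  return 1
}

# Prints a warning and returns 1 when running on AWS without libefa.
# $1: vendor file, $2: library directory (both optional).
warn_if_efa_missing() {
  if is_aws "$1" && ! has_libefa "$2"; then
    echo "WARNING: AWS instance detected but libefa was not found; run install-efa.sh" >&2
    return 1
  fi
  return 0
}
```

Calling `warn_if_efa_missing` from the container entrypoint would only warn; it would not stop a user from picking the wrong image.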
Can we add .network in the image name? i.e., pax.IB.py3 and pax.EFA.py3 or something similar? |
Nitpick: it would be on the jax container, so it will propagate to all containers: {jax,t5x,pax}.IB.py3 and {jax,t5x,pax}.EFA.py3 |
sgtm |
I don't think we need to call out IB, it's included by default in NGC containers and doesn't need a separate tag. Could we keep the extra tag just for efa? |
Use multi-stage builds like we do already for the JAX container? Then we could build two separate images using the same Dockerfile:
docker build -f Dockerfile.base --target generic --tag base-generic .
docker build -f Dockerfile.base --target aws --tag base-aws . |
As a workaround, the strategy is the following:
1. Build the base image with MOFED support installed.
2. Add an extra script to the image (/usr/local/bin/install-efa.sh) to be able to install Amazon EFA on request.
Make changes per this morning's discussion:
@yhtang the module cmd is an interesting approach, but IMHO it is overkill in this case. How do we add a release note on how to use it? |
For the how-to-use part, we should try to keep the library versions as close as possible to what AWS supports for its DL containers: https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.13/py3/cu118/Dockerfile.gpu |
I think they manually increment the version of the Amazon EFA installer. I don't know if it is a good idea, but if it really matters I can keep the DL containers' version and ours in sync, not a big deal (@sbhavani what do you think?). As a reference: they use version 1.24.0, we use
This is at least good for unblocking people that need EFA, so I'd say let's go ahead and merge it. In the long run, I still believe a more systematic approach, such as the
|
Before merging, @DwarKapex could you please update the PR description so that it documents the approach taken, and that it is a stopgap solution to unblock users, while further work is still needed to harden it? |
Keeping the versions in sync manually is fine, I don't think it changes often (except for new hardware bring up like p5). Thanks for adding this! |
Issue addressed: #167
The upstream JAX container contains only MOFED NIC support. The MOFED package from Mellanox that we use installs *ibverbs* libraries that do not contain the libefa*.so files required on AWS. A temporary solution is to ship a script in the base container (/usr/local/bin/install-efa.sh) that AWS users can run inside the container to handle this issue. The script does the following:
1. Removes all *ibverbs* and RDMA-related libraries
2. Downloads the Amazon EFA installer
3. Installs EFA
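The three steps could be sketched roughly as follows. This is not the PR's install-efa.sh: the package patterns, the apt-based removal (assuming an Ubuntu base image), the /tmp paths, and the choice of installer flags are all assumptions for illustration, and 1.24.0 is only the version the thread attributes to the AWS DL containers. The sketch defaults to a dry run that prints each command instead of executing it.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the script shipped by the PR.
# Package patterns, paths, default version, and flags are assumptions.

EFA_VERSION="${EFA_VERSION:-1.24.0}"  # version the thread attributes to AWS DL containers
DRY_RUN="${DRY_RUN:-1}"               # 1: print commands only; 0: execute them

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_efa() {
  tarball="aws-efa-installer-${EFA_VERSION}.tar.gz"

  # 1. Remove the ibverbs/RDMA packages installed by MOFED (assumed patterns).
  run apt-get remove -y --purge 'libibverbs*' 'librdmacm*' 'ibverbs-*' rdma-core

  # 2. Download and unpack the Amazon EFA installer.
  run curl -fsSL -o "/tmp/${tarball}" "https://efa-installer.amazonaws.com/${tarball}"
  run tar -xzf "/tmp/${tarball}" -C /tmp

  # 3. Install EFA; skipping the kernel module is typical inside a container,
  #    where the host provides the driver.
  run sh -c 'cd /tmp/aws-efa-installer && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify'
}
```

Calling `install_efa` prints the four planned commands; with `DRY_RUN=0` set it would execute them, which requires root inside the container.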
…e/builder/#environment-replacement) to ensure appending to environment variables
/assistant summarize the key takeaways of the discussion using concise bullet points |
|
Address issue: EFA Support #167
How to use: in the running container, run the install-efa.sh script:
root@<container-id> $> install-efa.sh