Commit
Use pip-compile to help with consistent Python dependency resolution (#371)

# Summary

- All Python packages, except for a few build dependencies, are now installed using **pip-tools**.
- The JAX and upstream T5X/PAX containers are now built in a two-stage procedure:
  1. The **'meal kit'** stage: source packages are downloaded and wheels are built if necessary (for TE, tensorflow-text, lingvo, etc.), but **no** package is installed. Instead, manifest files are created in the `/opt/pip-tools.d` folder to tell pip-tools which packages to install. The stage is named for its resemblance to a meal kit, where the ingredients are prepared but the final cooking step is deferred.
  2. The **'final'** (cooking🔥) stage: pip-tools compiles the manifests from the various container layers together and then sync-installs everything to exactly match the resolved versions.
- Note that downstream containers **build on top of the meal kit image of their base container**, so that all packages and dependencies are installed exactly once, avoiding conflicts and image bloat.
- The meal kit and final images are published as
  - mealkit: `ghcr.io/nvidia/image:mealkit` and `ghcr.io/nvidia/image:mealkit-YYYY-MM-DD`
  - final: `ghcr.io/nvidia/image:latest` and `ghcr.io/nvidia/image:nightly-YYYY-MM-DD`

# Additional changes to the workflows

- `/opt/jax-source` is renamed to `/opt/jax`. The `-source` suffix is only added to packages that need compilation, e.g. XLA and TE.
- The CI workflow is now matricized against CPU arch.
- The reusable `_build_*.yaml` workflows are simplified to build only one image for a single architecture at a time. The logic for creating multi-arch images is relocated into the `_publish_container.yaml` workflows and invoked during the nightly runs only.
- TE is now built as a wheel and shipped in the JAX core meal kit image.
- TE unit tests are now run using the upstream-pax image because they depend on praxis.
- Build workflows now produce sitreps following the paradigm of #229.
- Removed the various one-off workflows for pinned CUDA/JAX versions.
- Refactored the PAX arm64 Dockerfile in preparation for #338.

# What remains to be done

- [ ] Update the Rosetta container build + test process to use the upstream T5X/PAX mealkit containers (`ghcr.io/nvidia/upstream-t5x:mealkit`, `ghcr.io/nvidia/upstream-pax:mealkit`)

# Reviewing tips

This PR requires many reviewers due to its size and scope. I'd truly appreciate it if code owners could review any changes related to their previous contributions. An incomplete list of reviewer scopes:

- @terrykong, @ashors1, @sharathts, @maanug-nv: Rosetta, TE, T5X and PAX MGMN tests
- @nouiz: JAX, TE and T5X build
- @joker-eph: PAX arm64 build
- @nluehr: Base image, NCCL, PAX
- @DwarKapex: base/JAX/XLA build, workflow logic

Closes #223
Closes #230
Closes #231
Closes #232
Closes #233
Closes #271
Fixes #328
Fixes #337

Co-authored-by: Terry Kong <terryk@nvidia.com>

---------

Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Vladislav Kozlov <vkozlov@nvidia.com>
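
For readers unfamiliar with pip-tools, the 'final' stage described in the summary can be sketched roughly as below. This is an illustration only, not the code in this commit. It assumes the manifests in `/opt/pip-tools.d` are files ending in `.in` and that the lock file is named `requirements.txt`; both names are hypothetical, as is the helper `build_final_stage_commands`. The only real tools referenced are the `pip-compile` and `pip-sync` commands shipped with pip-tools.

```python
from pathlib import Path


def build_final_stage_commands(manifest_dir, lock_file="requirements.txt"):
    """Return the pip-tools commands a 'final' stage might run.

    Assumptions (not taken from the commit): manifests are files ending
    in `.in`, and the resolved lock file is called `requirements.txt`.
    The commands are only constructed here, not executed.
    """
    # Collect every manifest left behind by the meal kit layers.
    manifests = sorted(str(p) for p in Path(manifest_dir).glob("*.in"))
    if not manifests:
        raise FileNotFoundError(f"no manifests found in {manifest_dir}")

    # pip-compile resolves all manifests together into one pinned lock file.
    compile_cmd = ["pip-compile", "--output-file", lock_file, *manifests]

    # pip-sync makes the environment match the lock file exactly.
    sync_cmd = ["pip-sync", lock_file]

    return [compile_cmd, sync_cmd]
```

Calling `build_final_stage_commands("/opt/pip-tools.d")` inside a meal kit image would return the two argument lists, which could then be passed to `subprocess.run(cmd, check=True)` in order. Resolving all manifests in a single `pip-compile` call is what lets the resolver see every layer's requirements at once, instead of installing them one layer at a time.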