Use pip-compile to help with consistent Python dependency resolution #371

Merged on Nov 21, 2023 (151 commits)

@yhtang (Collaborator) commented Nov 10, 2023

Summary

  • All Python packages, except for a few build dependencies, are now installed using pip-tools.
  • The JAX and upstream T5X/PAX containers are now built in a two-stage procedure:
    1. The 'meal kit' stage: source packages are downloaded and wheels are built where necessary (for TE, tensorflow-text, lingvo, etc.), but no package is installed. Instead, manifest files are created in the /opt/pip-tools.d folder to tell pip-tools which packages to install. The stage is so named because it resembles a meal kit: the ingredients are prepared, but the final cooking step is deferred.
    2. The 'final' (cooking🔥) stage: pip-tools compiles the manifests from the various container layers together and then sync-installs everything to exactly match the resolved versions.
  • Downstream containers build on top of the meal kit image of their base container, which ensures that every package and dependency is installed exactly once and avoids conflicts and image bloat.
  • The meal kit and final images are published as
    • mealkit: ghcr.io/nvidia/image:mealkit and ghcr.io/nvidia/image:mealkit-YYYY-MM-DD
    • final: ghcr.io/nvidia/image:latest and ghcr.io/nvidia/image:nightly-YYYY-MM-DD
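
To illustrate the 'final' stage, here is a minimal Python sketch of how the manifests might be gathered and handed to pip-tools. It is not the code from this PR: the `*.in` file pattern, the lockfile location, and the helper names `collect_manifests` and `build_commands` are assumptions made for illustration. Only the /opt/pip-tools.d folder and the compile-then-sync order come from the description above. The sketch builds the `pip-compile` and `pip-sync` command lines without executing them.

```python
from pathlib import Path

# Folder named in the PR description; everything else below is hypothetical.
MANIFEST_DIR = Path("/opt/pip-tools.d")


def collect_manifests(manifest_dir: Path, pattern: str = "*.in") -> list[Path]:
    """Return the manifest files in a stable (sorted) order.

    The "*.in" pattern is an assumption; the PR does not state how the
    manifest files are named.
    """
    return sorted(p for p in manifest_dir.glob(pattern) if p.is_file())


def build_commands(manifests: list[Path], lockfile: Path) -> list[list[str]]:
    """Build the compile and sync command lines without running them."""
    if not manifests:
        raise ValueError("no manifests found; nothing to compile")
    # Compile all manifests together so one resolution covers every layer.
    compile_cmd = [
        "pip-compile",
        "--output-file",
        str(lockfile),
        *(str(m) for m in manifests),
    ]
    # Sync the environment so it exactly matches the resolved versions.
    sync_cmd = ["pip-sync", str(lockfile)]
    return [compile_cmd, sync_cmd]
```

A caller could pass each returned list to `subprocess.run(cmd, check=True)` in order. Compiling all manifests in a single `pip-compile` call is what lets the resolver see the requirements of every layer at once.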

Additional changes to the workflows

  • /opt/jax-source is renamed to /opt/jax. The -source suffix is only added to packages that need compilation, e.g. XLA and TE.
  • The CI workflow now runs as a matrix over CPU architectures.
  • The reusable _build_*.yaml workflows are simplified to build only one image for a single architecture at a time. The logic for creating multi-arch images is relocated into the _publish_container.yaml workflows and invoked during the nightly runs only.
  • TE is now built as a wheel and shipped in the JAX core meal kit image.
  • TE unit tests will be performed using the upstream-pax image due to the dependency on praxis.
  • Build workflows now produce sitreps following the paradigm of Migrate to the sitrep system #229.
  • Removed the various one-off workflows for pinned CUDA/JAX versions.
  • Refactored the PAX arm64 Dockerfile in preparation for merge x86 and ARM for pax #338
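
The split between single-architecture builds and a separate multi-arch publishing step can be sketched as follows. This is an assumption-laden illustration, not the logic in _publish_container.yaml: the per-architecture tag names and the helper `multiarch_command` are hypothetical, and the workflow may use a different tool. The sketch only assembles a `docker buildx imagetools create` command line that combines already-pushed per-architecture images under one tag.

```python
def multiarch_command(target: str, per_arch_images: dict[str, str]) -> list[str]:
    """Build a command that merges per-arch images into one multi-arch tag.

    `per_arch_images` maps an architecture name (e.g. "amd64") to the image
    reference built for it. The references used by the real workflows are
    not given in the PR, so any values passed here are placeholders.
    """
    if len(per_arch_images) < 2:
        raise ValueError("a multi-arch image needs at least two source images")
    # Sort by architecture name so the command is deterministic.
    sources = [per_arch_images[arch] for arch in sorted(per_arch_images)]
    return ["docker", "buildx", "imagetools", "create", "--tag", target, *sources]
```

Keeping this step out of the build workflows means each build job handles exactly one architecture, and only the nightly publishing job needs to know about all of them.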

What remains to be done

  • Update the Rosetta container build + test process to use the upstream T5X/PAX mealkit (ghcr.io/nvidia/upstream-t5x:mealkit, ghcr.io/nvidia/upstream-pax:mealkit) containers
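
The tag scheme listed in the summary (a floating tag plus a dated tag for each of the meal kit and final images) can be written down as a small Python helper. The function name `image_tags` and its parameters are hypothetical; only the tag patterns and the ghcr.io/nvidia prefix are taken from the description.

```python
from datetime import date


def image_tags(image: str, build_date: date, registry: str = "ghcr.io/nvidia") -> dict[str, list[str]]:
    """Return the floating and dated tags for the meal kit and final images."""
    stamp = build_date.isoformat()  # YYYY-MM-DD
    base = f"{registry}/{image}"
    return {
        "mealkit": [f"{base}:mealkit", f"{base}:mealkit-{stamp}"],
        "final": [f"{base}:latest", f"{base}:nightly-{stamp}"],
    }
```

For example, `image_tags("upstream-t5x", date(2023, 11, 21))["mealkit"][0]` evaluates to `ghcr.io/nvidia/upstream-t5x:mealkit`, the reference a downstream build such as Rosetta would use as its base.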

Reviewing tips

This PR needs several reviewers because of its size and scope. I'd truly appreciate it if code owners reviewed any changes related to their previous contributions. An incomplete list of reviewer scopes is:

Closes #223
Closes #230
Closes #231
Closes #232
Closes #233
Closes #271
Fixes #328
Fixes #337

Co-authored-by: Terry Kong <terryk@nvidia.com>

@terrykong (Contributor) commented

Created issue to track the VCS installs: #384 @DwarKapex and @yhtang

rosetta/Dockerfile.t5x (outdated, resolved)
rosetta/Dockerfile.pax (outdated, resolved)
rosetta/Dockerfile.t5x (outdated, resolved)
@DwarKapex DwarKapex merged commit ca0b396 into main Nov 21, 2023
84 of 86 checks passed
@DwarKapex DwarKapex deleted the add-pip-compile branch November 21, 2023 06:41
terrykong pushed a commit that referenced this pull request Dec 1, 2023
5 participants