Add GPU CI/CD (#253)
* add yaml files to gh workflows

* edit spacing

* no cache dir

* cmake

* fasttext wheel

* python3 dev

* get update

* c installs

* setuptools pip upgrade

* use stable rapids

* remove wheel see what happens

* edit readme and remove autolabel for now

* add container logic

* add dockerfile and oliver's other suggestions

* fix run format

* forked repo url

* docker run with all gpus

* remove running container

* Update .github/workflows/gpuci.yml

* re add test

* debug attempt

* remove it

* add library path

* remove nvcc check

* more debugging

* specify curator dir

* more debugging

* try pytorch container

* use rapids container

* fix RUN instructions

* add comments and review suggestions

* update runners

* move args

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
sarahyurick and ko3n1g authored Oct 9, 2024
1 parent cf47c9f commit 0bbdc06
Showing 5 changed files with 208 additions and 4 deletions.
99 changes: 99 additions & 0 deletions .github/workflows/_build_container.yml
@@ -0,0 +1,99 @@
name: Build NeMo Curator container
on:
# This script is called by "gpuci.yml"
# We specify a Git reference to checkout, defaulting to the SHA of the commit that triggered the workflow
workflow_call:
inputs:
ref:
description: Git ref to checkout
default: ${{ github.sha }}
required: false
type: string

defaults:
# Sets default options for executing shell commands in the workflow
# `-x` enables debugging output
# `-e` ensures that the workflow fails fast on errors
# `-u` treats unset variables as errors
# `-o pipefail` ensures that any failures in a pipeline are detected
run:
shell: bash -x -e -u -o pipefail {0}

jobs:
main:
# This is the tag on our Azure runner found in Actions -> Runners -> Self-hosted runners
# It is designated for build jobs
runs-on: self-hosted-azure-builder
steps:
# Checks out the repository code using the actions/checkout action,
# storing it in a directory named after the unique workflow run ID
# It checks out the specific commit or branch based on the `ref` input provided when the workflow is called
- name: Checkout repository
uses: actions/checkout@v4
with:
path: ${{ github.run_id }}
ref: ${{ inputs.ref }}

# Cleans up unused Docker resources that haven't been used in the last 24 hours
- name: Clean runner cache
run: |
docker system prune --filter "until=24h" --force
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
# We use `docker` driver as this speeds things up for
# trivial (non-multi-stage) builds.
driver: docker

# Pull cached Docker images from a specified Azure Container Registry
# It first attempts to pull an image tagged with the current PR number (if available), falling back to the buildcache tag
# It then pulls the buildcache image regardless of the outcome of the previous command
# The `|| true` lets the workflow continue even if one or both pulls fail
# (e.g., when no cache image has been pushed yet), so the build proceeds without interruption
- name: Pull cache images
run: |
docker pull nemoci.azurecr.io/nemo_curator_container:${{ github.event.pull_request.number || 'buildcache' }} || true
docker pull nemoci.azurecr.io/nemo_curator_container:buildcache || true
- name: Build and push
uses: docker/build-push-action@v5
with:
# Specifies the path to the Dockerfile to use for building the Docker image (located in the root of the repository)
file: Dockerfile
# The built image should be pushed to the container registry after it is successfully built
push: true
# Specifies build arguments that can be passed into the Dockerfile
# `FORKED_REPO_URL` is the URL to the user's forked repository
# `CURATOR_COMMIT` is the PR's head SHA if available; otherwise, it falls back to the current commit SHA
build-args: |
FORKED_REPO_URL=https://github.com/${{ github.event.pull_request.head.repo.full_name }}.git
CURATOR_COMMIT=${{ github.event.pull_request.head.sha || github.sha }}
# Specifies the images to use as cache sources during the build process
cache-from: |
nemoci.azurecr.io/nemo_curator_container:${{ github.event.pull_request.number || 'buildcache' }}
nemoci.azurecr.io/nemo_curator_container:buildcache
# Inline caching allows the cache to be available for future builds without needing to push it to a separate repository
cache-to: type=inline
# Specifies the tag under which the built image will be pushed to the container registry
# Uses the "github.run_id" to ensure that each build has a unique tag
tags: nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }}

# Updates the Docker image associated with a PR by tagging the built image with the PR number
# and then pushing that tagged image to the Azure Container Registry
- name: Update PR image
if: github.event_name == 'pull_request'
run: |
docker tag nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} nemoci.azurecr.io/nemo_curator_container:${{ github.event.pull_request.number }}
docker push nemoci.azurecr.io/nemo_curator_container:${{ github.event.pull_request.number }}
- name: Update buildcache image
# Only executes when there is a push to the main branch
# Ensures that the build cache is updated only for stable versions of the codebase
if: github.ref == 'refs/heads/main'
# Updates the Docker image tagged as the build cache by:
# 1. Tagging the built image from the current workflow run with the buildcache tag, and
# 2. Pushing that tagged image to the Azure Container Registry
run: |
docker tag nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} nemoci.azurecr.io/nemo_curator_container:buildcache
docker push nemoci.azurecr.io/nemo_curator_container:buildcache
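The default shell options used throughout these workflows (`-e`, `-u`, `-o pipefail`) and the `|| true` escape hatch in the cache-pull step interact in a way worth seeing in isolation. A minimal sketch in plain bash, with no Docker involved:

```shell
# Without pipefail, a pipeline's exit status is that of the LAST command,
# so a failing producer goes unnoticed:
bash -c 'false | true'; echo "plain exit: $?"                  # prints "plain exit: 0"

# With pipefail, any failing stage fails the whole pipeline:
bash -o pipefail -c 'false | true'; echo "pipefail exit: $?"   # prints "pipefail exit: 1"

# Under -e, a failing command aborts the script, so best-effort steps
# (like pulling a cache image that may not exist yet) append `|| true`:
bash -e -c 'false || true; echo "still running"'               # prints "still running"

# With -u, referencing an unset variable is an error instead of expanding to "":
bash -u -c 'echo "$NOT_SET"' || echo "unset variable caught"
```

This is why a missing PR-numbered cache image does not fail the "Pull cache images" step even though every run command fails fast by default.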
72 changes: 72 additions & 0 deletions .github/workflows/gpuci.yml
@@ -0,0 +1,72 @@
name: "GPU CI/CD"

on:
pull_request:
branches:
# We can run gpuCI on any PR targeting these branches
- 'main'
- '[rv][0-9].[0-9].[0-9]'
- '[rv][0-9].[0-9].[0-9]rc[0-9]'
# PR has to be labeled with "gpuCI" label
# If new commits are added, the "gpuCI" label has to be removed and re-added to rerun gpuCI
types: [ labeled ]

jobs:
# First, we build and push a NeMo-Curator container
build-container:
# "build-container" job is run if the "gpuci" label is added to the PR
if: ${{ github.event.label.name == 'gpuci' }}
uses: ./.github/workflows/_build_container.yml

# Then, we run our PyTests in the container we just built
run-gpu-tests:
needs: build-container
# This is the tag on our Azure runner found in Actions -> Runners -> Self-hosted runners
# It has 2 A100 GPUs
runs-on: self-hosted-azure
# "run-gpu-tests" job is run if the "gpuci" label is added to the PR
if: ${{ github.event.label.name == 'gpuci' }}

steps:
# If something went wrong during the last cleanup, this step ensures any existing container is removed
- name: Remove existing container if it exists
run: |
if [ "$(docker ps -aq -f name=nemo-curator-container)" ]; then
docker rm -f nemo-curator-container
fi
# Run the container pushed by "build-container"; we name it "nemo-curator-container"
# `--gpus all` ensures that all of the GPUs from our self-hosted-azure runner are available in the container
# We use "github.run_id" to identify the PR with the commits we want to run the PyTests with
# `bash -c "sleep infinity"` keeps the container running indefinitely without exiting
- name: Run Docker container
run: |
docker run --gpus all --name nemo-curator-container -d nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} bash -c "sleep infinity"
# Expect `whoami` to be "azureuser"
# Expect `nvidia-smi` to show our 2 A100 GPUs
- name: Check GPUs
run: |
whoami
docker exec nemo-curator-container nvidia-smi
# In the virtual environment (called "curator") we created in the container,
# list all of our packages. Useful for debugging
- name: Verify installations
run: |
docker exec nemo-curator-container conda run -n curator pip list
# In the virtual environment (called "curator") we created in the container,
# run our PyTests marked with `@pytest.mark.gpu`
# We specify the `rootdir` to help locate the "pyproject.toml" file (which is in the root directory of the repository),
# and then the directory where the PyTests are located
- name: Run PyTests with GPU mark
run: |
docker exec nemo-curator-container conda run -n curator pytest -m gpu --rootdir /opt/NeMo-Curator /opt/NeMo-Curator/tests
# After running `docker stop`, the container remains in an exited state
# It is still present on our system and could be restarted with `docker start`
# Thus, we use `docker rm` to permanently remove it from the system
- name: Cleanup
run: |
docker stop nemo-curator-container && docker rm nemo-curator-container
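For context, the `-m gpu` filter in the test step selects only tests carrying a `gpu` marker. A sketch of what such a test looks like; the test body and the `importorskip` call are illustrative, not taken from NeMo-Curator's test suite:

```python
import pytest

# Tests selected by `pytest -m gpu` carry this marker. Custom markers are
# typically declared in pyproject.toml under [tool.pytest.ini_options],
# which is one reason the workflow points --rootdir at the repository root.
@pytest.mark.gpu
def test_gpu_roundtrip():
    cudf = pytest.importorskip("cudf")  # skip cleanly on CPU-only machines
    df = cudf.DataFrame({"x": [1, 2, 3]})
    assert df["x"].sum() == 6
```

Unmarked tests are excluded by the `-m gpu` expression, so this workflow exercises only the GPU-dependent code paths.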
33 changes: 33 additions & 0 deletions Dockerfile
@@ -0,0 +1,33 @@
# See https://github.com/rapidsai/ci-imgs for ARG options
# NeMo Curator requires Python 3.10, Ubuntu 22.04/20.04, and CUDA 12 (or above)
ARG CUDA_VER=12.5.1
ARG LINUX_VER=ubuntu22.04
ARG PYTHON_VER=3.10
FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER}

WORKDIR /opt

# Install the minimal libcu* libraries needed by NeMo Curator
RUN conda create -y --name curator -c conda-forge -c nvidia \
python=3.10 \
cuda-cudart \
libcufft \
libcublas \
libcurand \
libcusparse \
libcusolver

# Build arguments used to clone the forked repository and check out its changes
ARG FORKED_REPO_URL
ARG CURATOR_COMMIT

# Clone the user's repository, find the relevant commit, and install everything we need
RUN bash -exu <<EOF
git clone $FORKED_REPO_URL
cd NeMo-Curator
git fetch origin $CURATOR_COMMIT --depth=1
git checkout $CURATOR_COMMIT
source activate curator
pip install --upgrade cython pytest pip
pip install --extra-index-url https://pypi.nvidia.com ".[all]"
EOF
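The RUN instruction above feeds the entire heredoc to a single `bash -exu` process, so state such as the working directory (from `cd`) and the activated environment persists across lines, unlike separate RUN instructions. The same pattern outside Docker, with placeholder commands (`-x` tracing goes to stderr, so stdout stays clean):

```shell
bash -exu <<'EOF'
mkdir -p /tmp/heredoc-demo
cd /tmp/heredoc-demo            # persists for the rest of the heredoc
echo "pwd is $(pwd)"            # prints "pwd is /tmp/heredoc-demo"
EOF
```

Quoting the delimiter (`<<'EOF'`) keeps the outer shell from expanding `$FORKED_REPO_URL`-style variables early; the inner bash receives the script verbatim.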
6 changes: 3 additions & 3 deletions README.md
@@ -131,9 +131,9 @@ pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
```

#### Using Nightly Dependencies for Rapids
#### Using Nightly Dependencies for RAPIDS

You can also install NeMo Curator using the Rapids nightly, to do so you can set the environment variable `RAPIDS_NIGHTLY=1`.
You can also install NeMo Curator using the [RAPIDS Nightly Builds](https://docs.rapids.ai/install). To do so, you can set the environment variable `RAPIDS_NIGHTLY=1`.


```bash
@@ -144,7 +144,7 @@ RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsa
RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple ".[cuda12x]"
```

When the environment variable set to 0 or not set (default behavior) it'll use the stable version of Rapids.
When the `RAPIDS_NIGHTLY` variable is set to 0 (which is the default), it will use the stable version of RAPIDS.
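The toggle described here follows a common shell pattern: treat an unset variable the same as `0` via a default expansion. A sketch of the check (illustrative only; the actual index selection happens inside NeMo Curator's build setup):

```shell
# ${RAPIDS_NIGHTLY:-0} expands to "0" when the variable is unset,
# so both "unset" and RAPIDS_NIGHTLY=0 select the stable packages.
if [ "${RAPIDS_NIGHTLY:-0}" = "1" ]; then
  echo "using RAPIDS nightly wheels index"
else
  echo "using stable RAPIDS"
fi
```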

#### From the NeMo Framework Container

2 changes: 1 addition & 1 deletion nemo_curator/utils/import_utils.py
@@ -346,7 +346,7 @@ def gpu_only_import(module, *, alt=None):

return safe_import(
module,
msg=f"{module} is not enabled in non GPU-enabled installations or environemnts. {GPU_INSTALL_STRING}",
msg=f"{module} is not enabled in non GPU-enabled installations or environments. {GPU_INSTALL_STRING}",
alt=alt,
)
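The diff above only fixes a typo in the error message, but for context, `gpu_only_import` follows a lazy-import pattern roughly like the sketch below. The names and structure here are a simplified illustration, not NeMo-Curator's actual implementation:

```python
import importlib

class _MissingModule:
    """Placeholder returned when a GPU-only module is unavailable."""
    def __init__(self, msg):
        self._msg = msg
    def __getattr__(self, name):
        # Defer the failure until the module is actually used,
        # so CPU-only code paths keep working.
        raise ImportError(self._msg)

def gpu_only_import_sketch(module, install_hint="Install the GPU extras."):
    try:
        return importlib.import_module(module)
    except ImportError:
        return _MissingModule(
            f"{module} is not enabled in non GPU-enabled installations "
            f"or environments. {install_hint}"
        )
```

This lets a package import optional GPU dependencies at module scope without crashing on CPU-only machines; the helpful error surfaces only on first use.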

