Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a triage job that triggers if pax MGMN or t5x MGMN tests fail #208

Merged
merged 37 commits into from
Sep 26, 2023

Conversation

terrykong
Copy link
Contributor

@terrykong terrykong commented Sep 5, 2023

This is a large addition so the highlights are summarized below:

  • Adds a new _triage.yaml workflow that gets called if T5x MGMN or Pax MGMN tests fail. It relies on a :latest-verified tag as the last known working image.
  • The latest-verified tag is added to the vanilla t5x and pax images if their respective tests pass
  • Adds Dockerfile.ff which creates a new image from a BASE_IMAGE where all (or some) of the dependencies are fast forwarded to the versions in BROKEN_IMAGE. The image layer produced by this is small since no recompilation happens. Thus, this dockerfile is currently only suitable for fast-forwarding libraries with python source only (e.g., t5x, paxml, praxis).
    • Additionally, a ff-summary script is added to help inspect the source offline.
  • To signal failure or success, a new job called outcome was added to _test_t5x.yaml and _test_pax.yaml in order to detect if the overall test was a success
  • Adds some utility functions for workflows in this file: .github/workflows/util.sh; which can be reused on the command line or in other workflows. (Was helpful in debugging the triage workflow)
  • Removes the endpoint artifact upload in _publish_badge.yaml since it was causing collisions on multiple re-runs of the same workflow in a matrix. The artifact isn't really needed anyway since we upload it as a gist and it's printed in the logs.
  • Files a github issue if there are failures:

@nouiz
Copy link
Collaborator

nouiz commented Sep 5, 2023

Should we update some user doc about this?

@terrykong terrykong marked this pull request as draft September 5, 2023 23:46
@terrykong
Copy link
Contributor Author

@terrykong
Copy link
Contributor Author

terrykong commented Sep 6, 2023

Here are two "sandbox" runs to demonstrate what example triaging runs look like:

@terrykong
Copy link
Contributor Author

I've also updated the triage tool to auto file a github issue if there is a failure. Here are two examples:

@terrykong terrykong marked this pull request as ready for review September 12, 2023 06:36
sharathts
sharathts previously approved these changes Sep 12, 2023
@nouiz
Copy link
Collaborator

nouiz commented Sep 12, 2023

I have questions about the created github issues:
Does this create issue only for nightly and not for PR?
If the same failures happens for many days, does it creates new issues? Append info to the existing one?

@nouiz
Copy link
Collaborator

nouiz commented Sep 12, 2023

Also, how does it select who to assign the issue to?

@terrykong
Copy link
Contributor Author

Also, how does it select who to assign the issue to?

Right now it's hard-coded: https://github.com/NVIDIA/JAX-Toolbox/pull/208/files#diff-2ad6ab3e3b9d04131794c79a52e5f18dc271a4bcbf5d6c08694a956e1a48e287R51-R54

Does this create issue only for nightly and not for PR?

It only creates it for the nightly. The pre-sumit CI for PRs will not trigger the triage or the github issue filing.

If the same failures happens for many days, does it creates new issues? Append info to the existing one?

If the failure goes on for many days, it will create new issues for each day it fails. Another option is to have one issue, and then it can be closed if an engineer deems it as "fixed", and then the triaging workflow can re-open it and add a comment with the new "summary table".

I thought creating new issues would be preferred to keep conversations self-contained since each issue may be different; but the tradeoff is there's more housekeeping everyone has to do to make sure our issue page is not cluttered with these bot triages.

@nouiz
Copy link
Collaborator

nouiz commented Sep 12, 2023

SG to me as a start. Clearly better then what we have now.
We could just check each day if the bug is the same and close the extra issue as duplicate.
If we are able to fix them fast enough, it shouldn't be too much overhead.

yhtang
yhtang previously approved these changes Sep 14, 2023
.github/workflows/_sandbox.yaml Outdated Show resolved Hide resolved
PACKAGE=$(echo $IMAGE_REPO | rev | cut -d/ -f1 | rev)
ORG=$(echo $IMAGE_REPO | rev | cut -d/ -f2 | rev)

top_manifest_digest=$(curl -s -H "Authorization: Bearer $(echo $GH_TOKEN | base64)" "https://ghcr.io/v2/$ORG/$PACKAGE/manifests/$TAG" | jq -r .manifests[0].digest)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not work in the future when we start to build multi-arch containers for PAX and T5X, since manifests[] will contain at least two manifests corresponding to the two architectures, respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Do you have an example of an image with mutliple manifests I can use as a reference to update this code to select the right manifest?

Or are you suggesting we just merge in and fix later (create GH issue)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created: #263.

@terrykong terrykong dismissed stale reviews from yhtang and sharathts via 10cb6da September 18, 2023 15:47
@terrykong
Copy link
Contributor Author

The two failures can be ignored:

  • CI / test-pax / outcome: is not caused by this PR, but rather surfaced b/c of this PR. So it's result should not block this PR
  • CI / build-rosetta-t5x / build: is known to be broken on main b/c the TE PR in t5x has a merge conflict which we are resolving as we speak.

@yhtang yhtang merged commit 04f240b into main Sep 26, 2023
44 of 46 checks passed
@yhtang yhtang deleted the triage-tool branch September 26, 2023 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

4 participants