Adds a triage job that triggers if pax MGMN or t5x MGMN tests fail #208
to triage dependency issues
Should we update some user doc about this?
@nouiz Added a user doc here: https://github.com/NVIDIA/JAX-Toolbox/blob/triage-tool/docs/triage.md
Here are two "sandbox" runs to demonstrate what example triaging runs look like:
I've also updated the triage tool to auto-file a GitHub issue if there is a failure. Here are two examples:
I have questions about the created GitHub issues:
Also, how does it select who to assign the issue to?
Right now it's hard-coded: https://github.com/NVIDIA/JAX-Toolbox/pull/208/files#diff-2ad6ab3e3b9d04131794c79a52e5f18dc271a4bcbf5d6c08694a956e1a48e287R51-R54
It only creates an issue for the nightly run. The pre-submit CI for PRs will not trigger the triage or the GitHub issue filing.
If the failure goes on for many days, it will create a new issue for each day it fails. Another option is to have a single issue that an engineer closes once it is deemed "fixed"; the triaging workflow can then re-open it and add a comment with the new "summary table". I thought creating new issues would be preferable, since it keeps conversations self-contained and each failure may be different; the tradeoff is more housekeeping for everyone to keep our issue page from being cluttered with these bot triages.
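The single-issue alternative described above could be sketched roughly as follows. This is only an illustration, not what the PR implements: the helper name `decide_issue_action`, the title-based lookup, and the variables `TITLE`, `SUMMARY`, and `NUMBER` are all hypothetical, and the GitHub CLI wiring is shown in comments only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick an action for the triage issue based on the state
# of an existing issue with the same title.
#   ""       -> no issue exists yet            -> create
#   "OPEN"   -> failure is already tracked     -> comment
#   "CLOSED" -> was deemed fixed, failed again -> reopen
decide_issue_action() {
  case "$1" in
    "")     echo create ;;
    OPEN)   echo comment ;;
    CLOSED) echo reopen ;;
    *)      echo "unknown issue state: $1" >&2; return 1 ;;
  esac
}

# Possible wiring with the GitHub CLI (an assumption, not part of this PR):
#   state=$(gh issue list --state all --search "$TITLE in:title" \
#             --json state --jq '.[0].state // ""')
#   case "$(decide_issue_action "$state")" in
#     create)  gh issue create --title "$TITLE" --body "$SUMMARY" ;;
#     comment) gh issue comment "$NUMBER" --body "$SUMMARY" ;;
#     reopen)  gh issue reopen "$NUMBER" && gh issue comment "$NUMBER" --body "$SUMMARY" ;;
#   esac
```

Keeping the decision in a small pure function makes it easy to check on the command line without touching real issues.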
SG to me as a start. Clearly better than what we have now.
PACKAGE=$(echo $IMAGE_REPO | rev | cut -d/ -f1 | rev)
ORG=$(echo $IMAGE_REPO | rev | cut -d/ -f2 | rev)

top_manifest_digest=$(curl -s -H "Authorization: Bearer $(echo $GH_TOKEN | base64)" "https://ghcr.io/v2/$ORG/$PACKAGE/manifests/$TAG" | jq -r .manifests[0].digest)
This might not work in the future when we start to build multi-arch containers for PAX and T5X, since manifests[] will contain at least two manifests, one for each architecture.
Gotcha. Do you have an example of an image with multiple manifests that I can use as a reference to update this code to select the right manifest?
Or are you suggesting we just merge this in and fix it later (and create a GH issue)?
Issue created: #263.
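For reference, one way the multi-arch concern raised in this thread might be handled is to select the manifest by platform instead of by position. The sketch below assumes the registry returns an image index whose entries carry `platform.architecture` and `platform.os` fields; the function name `select_manifest_digest` and the sample index with its digests are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an image index (manifest list) on stdin and print
# the digest of the first entry matching the requested platform.
# Defaults (amd64/linux) are an assumption about the target platform.
select_manifest_digest() {
  local arch="${1:-amd64}" os="${2:-linux}"
  jq -r --arg arch "$arch" --arg os "$os" '
    first(.manifests[]
          | select(.platform.architecture == $arch and .platform.os == $os)
          | .digest)'
}

# Made-up two-architecture index, used only to exercise the function.
sample_index='{
  "manifests": [
    {"digest": "sha256:aaaa", "platform": {"architecture": "amd64", "os": "linux"}},
    {"digest": "sha256:bbbb", "platform": {"architecture": "arm64", "os": "linux"}}
  ]
}'

# Example:
#   echo "$sample_index" | select_manifest_digest arm64
# In the workflow, the curl output would be piped in instead of the sample.
```

If no entry matches, the function prints nothing, so a caller can treat an empty digest as an error rather than silently picking the wrong architecture.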
The two failures can be ignored:
This is a large addition so the highlights are summarized below:
- `_triage.yaml` workflow that gets called if T5x MGMN or Pax MGMN tests fail. It relies on a `:latest-verified` tag as the last known working image.
- `latest-verified` tag is added to the vanilla t5x and pax images if their respective tests pass.
- `Dockerfile.ff`, which creates a new image from a `BASE_IMAGE` where all (or some) of the dependencies are fast-forwarded to the versions in `BROKEN_IMAGE`. The image layer produced by this is small since no recompilation happens. Thus, this Dockerfile is currently only suitable for fast-forwarding libraries with Python source only (e.g., t5x, paxml, praxis).
- `ff-summary` script is added to help inspect the source offline.
- `outcome` was added to _test_t5x.yaml and _test_pax.yaml in order to detect if the overall test was a success.
- `.github/workflows/util.sh`, which can be reused on the command line or in other workflows. (Was helpful in debugging the triage workflow.)
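The interplay between the test `outcome`, the `latest-verified` tag, and the triage workflow summarized above might be pictured with the small sketch below. It only prints the step that would follow; the function name `next_step` and the example repository and tag are hypothetical, and the real workflows may structure this differently.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: given the overall test outcome, an image
# repository, and the tag that was tested, print which step would follow.
next_step() {
  local outcome="$1" repo="$2" tag="$3"
  if [ "$outcome" = "success" ]; then
    # Tests passed: this image becomes the new last known working image.
    echo "tag ${repo}:${tag} as ${repo}:latest-verified"
  else
    # Tests did not pass: compare against the last known working image.
    echo "triage ${repo}:${tag} against ${repo}:latest-verified"
  fi
}

# Example (repository and tag are placeholders):
#   next_step failure ghcr.io/example/t5x nightly
```

Treating anything other than `success` as a reason to triage is a deliberate simplification in this sketch.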