Adds a triage job that triggers if pax MGMN or t5x MGMN tests fail #208

terrykong · 2023-09-05T21:51:53Z

This is a large addition so the highlights are summarized below:

- Adds a new _triage.yaml workflow that gets called if T5x MGMN or Pax MGMN tests fail. It relies on a :latest-verified tag as the last known working image.
- The latest-verified tag is added to the vanilla t5x and pax images if their respective tests pass.
- Adds Dockerfile.ff, which creates a new image from a BASE_IMAGE where all (or some) of the dependencies are fast-forwarded to the versions in BROKEN_IMAGE. The image layer produced by this is small since no recompilation happens. Thus, this Dockerfile is currently only suitable for fast-forwarding libraries with Python source only (e.g., t5x, paxml, praxis).
  - Additionally, a ff-summary script is added to help inspect the source offline.
- To signal failure or success, a new job called outcome was added to _test_t5x.yaml and _test_pax.yaml to detect whether the overall test succeeded.
- Adds some utility functions for workflows in .github/workflows/util.sh, which can be reused on the command line or in other workflows. (This was helpful in debugging the triage workflow.)
- Removes the endpoint artifact upload in _publish_badge.yaml since it was causing collisions on multiple re-runs of the same workflow in a matrix. The artifact isn't really needed anyway since we upload it as a gist and it's printed in the logs.
- Files a GitHub issue if there are failures:
  - For pax: [Bot] pax test failures on 2023-07-08 #215
  - For t5x: [Bot] t5x test failures on 2023-07-20 #218
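
The outcome job and the latest-verified tagging described above can be pictured with a small sketch. This is not the code from _test_t5x.yaml or _test_pax.yaml; the function names `overall_outcome` and `next_step` are hypothetical, and treating a skipped or cancelled job as a failure is an assumption made for illustration.

```shell
# Hypothetical helper: collapse per-job results into one overall outcome.
# Inputs are GitHub Actions job results: success, failure, cancelled, skipped.
# Assumption: anything other than "success" counts as a failure.
overall_outcome() {
  # No results at all is treated as a failure rather than a silent pass.
  if [ "$#" -eq 0 ]; then
    echo failure
    return
  fi
  for result in "$@"; do
    if [ "$result" != "success" ]; then
      echo failure
      return
    fi
  done
  echo success
}

# Hypothetical helper: map the overall outcome to the follow-up action.
# success -> the image can be tagged latest-verified
# failure -> the triage workflow should run
next_step() {
  case "$1" in
    success) echo "tag-latest-verified" ;;
    *)       echo "run-triage" ;;
  esac
}

next_step "$(overall_outcome success success failure)"
```

Keeping the aggregation in one place means the tagging step and the triage trigger both read the same single value instead of re-deriving it from individual job results.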

nouiz · 2023-09-05T23:01:44Z

Should we update some user doc about this?

terrykong · 2023-09-06T00:23:46Z

@nouiz Added a user doc here: https://github.com/NVIDIA/JAX-Toolbox/blob/triage-tool/docs/triage.md

terrykong · 2023-09-06T00:27:10Z

Here are two "sandbox" runs to demonstrate what example triaging runs look like:

terrykong · 2023-09-12T06:35:06Z

I've also updated the triage tool to auto file a github issue if there is a failure. Here are two examples:

For pax: [Bot] pax test failures on 2023-07-08 #215
For t5x: [Bot] t5x test failures on 2023-07-20 #218

nouiz · 2023-09-12T18:35:05Z

I have questions about the created github issues:
Does this create issues only for nightly runs and not for PRs?
If the same failure happens for many days, does it create new issues? Or append info to the existing one?

nouiz · 2023-09-12T18:35:21Z

Also, how does it select who to assign the issue to?

terrykong · 2023-09-12T18:44:36Z

> Also, how does it select who to assign the issue to?

Right now it's hard-coded: https://github.com/NVIDIA/JAX-Toolbox/pull/208/files#diff-2ad6ab3e3b9d04131794c79a52e5f18dc271a4bcbf5d6c08694a956e1a48e287R51-R54

> Does this create issues only for nightly runs and not for PRs?

It only creates it for the nightly. The pre-submit CI for PRs will not trigger the triage or the GitHub issue filing.

> If the same failure happens for many days, does it create new issues? Or append info to the existing one?

If the failure goes on for many days, it will create new issues for each day it fails. Another option is to have one issue, and then it can be closed if an engineer deems it as "fixed", and then the triaging workflow can re-open it and add a comment with the new "summary table".

I thought creating new issues would be preferred to keep conversations self-contained since each issue may be different; but the tradeoff is there's more housekeeping everyone has to do to make sure our issue page is not cluttered with these bot triages.
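
The one-issue-per-day behavior can be sketched as follows. The title format is taken from the two example issues (#215 and #218); the helper names `issue_title` and `issue_action` are hypothetical, and whether the real workflow checks for an existing issue before filing is not stated in this thread.

```shell
# Hypothetical helper: build the bot issue title for a framework and a date.
issue_title() {
  printf '[Bot] %s test failures on %s\n' "$1" "$2"
}

# Hypothetical helper: read existing open issue titles on stdin (one per line)
# and print "reuse" if the given title is already present, else "create".
# -F matches the title literally, -x requires a whole-line match.
issue_action() {
  if grep -Fxq -- "$1"; then
    echo reuse
  else
    echo create
  fi
}

title=$(issue_title pax 2023-07-08)
printf '%s\n' "$title" | issue_action "$title"
```

Because the date is part of the title, a failure on a later day produces a different title and therefore a new issue, while a re-run on the same day would match the existing one under this sketch.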

nouiz · 2023-09-12T18:58:26Z

SG to me as a start. Clearly better than what we have now.
We could just check each day if the bug is the same and close the extra issue as duplicate.
If we are able to fix them fast enough, it shouldn't be too much overhead.

.github/workflows/_sandbox.yaml

yhtang · 2023-09-14T05:50:40Z

.github/workflows/scripts/inspect_remote_img.sh

+  PACKAGE=$(echo $IMAGE_REPO | rev | cut -d/ -f1 | rev)
+  ORG=$(echo $IMAGE_REPO | rev | cut -d/ -f2 | rev)
+
+  top_manifest_digest=$(curl -s -H "Authorization: Bearer $(echo $GH_TOKEN | base64)" "https://ghcr.io/v2/$ORG/$PACKAGE/manifests/$TAG" | jq -r .manifests[0].digest)


This might not work in the future when we start to build multi-arch containers for PAX and T5X, since manifests[] will contain at least two manifests corresponding to the two architectures, respectively.

terrykong · 2023-09-18

Gotcha. Do you have an example of an image with multiple manifests I can use as a reference to update this code to select the right manifest?

Or are you suggesting we just merge this in and fix it later (create a GH issue)?

yhtang · 2023-09-26

Issue created: #263.
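
The multi-arch concern raised in this thread can be illustrated with a platform-aware variant of the jq filter. This is a sketch only, not the fix tracked in #263: the function name `select_manifest_digest` is hypothetical, the digests in the sample index are made-up placeholders, and it assumes each entry of `manifests[]` carries `platform.architecture` and `platform.os` fields, as in an OCI image index.

```shell
# Hypothetical helper: read an image index (manifest list) as JSON on stdin and
# print the digest of the first entry matching the requested platform.
# Prints nothing if no entry matches. Requires jq.
select_manifest_digest() {
  arch=${1:-amd64}
  os=${2:-linux}
  jq -r --arg arch "$arch" --arg os "$os" '
    [ .manifests[]
      | select(.platform.architecture == $arch and .platform.os == $os)
      | .digest ]
    | .[0] // empty
  '
}

# Sample index with placeholder digests for two architectures.
index='{"manifests":[
  {"digest":"sha256:aaa","platform":{"architecture":"amd64","os":"linux"}},
  {"digest":"sha256:bbb","platform":{"architecture":"arm64","os":"linux"}}]}'

printf '%s' "$index" | select_manifest_digest arm64 linux
```

Selecting by platform instead of taking `manifests[0]` avoids depending on the order in which the registry lists the per-architecture manifests.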

terrykong · 2023-09-20T19:35:22Z

The two failures can be ignored:

CI / test-pax / outcome: is not caused by this PR, but rather surfaced b/c of this PR. So its result should not block this PR.
CI / build-rosetta-t5x / build: is known to be broken on main b/c the TE PR in t5x has a merge conflict which we are resolving as we speak.

terrykong added 21 commits September 5, 2023 11:38

Adds Dockerfile.ff capable of taking two images and building a new one to triage dependency issues

a6d4a9d

STATUS -> TEST_STATUS

0911ec3

only test 1 gpu for speed

2d8bfed

try metadata-action

be4d5ed

fix

3d54311

docker login

805e420

Weren't getting all tags before

ff53da4

log stuff

eb7e5e2

wrong runner tags

dfe5484

try failure

d3085aa

wip

ca89fd5

actually run something

46cb007

simplify with new summary

e3eeaf4

fix stuff

d43cbf7

add pax branch

75d7e81

Sets outcome job

f2d9a12

add triaging everywhere

2ec2458

fix condition

f6139e1

cleanup

63a4407

Revert sandbox and fix condition to skip ff summaries for the other framework

3f19e6f

re-enable t5x tests

6886b12

terrykong marked this pull request as draft September 5, 2023 23:46

terrykong added 3 commits September 5, 2023 16:53

Add REPO_DIRS to specify only paxml, praxis, flax, and t5x

1ecd393

Add triaging user doc

40c6827

nit

6c13eca

terrykong requested review from sharathts and yhtang September 6, 2023 00:40

terrykong added 8 commits September 6, 2023 11:20

Add the failing image to the table just for completeness

416a216

Add outcome interpretation to _triage.yaml and document how to interpret it triage.md

1dea780

re-enable sandbox to create example runs

0e9a0bd

update sandbox with github issue update and fail-fast=false everywhere

ccd0e61

move scripts to their own dir and fix pagination issue of jobs

4efe4c6

make outcome always run

8e576fe

fixes typo

5984811

add file issue everywhere

eed24c8

terrykong marked this pull request as ready for review September 12, 2023 06:36

terrykong added 2 commits September 11, 2023 23:36

Merge branch 'main' into triage-tool

536fcd4

update docs

ae13c5f

sharathts previously approved these changes Sep 12, 2023

yhtang previously approved these changes Sep 14, 2023

Reset sandbox

10cb6da

terrykong dismissed stale reviews from yhtang and sharathts via 10cb6da September 18, 2023 15:47

terrykong added 2 commits September 18, 2023 08:49

Merge branch 'main' into triage-tool

6d575dc

Reformat workflow scripts so description is in function

d369da2

yhtang approved these changes Sep 26, 2023

yhtang merged commit 04f240b into main Sep 26, 2023
44 of 46 checks passed

yhtang deleted the triage-tool branch September 26, 2023 06:29

yhtang mentioned this pull request Sep 26, 2023

Platform-aware extraction of container image digest #263

Closed
