Add triaging user doc
terrykong committed Sep 6, 2023
1 parent 1ecd393 commit 40c6827
Showing 1 changed file with 50 additions and 0 deletions.
50 changes: 50 additions & 0 deletions docs/triage.md
@@ -0,0 +1,50 @@
# Triage Workflow

There is a GitHub Actions workflow called [_triage.yaml](../.github/workflows/_triage.yaml) that
helps determine whether a test failure was caused by a change in a library (t5x or pax) or by something further upstream (e.g., Jax or CUDA). The workflow is not conclusive, and further investigation is usually needed,
but it automates answering questions like "what state of library X works with Jax at state Y?"


## Algorithm
The pseudocode for the triaging algorithm is as follows:
```python
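# Broken pax + jax
BROKEN_NIGHTLY = 'ghcr.io/nvidia/pax:nightly-YYYY-MM-05'
# Working pax + jax
WORKING_NIGHTLY = 'ghcr.io/nvidia/pax:nightly-YYYY-MM-01'

# between(), fast_forward_pax_in(), and run_pax_tests_on() are placeholders
# for steps of the workflow, not real functions.
def triage(working_nightly, broken_nightly):
    suspects = {}
    for container in between(working_nightly, broken_nightly):
        # Fast-forward pax inside this older container, then re-run the tests.
        new_container = fast_forward_pax_in(container)
        test_result = run_pax_tests_on(new_container)
        if test_result == "Pass":
            suspects[container] = "Suspect: Newer Jax containers"
        else:
            suspects[container] = "Suspect: New change in pax"
    return suspects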
```

__Note__: Since we are working with multiple repositories, we cannot binary-search over
the containers. Binary search assumes the test results between the working and broken containers are monotonic (every container up to some point passes and every later one fails), and that is not guaranteed. So the only safe choice is to linearly scan through the
images between `WORKING_NIGHTLY` and `BROKEN_NIGHTLY`.
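
To make the note above concrete, here is a minimal sketch contrasting a linear scan with a binary search when results are not monotonic. The container names and the hard-coded pass/fail outcomes are made up for illustration, and `linear_scan` and `bisect_first_failure` are hypothetical helpers, not part of `_triage.yaml`.

```python
from typing import Callable, Dict, List, Tuple


def linear_scan(containers: List[str], passes: Callable[[str], bool]) -> Dict[str, str]:
    """Test every container in order and record a result for each one."""
    return {c: ("success" if passes(c) else "failure") for c in containers}


def bisect_first_failure(
    containers: List[str], passes: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Binary search for the first failure, assuming monotonic results.

    Assumes the first container passes and the last one fails.
    Returns the container it blames and the list of containers it tested.
    """
    lo, hi = 0, len(containers) - 1
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(containers[mid])
        if passes(containers[mid]):
            lo = mid
        else:
            hi = mid
    return containers[hi], tested


# Made-up, non-monotonic outcomes: nightly-02 fails although nightly-03 passes.
fake_results = {
    "nightly-01": True,
    "nightly-02": False,
    "nightly-03": True,
    "nightly-04": False,
    "nightly-05": False,
}
containers = list(fake_results)


def lookup(container: str) -> bool:
    """Stand-in for running the real test suite on a container."""
    return fake_results[container]


results = linear_scan(containers, lookup)
first_bad, tested = bisect_first_failure(containers, lookup)

print(results)    # one result per container, including the nightly-02 failure
print(first_bad)  # the single container the binary search blames
print(tested)     # the binary search never looks at nightly-02
```

In this made-up case the binary search skips `nightly-02` entirely, so its failure goes unnoticed, while the linear scan reports a result for every container.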

## How to use it
There are two ways the triage workflow can be used:

1. As a [re-usable workflow](https://docs.github.com/en/actions/using-workflows/reusing-workflows)
(example: [nightly-pax-test-mgmn.yaml](../.github/workflows/nightly-pax-test-mgmn.yaml)). Existing
workflows will trigger the `_triage.yaml` workflow if the tests fail.
2. Triggered manually from the [web UI](https://github.com/NVIDIA/JAX-Toolbox/actions/workflows/_triage.yaml).
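
Since the workflow can be triggered manually, it can in principle also be dispatched through GitHub's REST API for workflow dispatch events. The sketch below only builds the request and does not send it. The input names `BROKEN_IMAGE` and `WORKING_IMAGE`, the `main` ref, and the token placeholder are assumptions for illustration; check the `inputs` declared in `_triage.yaml` for the real names.

```python
import json
import urllib.request


def build_dispatch_request(token: str, ref: str, inputs: dict) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for _triage.yaml."""
    url = (
        "https://api.github.com/repos/NVIDIA/JAX-Toolbox"
        "/actions/workflows/_triage.yaml/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_dispatch_request(
    token="<YOUR_TOKEN>",  # placeholder, supply a real token with workflow access
    ref="main",            # assumed branch name
    inputs={
        # Hypothetical input names; the real ones are defined in _triage.yaml.
        "BROKEN_IMAGE": "ghcr.io/nvidia/pax:nightly-YYYY-MM-05",
        "WORKING_IMAGE": "ghcr.io/nvidia/pax:nightly-YYYY-MM-01",
    },
)

print(request.get_method(), request.full_url)
# To actually trigger the run: urllib.request.urlopen(request)
```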

### Inspecting the output
After the job finishes, open the summary of the run. It should contain a table
like [this](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6089563249#summary-16523914207) for pax
or like [this](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6087387492#summary-16516484677) for t5x.

Both should show a table like this:
| Rewind to | Test result | Image |
| --- | --- | --- |
| nightly-2023-07-18 | success | ghcr.io/nvidia/jax-toolbox-internal:6087387492-nightly-2023-07-18-ff-t5x-to-2023-07-20 |
| nightly-2023-07-19 | success | ghcr.io/nvidia/jax-toolbox-internal:6087387492-nightly-2023-07-19-ff-t5x-to-2023-07-20 |

Here, "Rewind to" is the nightly we started from before fast-forwarding the libraries;
"Test result" is the test result for the new fast-forwarded image; and "Image" is
the updated image containing the fast-forwarded code.
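
If you want to post-process the summary, the table is easy to read programmatically. The following is a minimal sketch that parses the example table above into rows and applies the rule from the pseudocode in the Algorithm section; `parse_summary_table` and `suspect_for` are hypothetical helpers, not something the workflow provides.

```python
from typing import Dict, List


def parse_summary_table(markdown: str) -> List[Dict[str, str]]:
    """Turn a simple Markdown table into a list of {header: cell} dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the | --- | separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def suspect_for(row: Dict[str, str]) -> str:
    """Apply the rule from the pseudocode to one table row."""
    if row["Test result"] == "success":
        return "Suspect: Newer Jax containers"
    return "Suspect: New change in library"


SUMMARY = """
| Rewind to | Test result | Image |
| --- | --- | --- |
| nightly-2023-07-18 | success | ghcr.io/nvidia/jax-toolbox-internal:6087387492-nightly-2023-07-18-ff-t5x-to-2023-07-20 |
| nightly-2023-07-19 | success | ghcr.io/nvidia/jax-toolbox-internal:6087387492-nightly-2023-07-19-ff-t5x-to-2023-07-20 |
"""

rows = parse_summary_table(SUMMARY)
for row in rows:
    print(row["Rewind to"], "->", suspect_for(row))
```

In the example table both rows report `success`, meaning the fast-forwarded t5x code passes on both older nightlies, which by the rule above points toward the newer Jax containers. As noted earlier, this is a hint for further investigation rather than a final diagnosis.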
