Adds a triage job that triggers if pax MGMN or t5x MGMN tests fail #208

Merged: 37 commits, Sep 26, 2023
Commits
a6d4a9d Adds Dockerfile.ff capable of taking two images and building a new one (terrykong, Jul 20, 2023)
0911ec3 STATUS -> TEST_STATUS (terrykong, Aug 29, 2023)
2d8bfed only test 1 gpu for speed (terrykong, Aug 29, 2023)
be4d5ed try metadata-action (terrykong, Aug 29, 2023)
3d54311 fix (terrykong, Aug 30, 2023)
805e420 docker login (terrykong, Aug 30, 2023)
ff53da4 Weren't getting all tags before (terrykong, Aug 30, 2023)
eb7e5e2 log stuff (terrykong, Aug 30, 2023)
dfe5484 wrong runner tags (terrykong, Aug 30, 2023)
d3085aa try failure (terrykong, Aug 30, 2023)
ca89fd5 wip (terrykong, Aug 30, 2023)
46cb007 actually run something (terrykong, Aug 30, 2023)
e3eeaf4 simplify with new summary (terrykong, Sep 5, 2023)
d43cbf7 fix stuff (terrykong, Sep 5, 2023)
75d7e81 add pax branch (terrykong, Sep 5, 2023)
f2d9a12 Sets outcome job (terrykong, Sep 5, 2023)
2ec2458 add triaging everywhere (terrykong, Sep 5, 2023)
f6139e1 fix condition (terrykong, Sep 5, 2023)
63a4407 cleanup (terrykong, Sep 5, 2023)
3f19e6f Revert sandbox and fix condition to skip ff summaries for the other (terrykong, Sep 5, 2023)
6886b12 re-enable t5x tests (terrykong, Sep 5, 2023)
1ecd393 Add REPO_DIRS to specify only paxml, praxis, flax, and t5x (terrykong, Sep 5, 2023)
40c6827 Add triaging user doc (terrykong, Sep 6, 2023)
6c13eca nit (terrykong, Sep 6, 2023)
416a216 Add the failing image to the table just for completeness (terrykong, Sep 6, 2023)
1dea780 Add outcome interpretation to _triage.yaml and document how to interpret (terrykong, Sep 6, 2023)
0e9a0bd re-enable sandbox to create example runs (terrykong, Sep 6, 2023)
ccd0e61 update sandbox with github issue update and fail-fast=false everywhere (terrykong, Sep 8, 2023)
4efe4c6 move scripts to their own dir and fix pagination issue of jobs (terrykong, Sep 11, 2023)
8e576fe make outcome always run (terrykong, Sep 11, 2023)
5984811 fixes typo (terrykong, Sep 11, 2023)
eed24c8 add file issue everywhere (terrykong, Sep 12, 2023)
536fcd4 Merge branch 'main' into triage-tool (terrykong, Sep 12, 2023)
ae13c5f update docs (terrykong, Sep 12, 2023)
10cb6da Reset sandbox (terrykong, Sep 18, 2023)
6d575dc Merge branch 'main' into triage-tool (terrykong, Sep 18, 2023)
d369da2 Reformat workflow scripts so description is in function (terrykong, Sep 19, 2023)
92 changes: 92 additions & 0 deletions .github/container/Dockerfile.ff
@@ -0,0 +1,92 @@
# syntax=docker/dockerfile:1-labs
###############################################################################
## This is a development Dockerfile that fast-forwards dependencies to allow
## testing last HEAD changes of a few dependencies in an older image
##
## Any linear scan or bisection happens outside of this image, since the
## test to determine whether the image is functional or not may depend on
## a more sophisticated testing process, e.g., submitting to slurm.
###############################################################################

# The broken image is used to extract metadata
ARG BROKEN_IMAGE
# The base image that we are going to fast-forward
ARG BASE_IMAGE

FROM ${BROKEN_IMAGE} AS broken

RUN <<"EOF" bash -e
echo $BUILD_DATE >/build_date
for repo in $(find /opt -mindepth 1 -maxdepth 1 -type d); do
  if [[ ! -d $repo/.git ]]; then
    continue
  fi
  echo -e "$repo\t$(git -C $repo rev-parse HEAD)" >>/ff.txt
done
EOF

FROM ${BASE_IMAGE} AS ff-image
# Space separated string where each item is a repo dir JAX-Toolbox installed
# Example:
# --build-arg REPO_DIRS="/opt/t5x /opt/flax"
ARG REPO_DIRS=""

COPY --from=broken /build_date /build_date
COPY --from=broken /ff.txt /ff.txt

RUN <<"EOF" bash -e
ALL_DIRS=${REPO_DIRS}
if [[ -z "$ALL_DIRS" ]]; then
  ALL_DIRS="$(find /opt -mindepth 1 -maxdepth 1 -type d)"
fi
for repo in $ALL_DIRS; do
  if [[ ! -d $repo/.git ]]; then
    continue
  fi
  ff_git_ref=$(fgrep $repo /ff.txt | cut -f2)
  if [[ -z "$ff_git_ref" ]]; then
    echo "[ERROR]: There is no commit for $repo to FF to:"
    cat /ff.txt
    exit 1
  fi
  cd $repo
  # Create a branch for reference of the previous HEAD commit
  git branch --force previous-HEAD HEAD
  git branch --force $BUILD_DATE previous-HEAD # alias
  # Grab latest update from remote, since FF commit is likely farther ahead than current main/HEAD
  git fetch -a
  # Checkout a new branch at this FF git ref
  git checkout -b ff-to-$(cat /build_date) $ff_git_ref
done
EOF

COPY --chmod=755 <<"EOF" /usr/local/bin/ff-summary
#!/bin/bash

for repo in $(find /opt -mindepth 1 -maxdepth 1 -type d | sort); do
  if [[ ! -d $repo/.git ]]; then
    continue
  fi
  cd $repo
  SUMMARY=""
  if ! git show previous-HEAD >/dev/null 2>&1; then
    SUMMARY="(UNCHANGED)"
  elif [[ $(git rev-parse HEAD) == $(git rev-parse previous-HEAD) ]]; then
    SUMMARY="(HEAD == previous-HEAD)"
  fi
  echo "======================================================================================="
  echo "[Repo]: $repo $SUMMARY"
  echo "======================================================================================="
  echo "**********"
  echo "** HEAD **"
  echo "**********"
  git -C $repo show --quiet --format="commit %H%d%nAuthor: %an <%ae>%nCommit: %cn <%ce>%nDate: %ad%n%n%s%n%b" HEAD
  if [[ "$SUMMARY" == "(UNCHANGED)" ]]; then
    continue
  fi
  echo "*******************"
  echo "** previous-HEAD **"
  echo "*******************"
  git -C $repo show --quiet --format="commit %H%d%nAuthor: %an <%ae>%nCommit: %cn <%ce>%nDate: %ad%n%n%s%n%b" previous-HEAD
done
EOF
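
Note on the lookup in the second stage: `fgrep $repo /ff.txt` matches substrings, so a repo path such as `/opt/t5x` would also match a line for a hypothetical `/opt/t5x-extra`. The following is a minimal sketch of the same tab-separated record-and-lookup step with an exact match on the first column. The helper name `lookup_ff_ref`, the temp file, and the sample commit values are illustrative assumptions and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: mimics the tab-separated /ff.txt format written by the "broken" stage.
# lookup_ff_ref and the sample data below are hypothetical, not part of the PR.

# Print the commit recorded for repo dir $1 in ff file $2 (exact match on column 1).
lookup_ff_ref() {
  local repo="$1" ff_file="$2"
  awk -F'\t' -v repo="$repo" '$1 == repo { print $2 }' "$ff_file"
}

ff_file="$(mktemp)"
printf '%s\t%s\n' /opt/t5x aaaaaaa >>"$ff_file"
printf '%s\t%s\n' /opt/t5x-extra bbbbbbb >>"$ff_file"

# Substring matching (as fgrep does) returns both commits for /opt/t5x:
grep -F /opt/t5x "$ff_file" | cut -f2
# Exact matching on the first column returns only the intended commit:
lookup_ff_ref /opt/t5x "$ff_file"
```

With the four repos named in the commit log (paxml, praxis, flax, t5x) no path is a prefix of another, so the substring match may never misfire in practice; the exact match only matters if such a directory is added later.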
6 changes: 0 additions & 6 deletions .github/workflows/_publish_badge.yaml
@@ -62,12 +62,6 @@ jobs:
           EOF
           ) | tee ${{ inputs.ENDPOINT_FILENAME }}
 
-      - name: Upload badge artifact
-        uses: actions/upload-artifact@v3
-        with:
-          name: ${{ inputs.ENDPOINT_FILENAME }}
-          path: ${{ inputs.ENDPOINT_FILENAME }}
-
       - name: Update status badge file in gist
         uses: actions/github-script@v6
         if: inputs.PUBLISH
11 changes: 11 additions & 0 deletions .github/workflows/_test_pax.yaml
@@ -265,3 +265,14 @@ jobs:
 
           EOF
           ) | tee $GITHUB_STEP_SUMMARY
+
+  outcome:
+    needs: publish-test
+    runs-on: ubuntu-22.04
+    if: ( always() )
+    steps:
+      - name: Sets workflow status based on test outputs
+        run: |
+          if [[ ${{ needs.publish-test.outputs.STATUS }} != success ]]; then
+            exit 1
+          fi
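
Note on the `outcome` job: it turns the `STATUS` output of `publish-test` into the workflow's pass/fail result. The expression is expanded unquoted inside `[[ ... ]]`, so an empty output leaves a malformed test that bash rejects instead of a clean comparison; the step still exits non-zero, but the log is less legible. The following is a minimal sketch of the same check with the value quoted. `check_outcome` is a hypothetical helper name used only for illustration and is not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: check_outcome is a hypothetical helper, not part of the PR.
# Returns 0 only when the reported status is exactly "success".
check_outcome() {
  local status="${1:-}"
  if [[ "$status" != "success" ]]; then
    echo "publish-test reported status: '${status}'" >&2
    return 1
  fi
}

check_outcome success && echo "workflow passes"
check_outcome failure || echo "workflow fails"
check_outcome "" || echo "empty status also fails"
```

In the workflow itself the equivalent change would be to wrap the `${{ ... }}` expression in double quotes inside the `[[ ... ]]` test.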
23 changes: 18 additions & 5 deletions .github/workflows/_test_t5x.yaml
@@ -282,12 +282,14 @@ jobs:
           FAILED_TESTS=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
           TOTAL_TESTS=$(ls $EXIT_STATUSES | wc -l)
 
-          echo '## T5x MGMN+SPMD Test Status' >> $GITHUB_STEP_SUMMARY
+          cat <<EOF >>$GITHUB_STEP_SUMMARY
+          ## T5x MGMN+SPMD Test Status
+          | Test Case | State | Exit Code |
+          | --- | --- | --- |
+          EOF
           for i in $EXIT_STATUSES; do
-            echo $i | cut -d'.' -f1
-            echo '```json'
-            jq . $i
-            echo '```'
+            # Files are named <GHID>-<NAME>/<NAME>-status.json
+            echo "| $(echo $i | cut -d/ -f1 | cut -d- -f2) | $(jq -r .state $i) | $(jq -r .exitcode $i)"
           done | tee -a $GITHUB_STEP_SUMMARY
 
           echo "Test statuses:"
@@ -322,3 +324,14 @@ jobs:
 
           EOF
           ) | tee $GITHUB_STEP_SUMMARY
+
+  outcome:
+    needs: publish-test
+    runs-on: ubuntu-22.04
+    if: ( always() )
+    steps:
+      - name: Sets workflow status based on test outputs
+        run: |
+          if [[ ${{ needs.publish-test.outputs.STATUS }} != success ]]; then
+            exit 1
+          fi
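
Note on the summary table in `_test_t5x.yaml`: the test-case column is derived from the status file path with `cut -d/ -f1 | cut -d- -f2`, which keeps only the text between the first and second hyphen of the directory name. If a test name itself contains hyphens, it would be truncated in the table. The following is a minimal sketch, assuming the `<GHID>-<NAME>/<NAME>-status.json` layout stated in the code comment and a `<GHID>` without hyphens. The sample paths and the `test_name_from_path` helper are made up for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: the helper and sample paths are hypothetical.
# Extract <NAME> from "<GHID>-<NAME>/<NAME>-status.json", keeping hyphens in <NAME>.
test_name_from_path() {
  local dir="${1%%/*}"   # drop everything from the first "/" -> "<GHID>-<NAME>"
  echo "${dir#*-}"       # drop the shortest prefix up to the first "-" -> "<NAME>"
}

path="123456-1G1N-large/1G1N-large-status.json"
echo "$path" | cut -d/ -f1 | cut -d- -f2   # prints "1G1N" (truncated)
test_name_from_path "$path"                # prints "1G1N-large"
```

An equivalent one-line change in the existing pipeline would be `cut -d- -f2-`, which keeps every field after the first hyphen.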