Update E3SM-Project submodule #461

Merged
merged 1 commit into MPAS-Dev:master from update_e3sm_project_submodule
Dec 13, 2022

xylar
Collaborator

@xylar xylar commented Nov 19, 2022

This merge updates the E3SM-Project submodule from 6b81271377 to 569ed6b730.

This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):

@xylar
Collaborator Author

xylar commented Nov 23, 2022

I'm having trouble tracking down which changes alter the results, so I'm going to write a utility for comparing the results of the pr test suite between successive relevant E3SM PR merges and include that in this merge.
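
A minimal sketch of the pairing step such a utility needs, assuming the relevant merges are given as an ordered list of labels (for example `21_5189` and `22_5194`, as used later in this thread). The function name `successive_pairs` is hypothetical and is not taken from compass; the actual utility may be organized quite differently.

```shell
# Print one "baseline candidate" pair per line for each pair of
# neighbouring labels in the ordered argument list.
# Hypothetical helper for illustration only; not part of compass.
successive_pairs() {
    prev=""
    for label in "$@"; do
        if [ -n "$prev" ]; then
            # The previous merge serves as the baseline for the current one.
            printf '%s %s\n' "$prev" "$label"
        fi
        prev="$label"
    done
}

# Example call (labels as they appear in this thread):
#   successive_pairs 21_5189 22_5194
```

Each printed pair would then drive one test-suite run of the second merge validated against the first as its baseline, so a mismatch can be attributed to a single merge.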

@xylar
Collaborator Author

xylar commented Nov 30, 2022

@mark-petersen and @jonbob, I would appreciate you having a look at this when you can. I'll comment more on Slack so we can discuss there (or here).

@xylar
Collaborator Author

xylar commented Nov 30, 2022

Results from all of these tests are available here:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20221127/update_e3sm_project

I think the subdirectories are straightforward. Each has log files that document the validation failures and the size of the differences.

@mark-petersen
Collaborator

I think there are only two PR results to resolve here:

On the first, 5195, the failure of baroclinic_channel/decomp_test will be resolved by E3SM-Project/E3SM#5356. Now that we understand it and plan to fix it, I think we can proceed with updating past E3SM-Project/E3SM#5195 in compass.

On the second, I tested on cori and with OPENMP=false the nightly test suite passes bfb comparison with previous master for gnu (both debug and optimized) and intel (both debug and optimized). I need to try with OPENMP=true, but this appears to be a mismatch caused by OPENMP.

@mark-petersen
Collaborator

Tested intel on chrysalis with OPENMP=true. Comparing before and after E3SM-Project/E3SM#5194 (8b8bbba994 vs 2491723cca):

  • intel optimized: all tests match bfb before/after 5194. ocean/baroclinic_channel/10km/decomp_test fails internal comparison on both, as expected.
  • intel debug: two threads_tests fail validation and baseline compare. All others pass validation and baseline compare. (I don't know why the thread tests fail).

@xylar, you saw numerous baseline mismatches in the first post above for the 5194 merge, which is different from what I see. I double-checked the commits. I don't understand why we are getting different results.

@mark-petersen
Collaborator

@xylar, your new testing script looks very useful. Thanks for doing that. Looking in your directory here, I can see the baseline mismatches:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20221127/update_e3sm_project/22_5194_5189/compass_pr.o253780

and I see you are running intel on chrysalis, same as me. I loaded my environment with compass at commit 6396cfa (dated Thu Dec 8), and my modules and make commands are:

Currently Loaded Modules:
  1) cmake/3.19.1-yisciec   3) intel/20.0.4-kodw73g           5) intel-mkl/2020.4.304-g2qaxzf   7) netcdf-c/4.4.1-qvxyzq2         9) parallel-netcdf/1.11.0-b74wv4m
  2) perl/5.32.0-bsnc6lt    4) intel-mpi/2019.9.304-tkzvizk   6) hdf5/1.8.16-se4xyo7            8) netcdf-fortran/4.4.4-rdxohvp

make intel-mpi  USE_PIO2=true OPENMP=true DEBUG=false
make intel-mpi  USE_PIO2=true OPENMP=true DEBUG=true

When I compare the two merge commits:

*   2491723cca (after) Merge branch 'philipwjones/mpas-ocean/rm-device-resident' (PR #5194)
...
*    8b8bbba994 (before) Merge branch 'azamat/cime/shorten-jenkins-paths-for-crusher-amdclang' (PR #5224)

I get bfb match for both intel debug and intel optimized. Are you comparing the merge commits or the branch commits just before the merge? Another difference is that I'm comparing to 8b8bbba994, the previous first-parent commit, and you are comparing to PR 5189, the previous ocean merge. Could that explain the difference in our results? If so, there must be another merge in between 5189 and 5194 that changed the baseline. Could you repeat my test to verify?

@xylar
Collaborator Author

xylar commented Dec 12, 2022

@mark-petersen, thank you. I will retest and try to figure out which commit between PR 5189 and 5194 is the actual culprit. I will first retest 5189 vs. 5194 because I just don't see how any of these intermediate PRs could be the cause:

* 2491723cca (22_5194) Merge branch 'philipwjones/mpas-ocean/rm-device-resident' (PR #5194)
* 8b8bbba994 Merge branch 'azamat/cime/shorten-jenkins-paths-for-crusher-amdclang' (PR #5224)
* 27b7d6c5f8 Merge branch 'bishtgautam/datm/jra' (PR #5150)
* a17d4b4fc0 Merge branch 'origin/oksanaguba/homme/spock' (PR #5039)
* 2ad7a59139 Merge branch 'ambrad/hommexx/sl-support-mpi-on-host' (PR #5223)
* f8b135bcf9 (21_5189) Merge branch 'mark-petersen/ocn/harmonic-analysis-enddo' (PR #5189)
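
A list like the one above can be produced with `git log --first-parent`, which follows only the first parent of each merge and so shows the sequence of merges onto the main branch. This is a minimal sketch, assuming it is run inside a clone of the E3SM repository; the function name `first_parent_range` is hypothetical.

```shell
# List the first-parent commits that come after OLD, up to and
# including NEW, newest first. OLD itself is excluded by the
# "OLD..NEW" range syntax.
# Hypothetical helper for illustration; run inside the E3SM clone.
first_parent_range() {
    old="$1"
    new="$2"
    git log --first-parent --oneline "${old}..${new}"
}

# Example call, using the two hashes from this thread:
#   first_parent_range f8b135bcf9 2491723cca
```

Each commit in the output is a candidate for its own baseline comparison, which narrows down where a non-bfb change entered.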

@xylar
Collaborator Author

xylar commented Dec 13, 2022

@mark-petersen, I'm still baffled. I'm working from a slightly different compass (3b2b18f, which is the branch for #466), but I have the same modules:

$ module list

Currently Loaded Modules:
  1) cmake/3.19.1-yisciec           6) hdf5/1.8.16-se4xyo7
  2) perl/5.32.0-bsnc6lt            7) netcdf-c/4.4.1-qvxyzq2
  3) intel/20.0.4-kodw73g           8) netcdf-fortran/4.4.4-rdxohvp
  4) intel-mpi/2019.9.304-tkzvizk   9) parallel-netcdf/1.11.0-b74wv4m
  5) intel-mkl/2020.4.304-g2qaxzf

I have just used environment variables (in the load script) to have:

USE_PIO2=true
OPENMP=true

By not specifying anything, I have DEBUG=false. I just run make clean; make intel-mpi.

I'm seeing the same non-bfb behavior with 5194 vs. 5189 as before, and I'm seeing bfb between 5224 and 5189. I am about to test 5194 vs. 5224, and I can't see how that could be anything other than non-bfb. I will update shortly.

@xylar
Collaborator Author

xylar commented Dec 13, 2022

Okay, this is truly bizarre! I see BFB between 5194 and 5224, and also between 5224 and 5189 but not between 5194 and 5189. I really don't understand what's going on.

Update: sorry, I jumped the gun. It passed a few baroclinic channel test cases and I thought that was odd, but I now see those were passing before. So I'm seeing the same diffs between 5194 and 5224 as I saw between 5194 and 5189. They seem to be quite persistent for me.

See:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20221213/e3sm_submodule/22_5194_vs_5189
/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20221213/e3sm_submodule/22_5194_vs_5224
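
Bit-for-bit agreement is transitive: if A matches B and B matches C, then A must match C, which is why the first reading of these results looked impossible. A minimal sketch of a byte-level check between two output directories follows. The helper `bfb_compare` is hypothetical and is not how compass validates results; a byte-for-byte `cmp` may be stricter than a comparison of the model fields themselves, because file metadata alone can differ.

```shell
# Compare every regular file under DIR_A with the file at the same
# relative path under DIR_B. Prints one line per missing or differing
# file and returns 1 if any were found, 0 otherwise.
# Hypothetical helper for illustration; not part of compass.
bfb_compare() {
    dir_a="$1"
    dir_b="$2"
    list=$(mktemp)
    # Collect relative paths first so the loop runs in the current shell
    # and can update the status variable.
    (cd "$dir_a" && find . -type f | sort) > "$list"
    status=0
    while IFS= read -r rel; do
        if [ ! -f "$dir_b/$rel" ]; then
            echo "MISSING $rel"
            status=1
        elif ! cmp -s "$dir_a/$rel" "$dir_b/$rel"; then
            echo "DIFF $rel"
            status=1
        fi
    done < "$list"
    rm -f "$list"
    return "$status"
}
```

Running it on all three pairings of three runs is a quick sanity check: two clean pairings with a third that reports differences point to a mistake in the test setup rather than in the code.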

@mark-petersen
Collaborator

OK. Maybe I made a mistake. I'll test again.

@mark-petersen
Collaborator

Sorry, it was my mistake. I can reproduce your baseline mismatch between 5224 and 5194, using intel optimized on chrysalis. I used compass suite -p and pointed to the wrong directory the first time.

@xylar
Collaborator Author

xylar commented Dec 13, 2022

Things are looking good when I use OpenMPI:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20221213/e3sm_submodule/openmpi_22_5194_vs_5224

@xylar xylar removed the request for review from jonbob December 13, 2022 15:58
@xylar
Collaborator Author

xylar commented Dec 13, 2022

@mark-petersen, if you're okay with the BFB results I'm seeing for that last test (with OpenMPI instead of Intel-MPI), I think we can move on. We will bring in E3SM-Project/E3SM#5356 as soon as it's merged but this PR doesn't need to wait for that.

Collaborator

@mark-petersen mark-petersen left a comment

Yes, now that we understand the two non-bfb commits above, 5194 and 5195, we can proceed with this update. Thanks.

@xylar
Collaborator Author

xylar commented Dec 13, 2022

Thanks very much for your testing, @mark-petersen, and for the effort in figuring out these differences. It makes a big difference to me that we understand when things changed and why.

@xylar xylar merged commit 3128f75 into MPAS-Dev:master Dec 13, 2022
@xylar xylar deleted the update_e3sm_project_submodule branch December 13, 2022 16:17
@xylar xylar mentioned this pull request Dec 13, 2022

2 participants