Update E3SM-Project submodule #461
I'm having trouble tracking down changes so I'm going to write a utility for comparing the results of the
@mark-petersen and @jonbob, I would appreciate you having a look at this when you can. I'll comment more on Slack so we can discuss there (or here).
Results from all of these tests are available here:
I think the subdirectories are straightforward. Each has log files that document the validation failures and the size of the differences.
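For readers unfamiliar with what a validation log of this kind records, here is a minimal sketch of a bit-for-bit check that also reports the size of the differences. It is not the utility discussed in this PR; the function name `compare_fields` and its return values are hypothetical, and it assumes the two fields have already been read into plain sequences of floats.

```python
import math
import struct


def compare_fields(baseline, candidate):
    """Compare two equal-length sequences of floats.

    Returns (bit_for_bit, max_abs_diff, num_differing).
    Hypothetical helper for illustration only.
    """
    if len(baseline) != len(candidate):
        raise ValueError("fields have different lengths")

    num_differing = 0
    max_abs_diff = 0.0
    for a, b in zip(baseline, candidate):
        # Compare the raw 64-bit patterns so that even a sign-of-zero
        # difference counts as a mismatch.
        if struct.pack("<d", a) != struct.pack("<d", b):
            num_differing += 1
            diff = abs(a - b)
            # Skip NaN differences when tracking the largest difference.
            if not math.isnan(diff):
                max_abs_diff = max(max_abs_diff, diff)
    return num_differing == 0, max_abs_diff, num_differing


if __name__ == "__main__":
    print(compare_fields([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(compare_fields([1.0, 2.0, 3.0], [1.0, 2.5, 3.0]))
```

Comparing bit patterns rather than using a tolerance is what makes the check bit-for-bit: any difference at all, however small, is reported as a mismatch.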
I think there are only two PR results to resolve here:
On the first, 5195, failure of
On the second, I tested on cori and with OPENMP=false the nightly test suite passes bfb comparison with previous master for gnu (both debug and optimized) and intel (both debug and optimized). I need to try with OPENMP=true, but this appears to be a mismatch caused by OPENMP.
Tested intel on chrysalis with OPENMP=true. Comparing before and after E3SM-Project/E3SM#5194 (8b8bbba994 vs 2491723cca):
@xylar, you saw numerous baseline mismatches in the first post above for the 5194 merge, which is different from what I see. I double-checked the commits. I don't understand why we are getting different results.
@xylar, your new testing script looks very useful. Thanks for doing that. Looking in your directory here, I can see the baseline mismatches:
and I see you are running intel on chrysalis, same as me. I loaded my environment with compass using this commit 6396cfa Date: Thu Dec 8, and my modules and make command are:
When I compare the two merge commits:
I get bfb match for both intel debug and intel optimized. Are you comparing the merge commits or the branch commits just before the merge? Another difference is that I'm comparing to 8b8bbba994, the previous first-parent commit, and you are comparing to PR 5189, the previous ocean merge. Could that explain the difference in our results? If so, there must be another merge in between 5189 and 5194 that changed the baseline. Could you repeat my test to verify?
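Locating the merge that changed the baseline is essentially a bisection over the ordered list of merge commits. The sketch below illustrates the idea under stated assumptions: `commits` is ordered oldest to newest, the first entry matches the baseline, the last does not, and once results change they stay changed. The function name and the `is_bfb_with_baseline` callback are hypothetical; in practice the callback would build the model at a commit and run the test suite against the baseline.

```python
def first_non_bfb(commits, is_bfb_with_baseline):
    """Return the first commit whose results differ from the baseline.

    Assumes commits[0] is bit-for-bit with the baseline, commits[-1]
    is not, and the change persists in every later commit.
    Hypothetical helper for illustration only.
    """
    if len(commits) < 2:
        raise ValueError("need at least a known-good and a known-bad commit")

    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bfb_with_baseline(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


if __name__ == "__main__":
    # Placeholder commit labels, not real hashes.
    history = ["c0", "c1", "c2", "c3", "c4"]
    changed = {"c3", "c4"}
    print(first_non_bfb(history, lambda c: c not in changed))
```

Each step halves the range, so only about log2(N) builds and test runs are needed for N candidate merges.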
@mark-petersen, thank you. I will retest and try to figure out which commit between PR 5189 and 5194 is the actual culprit. I will first retest 5189 vs. 5194 because I just don't see how any of these intermediate PRs could be the cause:
@mark-petersen, I'm still baffled. I'm working from a slightly different compass (3b2b18f, which is the branch for #466). I have the same modules:
I have just used environment variables (in the load script) to have:
By not specifying anything, I have
I'm seeing the same non-bfb behavior with 5194 vs. 5189 as before. I'm seeing bfb between 5224 and 5189. I am about to test 5194 vs. 5224 but I can't see how that will not be non-bfb. I will update shortly.
Update: sorry, I jumped the gun. It passed a few baroclinic channel test cases and I thought that was odd, but I now see those were passing before. So I'm seeing the same diffs between 5194 and 5224 as I saw between 5194 and 5189. They seem to be quite persistent for me. See:
OK. Maybe I made a mistake. I'll test again.
Sorry, it was my mistake. I can reproduce your baseline mismatch between 5224 and 5194, using intel optimized on chrysalis. I used
Things are looking good when I use OpenMPI:
@mark-petersen, if you're okay with the BFB results I'm seeing for that last test (with OpenMPI instead of Intel-MPI), I think we can move on. We will bring in E3SM-Project/E3SM#5356 as soon as it's merged but this PR doesn't need to wait for that.
Yes, now that we understand the two non-bfb commits above, 5194 and 5195, we can proceed with this update. Thanks.
Thanks very much for your testing, @mark-petersen, and for the effort in figuring out these differences. It makes a big difference to me that we understand when things changed and why.
This merge updates the E3SM-Project submodule from 6b81271377 to 569ed6b730.
This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):
name_in_output registry attribute to variables E3SM-Project/E3SM#5120
gregorian_noleap to just noleap calendar E3SM-Project/E3SM#5162
baroclinic_channel/decomp_test fails. This was reported in Ocean fails stand-alone decomp test, intel optimized E3SM-Project/E3SM#5219 though it was originally attributed to 5099, not this PR. I believe this needs to be fixed.
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195, even though this PR fixed some openmp directives from 4195.
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195
baroclinic_channel/decomp_test fails as in 5195