Update WW3 for PIO/netCDF restarts #2445
base: develop
at cc70186, the following files do not compare:
rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_mpi_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_restart_bmark_p8_intel.log: Comparing ufs.cpld.ww3.r.2013-04-01-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_restart_c192_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-43200 .....USING CMP......NOT IDENTICAL
* add trho fix to w3iors, these ww3.r files do not compare
* tested against bl.trhofix
rt_cpld_mpi_gfsv17_intel.log: Test cpld_mpi_gfsv17_intel FAIL
rt_cpld_mpi_pdlib_p8_intel.log: Test cpld_mpi_pdlib_p8_intel FAIL
rt_cpld_restart_bmark_p8_intel.log: Test cpld_restart_bmark_p8_intel FAIL
rt_cpld_restart_c192_p8_intel.log: Test cpld_restart_c192_p8_intel FAIL
* no write/read of fpis. these ww3.r files do not compare. tested against bl.trhofix.nofpis. all other files compare b4b
rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_mpi_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
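The comparison results quoted in the commit messages above follow a regular, grep-style layout (log name, then the compared file, then `NOT IDENTICAL`). A minimal sketch for collecting such lines per log is shown here; the regular expression is modeled only on the lines quoted above, and the function name `failed_comparisons` is hypothetical, not part of the UFS regression-test tooling.

```python
import re
from collections import defaultdict

# Pattern modeled on the quoted lines, e.g.
# "rt_x.log: Comparing some.file .....USING CMP......NOT IDENTICAL".
# The exact log layout is an assumption based on those quotes.
NOT_IDENTICAL_RE = re.compile(
    r"^(?P<log>[^:\s]+):\s*Comparing\s+(?P<file>\S+)\s.*NOT IDENTICAL\s*$"
)


def failed_comparisons(lines):
    """Map each log name to the files reported as NOT IDENTICAL."""
    failures = defaultdict(list)
    for line in lines:
        match = NOT_IDENTICAL_RE.match(line.strip())
        if match:
            failures[match.group("log")].append(match.group("file"))
    return dict(failures)


sample = [
    "rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 "
    ".....USING CMP......NOT IDENTICAL",
    "rt_cpld_restart_c192_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-43200 "
    ".....USING CMP......NOT IDENTICAL",
]
result = failed_comparisons(sample)
for log_name, files in result.items():
    print(log_name, files)
```

Lines that do not match the pattern (for example the `Test ... FAIL` summary lines) are simply ignored.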
* fix typo in use_historync
* remove mediator_present flag (unneeded)
* the following tests pass against the baseline: cpld_debug_noaero_p8, cpld_debug_pdlib_p8, hafs_regional_storm_following_1nest_atm_ocn_wav_mom6
* tested all wave-containing tests with modifications for restart file naming to allow for the custom filenaming of binary restarts. This feature is present in the current WW3 code, but will be removed once we enable netcdf restarts. Temporary code was added to allow the binary restart to have the existing format of casename+ww3.r+timestring. With this modification, all baselines were B4B.
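As an illustration of the casename+ww3.r+timestring format mentioned above, the sketch below rebuilds a name like `ufs.cpld.ww3.r.2021-03-23-21600` from a case name and a model time. It assumes the time string is the date followed by zero-padded seconds of the day, which matches the file names quoted in this PR; the function `binary_restart_name` is hypothetical and is not the WW3 implementation.

```python
from datetime import datetime


def binary_restart_name(casename: str, when: datetime) -> str:
    """Build casename + '.ww3.r.' + 'YYYY-MM-DD-SSSSS' (assumed layout)."""
    # Seconds elapsed since midnight, zero-padded to five digits (assumption).
    seconds = when.hour * 3600 + when.minute * 60 + when.second
    return f"{casename}.ww3.r.{when:%Y-%m-%d}-{seconds:05d}"


# 06:00 on 2021-03-23 corresponds to 21600 seconds into the day.
print(binary_restart_name("ufs.cpld", datetime(2021, 3, 23, 6, 0, 0)))
```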
* ww3 hash 4674dae passes against a self-generated baseline except for cpld_restart_gfsv17_intel
* compare cmeps restart files of this uwm-hash against current baseline at develop-20240904. All are identical except for cpld_control_gfsv17_iau_intel
* ww3 0ad634c9 still fails slow restart, even though my sandbox testing passed.
* additional restart fields for WW3/slow loop coupling are requested via ww3 nml setting
Code is crashing on Gaea. Experiment path: /gpfs/f5/epic/scratch/Jong.Kim/RT_RUNDIRS/Jong.Kim/FV3_RT/rt_2431768/atmwav_control_noaero_p8_intel
develop branch runs ok on gaea.
@jkbk2004 I can't see your run directory on Gaea. I'll do my own test.
@jkbk2004 Did all other tests run to completion on Gaea?
@DeniseWorthen cpld_debug_pdlib_p8_intel fails on gaea as well. pretty much same error messages. Runs on all other machines are ok.
Hm, that is very odd, because I tested extensively on Gaea in debug for the unstructured mesh. But all my testing was prior to the upgrade. It is failing at a call to piosync in the atmwav test, which makes me think it might be a platform issue. I'm also seeing a lot of sticky behaviour w/ file system (worse than normal for Gaea) since the upgrade. The atmw test will run with
I've also gotten the atmwav test to run w/ subset but w/ increased resources. We're not using everything we're requesting right now (the job_card requests 256 and we're only using 180). Bumping the ww3 resources a bit resolves the issue w/ this test. Still debugging the debug test, which is running close to its wall clock anyway.
* increased resources on gaea for cpld_debug_pdlib_p8 and atmwav_control_noaero_p8
* switched debug test to use box rearranger
@jkbk2004 I've made platform specific modifications to the two tests on Gaea and run successfully.
I continue to see more failures, all with a libpthread-2.31.s error. There appears to be some impact on non-wave HAFS cases. It may need a general resource increase through a change to the Gaea TPN. Note that the develop branch runs ok.
@jkbk2004 As I said earlier, I can't see into your Gaea run directories. I will need to repeat the tests on Gaea.
@jkbk2004 If you have another PR ready, please move on w/ it and give me time to debug.
@jkbk2004 Also, are these failures Gaea specific, or are you seeing similar failures elsewhere?
@DeniseWorthen The issue is Gaea specific. I think it might be resolved with a resource increase such as TPN=84 or 96. I think the Gaea default is TPN=128.
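To make the TPN suggestion concrete: lowering the tasks per node (TPN) spreads the same number of MPI tasks over more nodes, which leaves more memory per task. The sketch below only does that arithmetic. The task count of 256 is borrowed from the job_card figure mentioned earlier in the thread as an example, since the task counts of the failing tests are not stated here, and `nodes_needed` is a hypothetical helper.

```python
import math


def nodes_needed(tasks: int, tpn: int) -> int:
    """Whole nodes required to place `tasks` MPI tasks at `tpn` tasks per node."""
    if tasks <= 0 or tpn <= 0:
        raise ValueError("tasks and tpn must be positive")
    return math.ceil(tasks / tpn)


# TPN values taken from the comment above; 256 tasks is an illustrative figure.
for tpn in (128, 96, 84):
    print(f"TPN={tpn}: {nodes_needed(256, tpn)} nodes")
```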
@jkbk2004 The test failures that do not include waves do not make any sense. There should not be any impact on a non-wave containing test. These three tests are not different than the develop branch.
* to resolve failures on gaea, the default ww3 rearranger is set as box on all platforms. All tests run with current resources w/ slight bump for the debug_pdlib case. A followup issue will be created for the rearranger failure on Gaea and Gaea SAs will be contacted.
@jkbk2004 I've reset the default rearranger for WW3 to
@jkbk2004 What is the scheduled commit date for this PR?
@DeniseWorthen wcoss2 is on maintenance this week, so we decided to let PRs with no baseline change go first. 11/11 is Veterans Day. We can schedule this PR on 11/12.
Commit Queue Requirements:
Description:
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Changes
Regression Test Changes (Please commit test_changes.list):
New Baselines are required for all tests which include the WAV component. Answers do not change, but the comparison lists will now include a WW3 netCDF restart file. Note we do not currently compare the WW3 binary restart files for any global coupled test because they don't in general reproduce themselves.
To verify no answer changes, the WW3 restarts were temporarily removed from the comparison lists, but with netCDF restarts still written and used for the restart tests. All baselines passed against the develop-20240909 baseline on Hercules at 0b0a048.
I've continued to test this PR against the current develop branch using the method of temporarily removing the netCDF WW3 restart files from the comparison lists. This feature branch has continued to pass as the final changes were made to the WW3 feature branch, most recently using 79cfd42.
I've also created a baseline using this PR at the above hash and verified against it. In this case, the netCDF restart files are being compared. All baselines pass.
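The verification method described above (temporarily dropping the WW3 restart files from the comparison lists) can be sketched as a simple filter. The file names, the `.nc` suffix on the netCDF restart, and the `*ww3.r.*` pattern below are assumptions for illustration; the actual comparison lists live in the regression-test definitions and may be structured differently.

```python
from fnmatch import fnmatchcase


def drop_ww3_restarts(list_files, pattern="*ww3.r.*"):
    """Return the comparison list without entries matching the WW3 restart pattern."""
    return [name for name in list_files if not fnmatchcase(name, pattern)]


# Hypothetical comparison list for a coupled test.
files = [
    "sfcf021.nc",
    "atmf021.nc",
    "ufs.cpld.ww3.r.2021-03-23-21600.nc",
    "RESTART/example.res.nc",
]
filtered = drop_ww3_restarts(files)
print(filtered)
```

With the filtered list, the remaining files can still be compared against the existing develop baseline, which is what allows an answer-change check before new baselines exist.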
In testing, it was found that Hercules+GNU failed for the subset rearranger, but worked for box. The relevant tests were switched to box only for Hercules+GNU tests. To verify that the problem is a platform (Hercules) issue, GNU tests were then run on Derecho against a self-baseline and all tests passed at 677cfd9.
On Hercules, a full RT test_changes.list has been committed. Examining the log files shows test failures are due to missing netCDF WW3 restarts. For these tests, no files were found to 'not compare'.
Input data Changes:
Testing Log: