Update WW3 for PIO/netCDF restarts #2445

Open
wants to merge 65 commits into
base: develop

Conversation

DeniseWorthen
Collaborator

@DeniseWorthen DeniseWorthen commented Sep 24, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

Commit Message:

* UFSWM - add PIO settings to WAV attributes in ufs.configure templates
* UFSWM - update ww3_shel.nml to allow the ice field to be written to the restart file when required (i.e., waves in the slow loop)
* UFSWM - add WW3 restart files to comparison lists
* WW3 - Add netCDF PIO capability for restarts and run-time history

Priority:

  • High - required for GFSv17

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:


Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Updates/Changes Baselines.

New Baselines are required for all tests which include the WAV component. Answers do not change, but the comparison lists will now include a WW3 netCDF restart file. Note we do not currently compare the WW3 binary restart files for any global coupled test because they don't in general reproduce themselves.

To verify that answers do not change, the WW3 restarts were temporarily removed from the comparison lists, while netCDF restarts were still written and used for the restart tests. All baselines passed against develop-20240909 on Hercules at 0b0a048.

I've continued to test this PR against the current develop branch using the method of temporarily removing the netCDF WW3 restart files from the comparison lists. This feature branch has continued to pass as the final changes were made to the WW3 feature branch, most recently using 79cfd42.

I've also created a baseline using this PR at the above hash and verified against it. In this case, the netCDF restart files are being compared. All baselines pass.
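
The "temporarily remove the WW3 restarts from the comparison lists" step can be illustrated with a minimal sketch. It assumes the WW3 netCDF restart names follow the pattern seen in the logs in this PR (for example `ufs.cpld.ww3.r.2021-03-23-21600.nc`); the helper name `drop_ww3_restarts` and the non-WW3 list entry are hypothetical and are not part of the RT scripts.

```python
import re

# Assumption: WW3 netCDF restarts are named <case>.ww3.r.YYYY-MM-DD-SSSSS.nc,
# as in the log excerpts in this PR.
WW3_NC_RESTART = re.compile(r"\.ww3\.r\.\d{4}-\d{2}-\d{2}-\d{5}\.nc$")


def drop_ww3_restarts(files):
    """Return the comparison list without WW3 netCDF restart files."""
    return [f for f in files if not WW3_NC_RESTART.search(f)]


compare_list = [
    "sfcf021.nc",                          # hypothetical non-WW3 entry
    "ufs.cpld.ww3.r.2021-03-23-21600.nc",  # WW3 netCDF restart (dropped)
    "ufs.cpld.ww3.r.2021-03-23-21600",     # binary restart (no .nc, kept)
]
print(drop_ww3_restarts(compare_list))
```

With the WW3 entries filtered out, any remaining difference against the existing baseline would indicate a real answer change rather than a newly added file.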

In testing, it was found that Hercules+GNU failed for the subset rearranger, but worked for box. The relevant tests were switched to box only for Hercules+GNU tests. To verify that the problem is a platform (Hercules) issue, GNU tests were then run on Derecho against a self-baseline and all tests passed at 677cfd9.

rt_cpld_control_nowave_noaero_p8_gnu.log:Test cpld_control_nowave_noaero_p8_gnu PASS
rt_cpld_control_p8_gnu.log:Test cpld_control_p8_gnu PASS
rt_cpld_control_pdlib_p8_gnu.log:Test cpld_control_pdlib_p8_gnu PASS
rt_cpld_debug_p8_gnu.log:Test cpld_debug_p8_gnu PASS
rt_cpld_debug_pdlib_p8_gnu.log:Test cpld_debug_pdlib_p8_gnu PASS
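
Summary lines in this `logfile:Test name STATUS` form are easy to tabulate. The following is a sketch only: it assumes the grep-style layout shown in this PR, and `parse_summary` is a hypothetical helper, not part of the RT tooling.

```python
import re

# Matches lines such as
#   rt_cpld_control_p8_gnu.log:Test cpld_control_p8_gnu PASS
SUMMARY = re.compile(r"^(?P<log>\S+\.log):Test (?P<test>\S+) (?P<status>PASS|FAIL)$")


def parse_summary(lines):
    """Map each test name to its PASS/FAIL status."""
    results = {}
    for line in lines:
        match = SUMMARY.match(line.strip())
        if match:
            results[match.group("test")] = match.group("status")
    return results


lines = [
    "rt_cpld_control_p8_gnu.log:Test cpld_control_p8_gnu PASS",
    "rt_cpld_debug_pdlib_p8_gnu.log:Test cpld_debug_pdlib_p8_gnu PASS",
    "rt_cpld_mpi_gfsv17_intel.log:Test cpld_mpi_gfsv17_intel FAIL",
]
results = parse_summary(lines)
failed = sorted(name for name, status in results.items() if status == "FAIL")
print(failed)
```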

On Hercules, a full RT test_changes.list has been committed. Examining the log files shows that the test failures are due to missing netCDF WW3 restarts in the baseline. For these tests, no files were found to 'not compare'.

rt_atmwav_control_noaero_p8_intel.log: Comparing ufs.atmw.ww3.r.2021-03-22-64800.nc ............MISSING baseline
rt_cpld_2threads_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_bmark_p8_intel.log: Comparing ufs.cpld.ww3.r.2013-04-01-21600.nc ............MISSING baseline
rt_cpld_control_c192_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-43200.nc ............MISSING baseline
rt_cpld_control_ciceC_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_noaero_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_p8.v2.sfc_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_p8_faster_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_p8_mixedmode_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_pdlib_p8_gnu.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_control_qr_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_debug_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-22-32400.nc ............MISSING baseline
rt_cpld_debug_noaero_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-22-32400.nc ............MISSING baseline
rt_cpld_debug_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-22-32400.nc ............MISSING baseline
rt_cpld_debug_pdlib_p8_gnu.log: Comparing ufs.cpld.ww3.r.2021-03-22-32400.nc ............MISSING baseline
rt_cpld_debug_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-22-32400.nc ............MISSING baseline
rt_cpld_decomp_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_cpld_mpi_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600.nc ............MISSING baseline
rt_hafs_regional_atm_ocn_wav_intel.log: Comparing ufs.hafs.ww3.r.2019-08-29-21600.nc ............MISSING baseline
rt_hafs_regional_atm_wav_intel.log: Comparing ufs.hafs.ww3.r.2019-08-29-21600.nc ............MISSING baseline
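
To confirm that every failure is a missing baseline rather than a real difference, the "Comparing" lines can be grouped by their trailing status. This is a sketch under the assumption that the lines look like the excerpts in this PR ("MISSING baseline" here, "NOT IDENTICAL" after "USING CMP" elsewhere in the thread); `group_by_status` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches lines such as
#   rt_x.log: Comparing file.nc ............MISSING baseline
#   rt_x.log: Comparing file .....USING CMP......NOT IDENTICAL
COMPARE = re.compile(
    r"^(?P<log>\S+\.log):\s*Comparing (?P<file>\S+) \.+"
    r"(?:USING \w+\.+)?(?P<status>.+)$"
)


def group_by_status(lines):
    """Group (log, file) pairs by the comparison status at the end of the line."""
    groups = defaultdict(list)
    for line in lines:
        match = COMPARE.match(line.strip())
        if match:
            key = match.group("status").strip()
            groups[key].append((match.group("log"), match.group("file")))
    return dict(groups)


sample = [
    "rt_atmwav_control_noaero_p8_intel.log: Comparing ufs.atmw.ww3.r.2021-03-22-64800.nc ............MISSING baseline",
    "rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL",
]
groups = group_by_status(sample)
for status, entries in groups.items():
    print(status, len(entries))
```

If the only key produced for the failing tests is "MISSING baseline", no file was actually found to differ.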

Input data Changes:

  • None

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

DeniseWorthen and others added 30 commits July 27, 2024 15:14
at cc70186, the following files do not compare

rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_mpi_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_restart_bmark_p8_intel.log: Comparing ufs.cpld.ww3.r.2013-04-01-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_restart_c192_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-43200 .....USING CMP......NOT IDENTICAL
* add trho fix to w3iors, these ww3.r files do not compare
* tested against bl.trhofix

rt_cpld_mpi_gfsv17_intel.log:Test cpld_mpi_gfsv17_intel FAIL
rt_cpld_mpi_pdlib_p8_intel.log:Test cpld_mpi_pdlib_p8_intel FAIL
rt_cpld_restart_bmark_p8_intel.log:Test cpld_restart_bmark_p8_intel FAIL
rt_cpld_restart_c192_p8_intel.log:Test cpld_restart_c192_p8_intel FAIL
* no write/read of fpis. these ww3.r files do not compare. tested
against bl.trhofix.nofpis. all other files compare b4b

rt_cpld_mpi_gfsv17_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
rt_cpld_mpi_pdlib_p8_intel.log: Comparing ufs.cpld.ww3.r.2021-03-23-21600 .....USING CMP......NOT IDENTICAL
* fix typo in use_historync
* remove mediator_present flag (unneeded)
* following pass baseline
cpld_debug_noaero_p8
cpld_debug_pdlib_p8
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6
* tested all wave-containing tests with modifications for restart
file naming to allow for the custom filenaming of binary restarts.
This feature is present in the current WW3 code, but will be removed
once we enable netcdf restarts. Temporary code was added to allow the
binary restart to have the existing format of casename+ww3.r+timestring.
With this modification, all baselines were B4B.
* ww3 hash 4674dae passes against a self-generated baseline except
for cpld_restart_gfsv17_intel
* compare cmeps restart files of this uwm-hash against current baseline
at develop-20240904. All are identical except for cpld_control_gfsv17_iau_intel
* ww3 0ad634c9 still fails slow restart, even though my
sandbox testing passed.
* additional restart fields for WW3/slow loop coupling are
requested via ww3 nml setting
@jkbk2004
Collaborator

jkbk2004 commented Oct 25, 2024

Code is crashing on Gaea. Experiment path: /gpfs/f5/epic/scratch/Jong.Kim/RT_RUNDIRS/Jong.Kim/FV3_RT/rt_2431768/atmwav_control_noaero_p8_intel

+ srun --label -n 256 ./fv3.exe
151: forrtl: severe (174): SIGSEGV, segmentation fault occurred
151: Image              PC                Routine            Line        Source
151: libpthread-2.31.s  00007FDC23EAD910  Unknown               Unknown  Unknown
151: libmpi_intel.so.1  00007FDC22E141C2  Unknown               Unknown  Unknown
srun: error: c5n1298: task 151: Exited with exit code 174

The develop branch runs OK on Gaea.

@DeniseWorthen
Collaborator Author

@jkbk2004 I can't see your run directory on Gaea. I'll do my own test.

@DeniseWorthen
Collaborator Author

@jkbk2004 Did all other tests run to completion on Gaea?

@jkbk2004
Collaborator

> @jkbk2004 Did all other tests run to completion on Gaea?

@DeniseWorthen cpld_debug_pdlib_p8_intel fails on Gaea as well, with pretty much the same error messages. Runs on all other machines are OK.

@DeniseWorthen
Collaborator Author

Hm, that is very odd, because I tested extensively on Gaea in debug for the unstructured mesh. But all my testing was prior to the upgrade. It is failing at a call to piosync in the atmwav test, which makes me think it might be a platform issue. I'm also seeing a lot of sticky behaviour w/ file system (worse than normal for Gaea) since the upgrade. The atmw test will run with box, I'll try the same fix for the pdlib debug.

@DeniseWorthen
Collaborator Author

I've also gotten the atmwav test to run w/ subset but w/ increased resources. We're not using everything we're requesting right now (the job_card requests 256 and we're only using 180). Bumping the ww3 resources a bit resolves the issue w/ this test. Still debugging the debug test---which is running close to its wall clock anyway.

* increased resources on gaea for cpld_debug_pdlib_p8 and
atmwav_control_noaero_p8
* switched debug test to use box rearranger
@DeniseWorthen
Collaborator Author

@jkbk2004 I've made platform specific modifications to the two tests on Gaea and run successfully.

@jkbk2004
Collaborator

jkbk2004 commented Oct 28, 2024

I continue to see more failures, all with the libpthread-2.31.s error. There appears to be some impact on non-wave HAFS cases. A general resource increase through a change to the Gaea TPN may be needed. Note that the develop branch runs OK.

hafs_regional_1nest_atm_intel failed in run_test
hafs_regional_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
hafs_regional_telescopic_2nests_atm_intel failed in run_test
atmwav_control_noaero_p8_intel failed in run_test
cpld_control_pdlib_p8_intel failed in run_test
cpld_debug_noaero_p8_intel failed in run_test
cpld_debug_p8_intel failed in run_test
cpld_debug_pdlib_p8_intel failed in run_test

@DeniseWorthen
Collaborator Author

@jkbk2004 As I said earlier, I can't see into your Gaea run directories. I will need to repeat the tests on Gaea.

@DeniseWorthen
Collaborator Author

@jkbk2004 If you have another PR ready, please move on w/ it and give me time to debug.

@DeniseWorthen
Collaborator Author

@jkbk2004 Also, are these failures Gaea specific, or are you seeing similar failures elsewhere?

@jkbk2004
Collaborator

> @jkbk2004 Also, are these failures Gaea specific, or are you seeing similar failures elsewhere?

@DeniseWorthen The issue is Gaea specific. I think it might be resolved with a resource increase, such as TPN=84 or 96. I think the Gaea default is TPN=128.
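
For reference, the effect of lowering TPN (tasks per node) on the node count can be sketched as below. This assumes one MPI task per slot with no threading and uses the 256 tasks from the `srun -n 256` line earlier in the thread; `nodes_needed` is a hypothetical helper, not an RT function.

```python
import math


def nodes_needed(ntasks, tpn):
    """Number of nodes required to place ntasks at tpn tasks per node."""
    return math.ceil(ntasks / tpn)


ntasks = 256  # from the srun line quoted above
for tpn in (128, 96, 84):
    print(f"TPN={tpn}: {nodes_needed(ntasks, tpn)} nodes")
```

Under these assumptions the job would spread from 2 nodes at TPN=128 to 3 nodes at TPN=96 and 4 nodes at TPN=84, giving each task a larger share of each node's memory.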

@DeniseWorthen
Collaborator Author

@jkbk2004 The test failures that do not include waves do not make any sense. There should not be any impact on a non-wave containing test. These three tests are not different than the develop branch.

hafs_regional_1nest_atm_intel failed in run_test
hafs_regional_storm_following_1nest_atm_intel failed in run_test
hafs_regional_telescopic_2nests_atm_intel failed in run_test

* to resolve failures on gaea, the default ww3 rearranger is set
as box on all platforms. All tests run with current resources w/
slight bump for the debug_pdlib case. A followup issue will be
created for the rearranger failure on Gaea and Gaea SAs will be
contacted.
@DeniseWorthen
Collaborator Author

@jkbk2004 I've reset the default rearranger for WW3 to box for all cases. I've run all non-standalone baselines on Gaea w/ this change and all tests run to completion with the only difference w/ current baselines being the addition of the WW3 netcdf restart files. I believe this PR is now ready.

@DeniseWorthen
Collaborator Author

@jkbk2004 What is the scheduled commit date for this PR?

@jkbk2004
Collaborator

jkbk2004 commented Nov 6, 2024

> @jkbk2004 What is the scheduled commit date for this PR?

@DeniseWorthen WCOSS2 is on maintenance this week, so we decided to let PRs with no baseline changes go first. 11/11 is Veterans Day. We can schedule this PR for 11/12.

Development

Successfully merging this pull request may close these issues.

Add netcdf restart and history files using PIO (parallel netCDF) for dev/ufs-weather-model
3 participants