Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303
Conversation
* remove wav_grdout routine, now moved into wav_history_mod
* remove more cap stuff from w3iorsmd; only ww3 native filenaming is possible w/ binary restarts
* remove ifdef w3_ascii from w3wavemd, since the ascii commit is not yet present in the mesh cap branch
* clean up config variable logic for filenaming
* nrqrs can be non-zero if also using the restart-from-binary option
* flpart needs to be set whether historync is used or not
* move restart and history nc blocks outside of the ww3 time-testing block
* add log print messages for reading and writing restarts
* get logging working correctly for ufs
* fix noclobber variable and file name in wav_history
* clarify some comments
* make binary history files match when restartnc=true
* rework wav_restart_mod, which originally was designed to be able to read and write restarts for testing purposes from inside wav_comp_nuopc
* verboselog is true by default, but can be set false by config
* the header text for ww3 logging as it steps through time is now turned off in w3init and placed into the mesh cap; this allows the mesh cap to order the logging correctly
* move the block where addrstflds is set to before the call to w3init, since restarts are read in w3init
* ensure that if the nml file lacks a specification of extra fields, the default value of "unset" will not be returned as a field name
* only ice is added for now
* tab cleanup in w3grid
* need to send explicit array bounds for the ice array since it is 0:nsea
* all baselines b4b against f9531d0
* initialize floutg and floutg2, which can be uninitialized when waves are in the slow loop and historync is true
@MatthewMasarik-NOAA Can I ask if you've tested the mesh you provided me in debug mode? When using the dev/ufs-weather-model branch, I am getting a failure in debug mode at line 1388 in w3init (lines 1384 to 1398 at 7f548c7).
Hi @DeniseWorthen, I have not tested this mesh in either standard or debug mode. I was provided this HR4 mesh by @AliS-Noaa and @JessicaMeixner-NOAA, who have done the unstructured grid creation and could probably give that information. You had mentioned offline that at one point Ali had told you a grid was passing in standalone, but not in coupled. I had not been briefed on that. Do you know if anyone followed up on that at the time? Or can you confirm if this new mesh is the same as the one you were discussing with Ali?
@MatthewMasarik-NOAA The remark about failing in coupled mode was from an informal lunch-time discussion I had w/ Ali some time ago, where he simply mentioned in passing what he was working on. It was hardly a "briefing" I had on the subject. I have only the mesh you provided me, as well as one which Jiande gave me, which he is currently using for HR4.
@DeniseWorthen, understood. That is more than has been shared with me, so you know more than I do at this point.
I think we might be getting meshes confused as well. HR4 was originally supposed to have an update and use the new mesh from @AliS-Noaa; however, due to time, it is using the same mesh as HR3. A new HR with updated wave grids will happen. @sbanihash I think @DeniseWorthen has pointed out an oversight on our part: we should test this new mesh with debug mode on. Would you be able to assign someone to run the new mesh in debug mode?
@JacobCarley-NOAA @junwang-noaa @GeorgeVandenberghe-NOAA @CatherineThomas-NOAA I've updated the PR description above with the results of my performance testing. I'm tagging you all here in case you don't watch this repo.
@DeniseWorthen Thanks for the testing. Would you please also check the max time instead of the mean time for ww3?
@junwang-noaa If I look at the time series, the maximum time will most likely always be at hour=6, since WW3 gets a bit faster as it spins up. I'm not sure max gives a good measure of the cost of a restart write in this case.
@DeniseWorthen excellent work - thanks for tagging me. The new capability seems to show a nice improvement. Do you have performance numbers for the overall runtime? (e.g., for whatever a typical forecast length is in operations).
@JacobCarley-NOAA For these tests, I was using a DATM configuration in order to focus solely on WW3 restart performance. I have not done any testing w/ the actual operational configuration. I think George V. has been doing that testing.
@DeniseWorthen Let me clarify if I understand correctly. By "max time" I mean the "Max" time in ESMF_Profile.summary; usually that time gives the total run time of a model component.
I have. With the hr4 testcase I got from Ruiyu, enabling restarts adds 800 seconds to a 72-hour forecast with 7 restarts (12-hour restart interval) on WCOSS2. I will note again that the Ruiyu testcase does not run on hercules or orion; it hangs after the wave model is reported as starting. This is NOT evidence that the issue is in wave. There is a second point-output issue with WAVE which adds 2000 seconds to startup once, but that is known and being addressed, so I am not worried about it and have reduced point outputs to almost none to get rid of it for now.
When Gaea-C5 comes back from maintenance, I will add that information.
Thanks for the tag @DeniseWorthen. Tagging @RuiyuSun as well.
@DeniseWorthen awesome work! This is very exciting. Thanks for adding the run details to the PR description.
@DeniseWorthen Is your branch https://github.com/DeniseWorthen/WW3/tree/feature/pio4ww3 ready for testing in HR4? If yes, @GeorgeVandenberghe-NOAA, can you use Denise's branch to test ww3 IO performance in HR4? Thanks
@junwang-noaa I am putting in a final change, which is to error out if the dimensions in the netCDF restart file don't match. It should be finished by today. There will need to be changes in
* tested by using nco to alter the nth,nk values in a restart file. The model correctly aborted after finding nth did not match the expected value
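The dimension check described in this bullet might be sketched as follows. This is a Python stand-in for the Fortran logic in `wav_restart_mod.F90`; the function and variable names here are hypothetical, not the PR's actual code:

```python
def check_restart_dims(found, expected):
    """Compare dimension sizes read from a restart file against the values
    the running model expects, and abort on any mismatch.

    Both arguments are dicts mapping dimension name -> size, standing in
    for the netCDF dimension-inquiry calls the Fortran code would make.
    """
    for name, want in expected.items():
        have = found.get(name)
        if have != want:
            raise RuntimeError(
                f"restart dimension mismatch for '{name}': "
                f"file has {have}, model expects {want}")

# Example: a file whose nth was altered (e.g. with nco) no longer matches,
# so the check fails fast instead of silently reading bad spectra.
expected = {"nth": 36, "nk": 50}
tampered = {"nth": 24, "nk": 50}
try:
    check_restart_dims(tampered, expected)
except RuntimeError as err:
    print(err)
```

Failing at the dimension check, rather than during the field reads, matches the abort behavior the bullet above describes.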
@MatthewMasarik-NOAA we are about to merge ufs-community/ufs-weather-model#2463. We are considering working on this PR after PR 2463. Let me know if you need more time to review this PR.
Hi @jkbk2004. I have the tests going on hera right now, and am working on the code review. It is a significant PR, though, so I could use a little more time to review if that's OK for your schedule.
* delay the deallocation of gindex arrays until after the diagnose_mesh routine is called. This only impacts cases where dbug_flag>5, compiled with gnu+debug
The RTs just came back and here is the test_changes.list. The tests were run before the latest commit, and from a quick look at the list, the changes might all be explained by that. I'll have a closer look tomorrow.
The test_changes.list is not related to the latest commit; that commit fixes only an issue seen when dbug_flag>5, and dbug_flag=0 for all tests. The test_changes.list is because we do not compare ww3 restart files for any cpld test, since ww3 restart files do not reproduce. All of the tests should report that the reason for failure is that the restart file (prefix.ww3.r.date.nc) does not exist in the baseline. Let me know if that is not the case.
Okay, thanks for that information. I suppose I should re-run the tests at any rate after the new commit. Do you anticipate any more commits?
I will run against the UWM baseline on hercules overnight and post the test_changes.list.
I ran the RTs last night on hera (RegressionTests_hera.log, test_changes.list). Most of the tests on the change list are failures due to comparison of wave restarts (which do not pass), as @DeniseWorthen pointed out. A few are 'restart' tests that failed to start, so that is expected as well. There are only two changes then remaining:
I think the issue there is that currently in rt.conf, the iau test did not run for the same reason. The test it is dependent on (
Great, thank you for that clarification. I'm okay with the testing then. I'll work on finishing the code review today. If everything goes well, it should be ready late today or early tomorrow. Fyi @jkbk2004
@MatthewMasarik-NOAA all sounds good. We can go ahead with the commit for full testing this afternoon. @DeniseWorthen, can you attach the hera pre-test log and push test_changes.list at ufs-community/ufs-weather-model#2445?
Code review: Pass
Testing: Pass
See logs attached to #1303 (comment) and description in #1303 (comment) as well.
Approved.
Thanks for your patience
Pull Request Summary
Enable netCDF restarts and run-time netCDF history using PIO.
Description
PIO is configured via configuration settings in WAV attributes for UFS and via share code for CESM. Defaults are provided for the stride, number of IO tasks, and rearranger settings. The PIO subsystem is managed and initialized in `wav_pio_mod.F90`.

PIO/netCDF restarts are enabled by setting `use_restartnc=true` in the configuration attributes; otherwise native WW3 binary restarts will be written. If netCDF restarts are enabled, an additional option exists to restart from an existing binary restart (`restart_from_binary=true`). In this case, the binary restart filename needs the same form as the netCDF restarts (minus the `.nc` at the end).

With netCDF restarts enabled, restarts will be written at the `rstwr` frequency defined in the cap, which will allow both flexible (i.e., non-interval) restarts and end-of-run restarts to be enabled in the same way as for other coupled components in UFS. These two features are not implemented in this work.

The netCDF restarts are read and written in a new routine, `wav_restart_mod.F90`. Only two fields (`va` and `mapsta`) are written to the restart file by default. As noted in the comment in #1298, when waves are in the slow-coupling loop, the ice field is also written. The addition of extra restart fields is generalized via the already-existing `restart%extra` namelist.

Runtime netCDF history is enabled by setting `use_historync=true` in the configuration attributes; otherwise native WW3 binary history files will be written. The PIO/netCDF capability builds on the existing serial netCDF capability in the mesh cap. The two existing routines used for this (`w3iogoncdmd.F90`, `wav_grdout.F90`) have been reduced to a single new netCDF history module, `wav_history_mod.F90`, containing the same functionality.

History files can be custom-named (if `user_histname = true`); otherwise they will be named using the native WW3 file naming. History files will be written at the `histwr` frequency. This frequency will be set using the normal `history_n`, `history_option` settings, if present. Otherwise, the `field%stride` frequency will be used to determine the `histwr` frequency. (This requires that the stride be set in units of seconds, which is less flexible than enabling the `history_n`, `history_option` settings.)

Modifications to three routines (`w3wavemd`, `w3iorsmd`, and `w3iogomd`) were required for the mesh cap to allow for additional flexibility of restart and history writing. The changes in `w3iogomd` were resolved w/ the PR to dev/ufs-weather-model for the Langmuir turbulence calculations (#1034). With the new restartnc and historync options via PIO, the changes to the other two routines (`w3wavemd` and `w3iorsmd`) are no longer required, meaning these two routines are now much more closely aligned with the same code in the develop branch. Since the dev/ufs-weather-model branch has not been updated for nearly a year, the modifications are best seen by comparing these two routines in this PR branch against the develop branch at 4d8c315.
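As a rough illustration of how these options fit together, the configuration attributes described above might look something like the following. The attribute names are taken from this description; the exact syntax, grouping, and defaults in the UFS/CESM configuration files are assumptions:

```
! Hypothetical WAV attribute settings (names from this PR; syntax assumed)
use_restartnc       = true    ! PIO/netCDF restarts instead of WW3 binary
restart_from_binary = false   ! optionally start from an existing binary restart
use_historync       = true    ! run-time netCDF history via PIO
user_histname       = false   ! use native WW3 history file naming
pio_stride          = 4       ! spacing between IO tasks
pio_rearranger      = subset  ! box or subset
```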
Performance Testing
To test the performance of the new PIO/netCDF restarts vs the existing binary restarts, a series of runs was made using a DATM + 1/4 deg MOM6+CICE6 configuration with WW3 utilizing the `uglo_15km-hr4_BlkSlake_360` mesh (~1.8M spatial elements) with 1800 frequencies + directions. Although this PR also enables PIO to write the run-time netCDF history files (vs the current serial netCDF implementation), no performance testing was done for the history writes.

Cases were run for 24 hours, with all component output turned off except for WW3 restarts, which were written every 2 hours. A range of resources was tested for WW3, from ~400 to ~2800 tasks. The primary metric used for evaluation was the mean time for a Model Advance on non-restart intervals vs restart intervals (the restart-write "penalty"). Because WW3 is cold-starting, the first 6 hours of integration were ignored.
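The restart-penalty metric described above can be sketched as follows. This is a minimal Python illustration with hypothetical timings, not the analysis script used for these runs:

```python
def restart_penalty(advance_times, restart_hours, spinup_hours=6):
    """Mean Model Advance time on restart-write intervals minus the mean
    on non-restart intervals, ignoring the cold-start spin-up period.

    advance_times: dict mapping forecast hour -> advance wall time (s).
    restart_hours: set of hours at which a restart was written.
    """
    rst, norm = [], []
    for hour, t in advance_times.items():
        if hour <= spinup_hours:
            continue  # WW3 is still spinning up; timings not representative
        (rst if hour in restart_hours else norm).append(t)
    return sum(rst) / len(rst) - sum(norm) / len(norm)

# Hypothetical 24-h run: 10 s advances, +5 s on the 2-hourly restart writes
times = {h: 10.0 + (5.0 if h % 2 == 0 else 0.0) for h in range(1, 25)}
print(restart_penalty(times, set(range(8, 25, 2))))  # -> 5.0
```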
The primary configuration options for PIO are the stride (and/or number of IO tasks) and the rearranger. In addition to these options, two options were tested for writing the `VA` field: either as a single array, dimensioned (1.8M, 1800), or with each spectral index written to a separate, named field (e.g., `va.0001` through `va.1800`). This "multi-field" option creates an identically sized restart file and has also been tested in UWM, where it passes all tests. Thanks go to @jedwards4b for helping me get the single array working for the subset rearranger and the operational mesh, which allowed me to test the single-field vs multi-field options.

Initial scout runs were made to identify the best combination of stride, rearranger, and va-array structure. At all WAV task counts tested, the best results were consistently found using `pio_stride=4`, `pio_rearranger=subset`, and with the `VA` field written as multiple fields. The finding that stride=4 was best was a little unexpected, since by default the stride is set to increase as resources increase. I think that because we are under-subscribing the nodes in order to provide enough memory to WW3, this acts as a sort of de facto striding, so stride=4 (and even stride=2) can be used even at high task counts.

Once the best combination of parameters was found for PIO, each case was run a total of three times. The binary case was also run three times, and all runs were averaged. The raw results can be found here.
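As a hedged illustration of the stride/IO-task trade-off discussed above: the sketch below assumes the common PIO convention that every stride-th compute task serves as an I/O task. It is not code from this PR:

```python
def n_iotasks(ntasks, stride):
    """Number of PIO I/O tasks implied by a stride, assuming every
    stride-th compute task also acts as an I/O task (at least one)."""
    return max(1, ntasks // stride)

# A fixed stride=4 keeps the I/O task pool growing with resources,
# whereas a default that increases the stride with task count would not.
print(n_iotasks(400, 4))    # -> 100
print(n_iotasks(2800, 4))   # -> 700
```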
The summary results are presented below.
Issue(s) addressed
Commit Message
Check list
Testing
Using UFS-WM, a baseline was created and verified to pass for all tests which include a WAV component. Full testing as well as scalability results will be documented in the associated UFS-WM PR.