Goal: make better use of mpaso blocks
Problem: mpas-o doesn't abort early enough for block stats to be helpful in diagnosing the source of errors when a crash happens (as seen during the v3 critical path crashes, e.g. the v3alpha04bigrid crash investigation). This requires restarting with high-frequency output to chase the source of the crash.
It generates 20+ blocks, which I never look at, and ocean_validate() only exits once NaNs are in too many fields to be informative.
Current state:
I tested a couple of approaches adding a tracer check (back in Oct 2023): ocn_validate and ocn_validate2.
In the record of runs, these fail and exit with a single block -- which I find more useful.
The second one was an attempt to get a more useful error message (detailing the reason for the failure), but it needs more work.
@cbegeman you may want to add a check like that in your current HR debugging if that helps.
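For illustration, here is a minimal sketch of the kind of tracer check I mean. It is written in Python rather than Fortran and is not the code in either branch. The tracer names, bounds, data layout, and the `check_tracers` / `TracerCheckError` names are all hypothetical. The idea is to stop at the first bad value and report why and where, so the run fails with one informative message instead of many blocks.

```python
import math

# Hypothetical per-tracer bounds, for illustration only; real limits
# would have to come from the model configuration.
TRACER_BOUNDS = {
    "temperature": (-5.0, 50.0),
    "salinity": (0.0, 50.0),
}


class TracerCheckError(RuntimeError):
    """Raised at the first invalid tracer value so the run stops early."""


def check_tracers(tracers, bounds=TRACER_BOUNDS):
    """Fail on the first non-finite or out-of-range tracer value.

    `tracers` maps a tracer name to a list of cells, each cell being a
    list of per-level values (an assumed layout, not the MPAS-O one).
    Returns None when every value is valid.
    """
    for name, cells in tracers.items():
        lo, hi = bounds[name]
        for i_cell, column in enumerate(cells):
            for k, value in enumerate(column):
                if not math.isfinite(value):
                    reason = "non-finite value"
                elif not lo <= value <= hi:
                    reason = f"outside [{lo}, {hi}]"
                else:
                    continue
                # Report the reason and location of the first failure only.
                raise TracerCheckError(
                    f"{name}: {reason} ({value}) at cell {i_cell}, level {k}"
                )
```

Calling `check_tracers(...)` once per time step (or every few steps) would abort as soon as a single value goes bad, before NaNs spread to other fields.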
Note: this investigation also revealed a separate issue with the abort() call in framework/mpas_log.F, which hangs instead of exiting cleanly on a distributed layout. Flat/stacked layout is fine. @jonbob was a real help on this!