
Create the provis_state subpool at RK4 initialization to avoid memory leak #87

Closed

Conversation

@sbrus89 (Collaborator) commented Apr 2, 2024

I have noticed a memory leak in the RK4 timestepping when running 125-day, single-layer barotropic tide cases with the vr45to5 mesh (MPAS-Dev/compass#802) on pm-cpu. I can typically only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the provis_state subpool at each timestep. We had a similar issue a few years back that required memory-leak fixes in the mpas_pool_destroy_pool subroutine (MPAS-Dev/MPAS-Model#367). However, I believe there is still a memory leak in mpas_pool_remove_subpool (which calls pool_remove_member), which is called after mpas_pool_destroy_pool. The TODO comment at https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests that things may not be completely cleaned up by this subroutine.
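For readers less familiar with the pools framework, here is a rough sketch of the per-timestep pattern described above. The variable names (block, statePool, provisStatePool) are illustrative, and the actual RK4 code may differ:

   ! Illustrative sketch only: provis_state is built from scratch at the
   ! start of each timestep and torn down again at the end.
   type (block_type), pointer :: block
   type (mpas_pool_type), pointer :: statePool, provisStatePool

   call mpas_pool_get_subpool(block % structs, 'state', statePool)

   ! Start of timestep: create the subpool, clone the state into it with a
   ! single time level, and register it under 'provis_state'.
   call mpas_pool_create_pool(provisStatePool)
   call mpas_pool_clone_pool(statePool, provisStatePool, 1)
   call mpas_pool_add_subpool(block % structs, 'provis_state', provisStatePool)

   ! ... the RK4 stages operate on provis_state ...

   ! End of timestep: destroy and deregister the subpool. The suspected leak
   ! is in this cleanup path (mpas_pool_remove_subpool -> pool_remove_member).
   call mpas_pool_destroy_pool(provisStatePool)
   call mpas_pool_remove_subpool(block % structs, 'provis_state')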

I'm not familiar enough with the details of the pools framework to track down the memory leak itself. In any case, I think it makes more sense to create the provis_state subpool once at initialization rather than creating and destroying it every timestep, as sketched below. This PR is meant to socialize that as a potential approach. The main consequence is that the mpas_pool_copy_pool subroutine needs an overrideTimeLevels option similar to the one used by mpas_pool_clone_pool under the previous approach. I've tested these changes with the vr45to5 test case, and they do allow me to run for a full 125 days.

 - Avoids create/destroy at each timestep
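A minimal sketch of the proposed arrangement, under the same illustrative assumptions as above (the real PR may organize this differently):

   ! At RK4 initialization: create and register provis_state exactly once.
   call mpas_pool_get_subpool(block % structs, 'state', statePool)
   call mpas_pool_create_pool(provisStatePool)
   call mpas_pool_clone_pool(statePool, provisStatePool, 1)
   call mpas_pool_add_subpool(block % structs, 'provis_state', provisStatePool)

   ! At each timestep: refresh the contents of the existing subpool instead
   ! of recreating it. The third argument is the proposed optional
   ! overrideTimeLevels, mirroring mpas_pool_clone_pool.
   call mpas_pool_copy_pool(statePool, provisStatePool, 1)

This keeps the mpas_pool_destroy_pool and mpas_pool_remove_subpool cleanup path out of the timestep loop entirely.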
@sbrus89 added the bug (Something isn't working) label on Apr 2, 2024
@mark-petersen commented
@sbrus89, thanks for your work on this. In recent simulations by @jeremy-lilly and @gcapodag, we had a memory leak problem with RK4 on Perlmutter. We will retest, as this might take care of it! We were trying to run a 125 m single-layer Hurricane Sandy test, very similar to your problem above, and it would run out of memory after a few days.

I spent time fixing RK4 memory leaks back in 2019. I obviously didn't fix them all, and greatly appreciate your effort on this now. For reference, here are the old issues and PRs:
MPAS-Dev/MPAS-Model#137
MPAS-Dev/MPAS-Model#142
MPAS-Dev/MPAS-Model#185

I will also conduct some tests. If they all work, we can move this to E3SM.

@mark-petersen commented
Will compile with a small fix:

--- a/components/mpas-framework/src/framework/mpas_pool_routines.F
+++ b/components/mpas-framework/src/framework/mpas_pool_routines.F
@@ -991,7 +991,7 @@ module mpas_pool_routines
-                  if (present(overrideTimeLevel)) then
+                  if (present(overrideTimeLevels)) then

@xylar (Collaborator) left a comment

@sbrus89, this looks great! I like the improvements you just added based on our discussion today.

@sbrus89 (Collaborator, Author) commented Apr 3, 2024

Thanks, @mark-petersen. Hopefully this helps with @jeremy-lilly and @gcapodag's issue as well.

@sbrus89 (Collaborator, Author) commented Apr 3, 2024

Thanks, @xylar. I tried to make overrideTimeLevels operate analogously to the existing functionality in mpas_pool_clone_pool. Thanks again for your suggestions!
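For anyone following along, the intended signature is roughly the following; this is a sketch only, and the actual argument attributes in mpas_pool_routines.F may differ:

   ! Hedged sketch of the extended interface, mirroring the existing optional
   ! argument on mpas_pool_clone_pool.
   interface
      subroutine mpas_pool_copy_pool(srcPool, destPool, overrideTimeLevels)
         use mpas_derived_types, only : mpas_pool_type
         type (mpas_pool_type), pointer :: srcPool, destPool
         integer, intent(in), optional :: overrideTimeLevels
      end subroutine mpas_pool_copy_pool
   end interface

Presumably overrideTimeLevels plays the same role here as in mpas_pool_clone_pool, so that the single-time-level provis_state pool can be refreshed from the multi-time-level state pool.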

@gcapodag commented Apr 3, 2024

Hi all, thanks @sbrus89 for bringing this up. LTS and FB-LTS, which are both merged into master now, also use the same process of cloning pools. If you are going to merge this into master, please include the changes for those two time-stepping methods as well so they are not left out.

@mark-petersen commented
Passes the nightly test suite and compares BFB with the master branch point on Chicoma with optimized GNU and on Chrysalis with optimized Intel. Also passes the nightly test suite with debug GNU on Chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, it passes:

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel

@mark-petersen left a comment

@sbrus89, since this shows 'no harm done' to the solution and fixes a memory leak, please move this PR over to E3SM-Project.

@xylar (Collaborator) commented Apr 4, 2024

I agree with @gcapodag, let's include LTS and FB-LTS before this goes to E3SM.

@gcapodag commented Apr 4, 2024

Thanks everyone for your work on this. I have just submitted a job on Perlmutter to see if it fixes our problem. I'll keep you all posted.

@sbrus89 (Collaborator, Author) commented Apr 4, 2024

Sounds good, @gcapodag. I'll add these changes for LTS and FB-LTS as well.

@gcapodag commented Apr 4, 2024

Great, thank you very much @sbrus89 !!

@gcapodag commented Apr 5, 2024

Hi all, I just wanted to confirm that this fix to RK4 allowed us to run on Perlmutter for 25 days on a mesh with 4,617,372 cells and a finest resolution of 125 m near the coast. Before this fix, the run would crash with a segmentation fault around two days in, and we could only get it past that point by using around 5% of the node capacity with these srun settings: srun -N 64 -n 128 --ntasks-per-node=2 --ntasks-per-core=1. Thanks a lot to you all for figuring out a solution!

@sbrus89 (Collaborator, Author) commented Apr 5, 2024

@gcapodag, I pushed the LTS/FB-LTS changes but haven't tested them yet.

@gcapodag commented Apr 5, 2024

Thanks, @sbrus89. I just tested a 2-hour run on Perlmutter, and the changes to LTS and FB-LTS are BFB.

@sbrus89 (Collaborator, Author) commented Apr 5, 2024

Thanks very much for testing, @gcapodag! I think I'll go ahead and move this over to E3SM in that case.
