Create the provis_state subpool at RK4 initialization to avoid memory leak #87
Conversation
- Avoids create/destroy at each timestep
(force-pushed from f66f372 to 1ebf854)
@sbrus89 thanks for your work on this. In recent simulations by @jeremy-lilly and @gcapodag we had a memory leak problem with RK4 on perlmutter. We will retest, as this might take care of it! We were trying to run a 125m single-layer hurricane Sandy test, so very similar to your problem above, and it would run out of memory after a few days. I spent time fixing RK4 memory leaks back in 2019. I obviously didn't fix them all, and greatly appreciate your effort on this now. For reference, here are the old issues and PRs: I will also conduct some tests. If they all work, we can move this to E3SM.
Will compile with small fix:
(force-pushed from 1ebf854 to 12368b9)
@sbrus89, this looks great! I like the improvements you just added based on our discussion today.
Thanks @mark-petersen, hopefully this helps with @jeremy-lilly and @gcapodag's issue as well.
Thanks @xylar, I tried to make
Hi All, thanks @sbrus89 for bringing this up. LTS and FB-LTS, which are both merged into master now, also use the same process of cloning pools. If you are going to merge this into master, please include the changes for those two time stepping methods as well so that they are not left out.
Passes the nightly test suite and compares BFB with the master branch point on chicoma with optimized gnu and on chrysalis with optimized intel. Also passes the nightly test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:
In E3SM, passes
@sbrus89 since this shows 'no harm done' for the solution and fixes a memory leak, please move this PR over to E3SM-Project.
I agree with @gcapodag, let's include LTS and FB-LTS before this goes to E3SM.
Thanks everyone for your work on this. I have just submitted a job on Perlmutter to see if it fixes our problem. I'll keep you all posted.
Sounds good @gcapodag, I'll add these changes for LTS and FB-LTS as well.
Great, thank you very much @sbrus89!!
Hi All, just wanted to confirm that this fix on RK4 allowed us to run on Perlmutter for 25 days on a mesh with
(force-pushed from 12368b9 to 7d8f123)
(force-pushed from 7d8f123 to 2f49d49)
@gcapodag, I pushed the LTS/FB-LTS changes but haven't tested them yet.
Thanks @sbrus89, I just tested a 2 hr run on Perlmutter and the changes to LTS and FB-LTS are BFB.
Thanks very much for testing, @gcapodag! I think I'll go ahead and move this over to E3SM in that case.
I have noticed a memory leak in the RK4 timestepping when running 125-day single-layer barotropic tides cases with the vr45to5 mesh (MPAS-Dev/compass#802) on pm-cpu. I can typically only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the `provis_state` subpool at each timestep. We had a similar issue a few years back that required memory leak fixes in the `mpas_pool_destroy_pool` subroutine (MPAS-Dev/MPAS-Model#367). However, I believe there is still a memory leak in `mpas_pool_remove_subpool` (which calls `pool_remove_member`), which is called following `mpas_pool_destroy_pool`. The TODO comment here: https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests it is possible things aren't being completely cleaned up by this subroutine.

I'm not familiar enough with the details of the pools framework to track down the memory leak itself. However, in any case, I think it makes more sense to create the `provis_state` subpool once at initialization rather than creating and destroying it every timestep. This PR is meant to socialize this as a potential approach. The main consequence of this is that the `mpas_pool_copy_pool` subroutine needs an `overrideTimeLevel` option similar to the one used in `mpas_pool_clone_pool` under the previous approach. I've tested these changes with the vr45to5 test case, and they do allow me to run for a full 125 days.