[tpv_db_optimizer] Automatically update resource requests based on usegalaxy historical data #64
base: main
Conversation
At first glance, I felt:
Thanks a lot @nuwang ... I tried adding a few comments and flagged a few tools that I find suspicious. Not sure how we should proceed with those.
   testtoolshed.g2.bx.psu.edu/repos/simon-gladman/phyloseq_filter/biom_filter/.*:
-    mem: 60.8
+    mem: 58.16
@nuwang are those values the RAW values? Or did you already add some wiggle?
I would propose rounding it up a bit and maybe adding a comment like # raw recom: 58.16, or something similar.
Yes, there's 5% wiggle room: https://github.com/nuwang/tpv-db-optimizer/blob/0cc2ba687b709120295476fd7da1f1500a551bdd/mem-optimize.py#L132
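Roughly, the idea boils down to something like the following sketch (not the actual mem-optimize.py code linked above; function and variable names are made up for illustration):

```python
def recommend_mem_gb(max_observed_gb: float, margin: float = 0.05) -> float:
    # Pad the maximum observed cgroup memory usage with a small safety margin.
    return round(max_observed_gb * (1 + margin), 2)

# With the 5% margin, a raw observed maximum of ~55.39 GB becomes the 58.16 shown above.
print(recommend_mem_gb(55.39))  # 58.16
```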
@@ -157,13 +268,13 @@ tools:
     mem: 60
   toolshed.g2.bx.psu.edu/repos/bgruening/hicexplorer_chicqualitycontrol/hicexplorer_chicqualitycontrol/.*:
     cores: 20
-    mem: 60
+    mem: 9.01
This is interesting ... I'm a bit sceptical about that, but maybe we should try.
@@ -271,24 +400,46 @@ tools:
     cores: 16
     mem: 36
   toolshed.g2.bx.psu.edu/repos/bgruening/rdock_sort_filter/rdock_sort_filter/.*:
-    mem: 90
+    mem: 4.67
I would assume we had a reason for this. Do we take the filesize into account? I'm surprised, but I guess we just try again.
   toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_sort_filter/rxdock_sort_filter/.*:
-    mem: 90
+    mem: 4.67
See above.
     mem: 2.51
   toolshed.g2.bx.psu.edu/repos/devteam/cummerbund_to_tabular/cummerbund_to_cuffdiff/.*:
     mem: 0.48
   toolshed.g2.bx.psu.edu/repos/devteam/data_manager_bowtie2_index_builder/bowtie2_index_builder_data_manager/.*:
I would take the max for data_managers, or create rules.
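One way the optimizer could special-case these (a hypothetical sketch, not part of tpv-db-optimizer; the function and arguments are invented for illustration):

```python
def recommended_mem_gb(tool_id: str, observed_max_gb: float, current_gb: float) -> float:
    # Data manager runs (e.g. index builders) are rare and vary hugely with the
    # genome being built, so don't lower them automatically based on history.
    if "data_manager" in tool_id:
        return max(observed_max_gb, current_gb)
    return observed_max_gb
```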
@@ -780,7 +1248,7 @@ tools:
     mem: 8
   toolshed.g2.bx.psu.edu/repos/galaxyp/openms_idfileconverter/IDFileConverter/.*:
     cores: 4
-    mem: 8
+    mem: 0.64
Flagging this one ... looks abnormal.
@@ -898,7 +1372,7 @@ tools:
     cores: 4
     mem: 8
   toolshed.g2.bx.psu.edu/repos/galaxyp/openms_openswathworkflow/OpenSwathWorkflow/.*:
-    mem: 156
+    mem: 14.01
Ouch, I guess this one needs a rule?
@@ -2342,29 +3255,43 @@ tools:
     mem: 20
   toolshed.g2.bx.psu.edu/repos/iuc/vardict_java/vardict_java/.*:
     cores: 2
-    mem: 128
+    mem: 36.5
We had a reason for this ... if only I could remember it. A rule is probably needed here.
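For cases like this, instead of a flat value the optimizer could emit an input-size-conditional entry, keeping the old generous allocation for large inputs. A hypothetical sketch of how that decision might be made (thresholds and helper names are invented, not from tpv-db-optimizer):

```python
def build_tool_entry(observed_max_gb: float, current_gb: float,
                     large_input_gb: float = 10.0) -> dict:
    # Start from the recently observed maximum.
    entry = {"mem": round(observed_max_gb, 2)}
    # If the existing allocation is much larger, keep it for big inputs via a rule.
    if current_gb > 2 * observed_max_gb:
        entry["rules"] = [{"if": f"input_size >= {large_input_gb}",
                           "mem": current_gb}]
    return entry

# e.g. vardict_java: observed max 36.5 GB vs. the current 128 GB allocation
print(build_tool_entry(36.5, 128))
```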
@@ -2406,22 +3383,171 @@ tools:
   toolshed.g2.bx.psu.edu/repos/peterjc/seq_select_by_id/seq_select_by_id/.*:
     mem: 8
   toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/signalp3/.*:
-    mem: 10
+    mem: 0.67
flag
@@ -2441,35 +3569,63 @@ tools:
     cores: 10
     mem: 20
   toolshed.g2.bx.psu.edu/repos/rnateam/paralyzer/paralyzer/.*:
-    mem: 8
+    mem: 0.59
flag
Thanks @sanjaysrikakulam and @bgruening for reviewing - this feedback is super useful for identifying cases to zoom into. Currently, the memory requirement is calculated from the max recorded cgroup memory usage across the past year. Only one year was used because older cgroup metrics are quite iffy - some of the older ones seem to capture the entire node, yielding incorrect results. The ones with dramatically lower max values than currently recorded do need further investigation, but in my previous spot checks they were mostly correct, at least as far as the cgroup metrics were concerned. I will check the ones you have flagged in greater detail and report back. I think there's still a fair bit more work to do to make sure we are recording correct metrics across the entire federation for CPU, memory and, if possible, perhaps even GPU. We should also merge this outstanding PR soon so we can more easily track actual TPV mem allocations: https://github.com/galaxyproject/tpv-shared-database/pull/63/files
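For context, the kind of query this boils down to looks roughly like the sketch below. It is illustrative only - the actual queries and materialized views are defined in the tpv-db-optimizer repo, and the metric names depend on the cgroup plugin and cgroups version in use.

```python
# Illustrative only: per-tool maximum cgroup memory over the past year, based on
# Galaxy's job and job_metric_numeric tables. Metric names vary (cgroups v1 vs v2).
MAX_MEM_PER_TOOL = """
SELECT j.tool_id,
       MAX(jmn.metric_value) AS max_memory_bytes
FROM job j
JOIN job_metric_numeric jmn ON jmn.job_id = j.id
WHERE jmn.plugin = 'cgroup'
  AND jmn.metric_name IN ('memory.peak', 'memory.max_usage_in_bytes')
  AND j.state = 'ok'
  AND j.create_time >= NOW() - INTERVAL '1 year'
GROUP BY j.tool_id;
"""
```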
I agree with your point - we don't want to cause mass failures and create a huge admin headache. Perhaps we can take a more incremental approach: experimentally lower just a few tools and see how we do. The problem with a minimum allocation of, say, 4 GB is that frequently used tools that don't actually require much memory add up to substantial wastage in aggregate.
Agree, those are really weird. One possibility that comes to mind is that the captured metrics are somehow wrong. Another is that the tool just wasn't used that much during the past year, and whatever it was used for didn't need much memory. I'll investigate the ones you've both flagged and see what the older data says across the federation.
An incremental approach would be a nice way to move forward. I agree about the aggregate wastage that could arise from setting a minimum allocation of 4 GB, for example. Maybe we can start with a minimum allocation of 2.5 GB and then increase it if and when necessary, case by case.
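In generator terms that would just be a floor applied to the recommendation, along these lines (a sketch; the 2.5 GB default is simply the value proposed above):

```python
def clamp_mem_gb(recommended_gb: float, floor_gb: float = 2.5) -> float:
    # Apply the proposed minimum so very small recommendations (0.48, 0.59 GB, ...)
    # don't cause avoidable failures, while still reclaiming most over-allocation.
    return max(recommended_gb, floor_gb)
```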
Yup, I had the same thought regarding the metrics. I was unaware that the calculations were limited to the last year's data. It could be either of these. Thank you!
@nuwang What do you think of the following idea?
@sanjaysrikakulam That sounds good to me. It'll take me a bit of time to address No. 1 because of some other pressing commitments, but I'll ping you when it's done. I like the plan you've outlined in No. 2. One thing: we should record the date we go live so that we can track actual vs. expected gains. I'm hoping we can collectively author a paper on this.
@nuwang another thing that I wanted to mention but always forget: when we last looked at this and figured out that our cgroup values were off, we integrated https://usegalaxy.eu/?tool_id=stress_ng ... with this tool you can request memory and CPU exactly. Our idea was to run it with different params and check what cgroups and Galaxy report. Maybe that is also useful for you.
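As a rough illustration of that calibration idea (a sketch, not the EU setup; the stress-ng flags are the standard ones, and the cgroup file names depend on cgroups v1 vs v2):

```python
import subprocess

def calibrate(vm_bytes: str = "4G", seconds: int = 30) -> None:
    # Allocate a known amount of memory (and burn two CPUs) for a fixed time.
    subprocess.run(
        ["stress-ng", "--cpu", "2", "--vm", "1",
         "--vm-bytes", vm_bytes, "--timeout", f"{seconds}s"],
        check=True,
    )
    # Afterwards, compare against the peak recorded by the job's cgroup
    # (memory.peak on cgroups v2, memory.max_usage_in_bytes on v1) and
    # against what Galaxy's cgroup job metrics plugin stored for the job.
```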
Sure! Let me know.
Certainly! We need to track and monitor. A paper would be great - I look forward to it!
I tried the Galaxy command_line:
I am not sure why this could be. Also, I do not see the
-- Edit --
Cgroup snippet from the Galaxy job script:
We have CGroups V2 in use on the workers. Since we do not store the JWDs of successful jobs in the EU, I will test this on the ESG Galaxy instance to see whether the file |
Thanks both for pointing to this. For Slurm, it's also necessary to configure cgroup settings so that a new cgroup is created before executing the job: usegalaxy-au/infrastructure#1374. I imagine something similar is necessary for HTCondor? You're still using that, right?
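For reference - I haven't checked the linked issue, so this is only the generic shape of such a change, not necessarily what was done there - the usual Slurm settings involved are along these lines:

```
# slurm.conf (illustrative; the actual change is in the linked issue)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```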
Automatically update resource requests based on usegalaxy historical data.
Needs manual vetting.
This repository contains the code for generating automated updates: https://github.com/nuwang/tpv-db-optimizer
Once we refine the details, we can transfer the repo over to the galaxyproject.
These scripts rely on the existence of some materialized views in the usegalaxy.* databases, which are defined here:
https://github.com/nuwang/tpv-db-optimizer/blob/0cc2ba687b709120295476fd7da1f1500a551bdd/views.py#L2
The script is therefore currently run against a consolidated database that was manually assembled. It would probably be better to run it directly against the usegalaxy.* databases, assuming the materialized views exist and are periodically updated.
These views could potentially be used for gxy.io/KUI as well.
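If the views do get created on each usegalaxy.* database, keeping them fresh could be as simple as a scheduled refresh, roughly like the sketch below (the view name here is illustrative; the real definitions are the ones in views.py linked above):

```python
import psycopg2

def refresh_views(dsn: str, views=("tool_memory_usage",)) -> None:
    # Periodically refresh the materialized views the optimizer reads from.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for view in views:
            cur.execute(f"REFRESH MATERIALIZED VIEW {view};")
```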