Memory for PepQuery2 #30

Open
mira-miracoli opened this issue Jul 11, 2023 · 7 comments

mira-miracoli commented Jul 11, 2023

Not 100% sure where to place this issue, but I thought it might be interesting for all users of the shared database. Otherwise I can of course move it to EU.

I am currently debugging an error in the pepquery2 tool. The job failed because the JVM ran out of memory.
When I tried to run the job locally I had to stop at 14G because my laptop (16G) started to lag.
I noticed that:

  • it uses 8 cores (from the logs) even though it was allocated 1 core
  • the index creation step is much slower on the server than on my laptop (might be storage related)
  • there is no rule for TPV (and was no rule for sortinghat)

I would like to change that, but I am not sure which values I should consider. In their documentation I found a recommendation for 8 GB of memory and 4 CPUs which is too little for at least the job I am looking at. When I tried to use gxadmin query tool-memory-per-inputs I found:

    id     |                                tool_id                                 | input_count | total_input_size_mb | mean_input_size_mb | median_input_size_mb | memory_used_mb | memory_used_per_input_mb | memory_mean_input_ratio | memory_median_input_ratio
-----------+------------------------------------------------------------------------+-------------+---------------------+--------------------+----------------------+----------------+--------------------------+-------------------------+---------------------------
 ######### | toolshed.g2.bx.psu.edu/repos/galaxyp/pepquery2/pepquery2/2.0.2+galaxy0 |           1 |                  36 |                 36 |                   36 |         283829 |                     7948 |                    7948 |                      7948

While gxadmin report job-info returned the following:

## Destination Parameters

Key | Value
--- | ---
+Group | `""`
accounting_group_user | `#####`
description | `pepquery2`
docker_memory | `3.8G`
metadata_strategy | `extended`
request_cpus | `1`
request_memory | `3.8G`
requirements | `(GalaxyGroup  ==  "compute")`
submit_request_gpus | `0`

I am now trying to figure out how to implement a rule here, and whether we have to change something in the wrapper because of the CPU usage. Since I have never used the tool myself, I would be happy about any hints from people who have some experience with it.

@mira-miracoli
Collaborator Author

I increased the memory for the erroring job to 16G and it finished.
Since I do not have enough data to come up with a sensible rule, I suggest setting the memory to 16G for now.

@mira-miracoli
Collaborator Author

mira-miracoli commented Jul 12, 2023

Unfortunately, 16G does not seem to be enough for the tool, and condor complains that the job tries to exceed the limit:

007 (44662823.000.000) 07/12 09:25:00 Shadow exception!
        Error from slot1_5@vgcnbwc-worker-c36m225-4528.novalocal: Job has gone over memory limit of 16384 megabytes. Peak usage: 16331 megabytes.
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
012 (44662823.000.000) 07/12 09:25:00 Job was held.
        Error from slot1_5@vgcnbwc-worker-c36m225-4528.novalocal: Job has gone over memory limit of 16384 megabytes. Peak usage: 16331 megabytes.
        Code 34 Subcode 0

`-Xmx16g` was set, so this should not happen. However, since this job seems to need more than 16 GB, we could try to increase the limit further.
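To compare the limit and the peak across several held jobs, the hold reason in the condor log can be parsed. This is only a minimal sketch: the function name `parse_memory_hold` is hypothetical, and the pattern is written against the exact wording of the message quoted above, so it may need adjusting if the wording differs elsewhere.

```python
import re

# Pattern for the hold reason as it appears in the log excerpt in this thread.
HOLD_RE = re.compile(
    r"memory limit of (?P<limit>\d+) megabytes\. "
    r"Peak usage: (?P<peak>\d+) megabytes"
)


def parse_memory_hold(line):
    """Return (limit_mb, peak_mb) from a hold message, or None if absent."""
    match = HOLD_RE.search(line)
    if match is None:
        return None
    return int(match.group("limit")), int(match.group("peak"))


# Example line copied from the log excerpt above.
line = (
    "Error from slot1_5@vgcnbwc-worker-c36m225-4528.novalocal: "
    "Job has gone over memory limit of 16384 megabytes. "
    "Peak usage: 16331 megabytes."
)
print(parse_memory_hold(line))
```

For the excerpt above this gives a limit of 16384 MB and a reported peak of 16331 MB.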

@mira-miracoli
Copy link
Collaborator Author

mira-miracoli commented Jul 12, 2023

I am not a Java expert, but I would assume that there is some kind of overhead on top of the 16 GB that the JVM uses as heap.

EDIT: This is what I learned so far. Since the wrapper currently defines -Xmx{mem}g, I need to change the wrapper accordingly.
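One way to account for that overhead is to cap the heap below the job's allocation instead of setting it equal to it. The sketch below is an illustration under stated assumptions, not the wrapper's actual code: the helper names are hypothetical, the 20% / 512 MB headroom values are guesses rather than measurements, and it assumes the job's allocation is available in the `GALAXY_MEMORY_MB` environment variable.

```python
import os


def heap_mb_for(total_mb, overhead_percent=20, min_overhead_mb=512):
    """Heap size (MB) that leaves room for non-heap JVM memory.

    overhead_percent and min_overhead_mb are guesses, not measured values.
    """
    overhead_mb = max(min_overhead_mb, total_mb * overhead_percent // 100)
    return max(1, total_mb - overhead_mb)


def java_options(total_mb):
    """Build a _JAVA_OPTIONS value with the heap capped below the allocation."""
    return f"-Xmx{heap_mb_for(total_mb)}m"


# GALAXY_MEMORY_MB is assumed to hold the job's memory allocation in MB;
# fall back to 16384 (the 16G used in this thread) when it is unset.
allocated_mb = int(os.environ.get("GALAXY_MEMORY_MB", "16384"))
print(java_options(allocated_mb))
```

With a 16384 MB allocation and these guessed defaults, the heap would be capped at 13108 MB, which leaves about 3.2 GB for everything the JVM allocates outside the heap.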

@nuwang
Member

nuwang commented Jul 13, 2023

Looks like it supports a -cpu parameter: http://www.pepquery.org/document.html#saparameter
and the wrapper will indeed need to be modified.

@mira-miracoli
Collaborator Author

If no one objects, I would open a PR with that set accordingly in _JAVA_OPTIONS.

> Looks like it supports a -cpu parameter: http://www.pepquery.org/document.html#saparameter and the wrapper will indeed need to be modified.

Yes, by default it uses all cores available to it, but it would be cleaner to set -cpu explicitly, I guess.
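A minimal sketch of how the core count could be passed through, assuming the number of allocated cores is available in the `GALAXY_SLOTS` environment variable and that `-cpu` takes an integer argument (the helper name `cpu_args` is hypothetical, and the exact argument form should be checked against the PepQuery documentation linked above):

```python
import os


def cpu_args(environ=None):
    """Return ["-cpu", "<n>"] using the allocated slots, defaulting to 1."""
    env = os.environ if environ is None else environ
    try:
        slots = int(env.get("GALAXY_SLOTS", "1"))
    except ValueError:
        # Unparseable value: fall back to a single core.
        slots = 1
    return ["-cpu", str(max(1, slots))]


print(cpu_args({"GALAXY_SLOTS": "4"}))
```

Falling back to one core when the variable is missing matches the `request_cpus | 1` allocation shown earlier, so the tool would no longer take all 8 cores on a 1-core slot.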

@mira-miracoli
Collaborator Author

I got this from Galaxy, probably because I ran the job manually and it was still being watched by the Galaxy handlers:

Job Metrics
cgroup

CPU Time | 2 hours and 54 minutes
-- | --
Failed to allocate memory count | 0E-7
Memory limit on cgroup (MEM) | 48.0 GB
Max memory usage (MEM) | 17.5 GB
Memory limit on cgroup (MEM+SWP) | 8.0 EB
Max memory usage (MEM+SWP) | 17.5 GB
OOM Control enabled | No
Was OOM Killer active? | No
Memory softlimit on cgroup | 0 bytes

...

Destination Parameters

Runner | condor
-- | --
Runner Job ID | 44667975
Handler | handler_sn06_3
+Group | ""
accounting_group_user | 55103
description | pepquery2
docker_memory | 16G
metadata_strategy | extended
request_cpus | 1
request_memory | 16G
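The metrics above show a peak of 17.5 GB against a 16G request. As a rough illustration of turning an observed peak into a request, the sketch below pads the peak by a margin and rounds up to whole GB. The function name and the 20% margin are assumptions, 1 GB is taken as 1024 MB, and a single job is of course not enough data for a real rule.

```python
def suggested_request_gb(peak_mb, margin_percent=20):
    """Pad the observed peak by a margin and round up to whole GB.

    Integer arithmetic only, so the rounding is exact.
    """
    # Ceiling division via negated floor division.
    padded_mb = -(-peak_mb * (100 + margin_percent) // 100)
    return -(-padded_mb // 1024)


# 17.5 GB peak from the cgroup metrics above, expressed in MB.
peak_mb = 17920
print(suggested_request_gb(peak_mb))
```

For this one job, the guessed 20% margin would give a request of 21 GB.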

@mira-miracoli
Collaborator Author

The job is stopped by condor for exceeding its memory limit:

007 (xxxxxxxxx.000.000) 07/14 11:35:09 Shadow exception!
        Error from slot1_8@vgcnbwc-worker-xxxxxxx: Job has gone over memory limit of 16384 megabytes. Peak usage: 16328 megabytes.
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
012 (xxxxxxxxxxxx.000.000) 07/14 11:35:09 Job was held.
        Error from slot1_8@vgcnbwc-worker-xxxxxxxxl: Job has gone over memory limit of 16384 megabytes. Peak usage: 16328 megabytes.
        Code 34 Subcode 0
...
013 (xxxxxxxxxx.000.000) 07/14 11:45:02 Job was released.
        via condor_release (by user galaxy)

Here is a PR to increase it
