
ipython variant call signal 15, using more resources than requested, controller does not die #737

Closed
zeneofa opened this issue Jan 27, 2015 · 3 comments

zeneofa commented Jan 27, 2015

Hi,

I am running a variant calling ensemble pipeline and having a spot of trouble with the ipython cluster: it appears that one of the engines gets killed because it uses more memory than requested. The controller and main program just hang and do not die; they appear not to detect the failure. I am unsure which part of the pipeline caused the error or how to proceed.

I have tried to look at the log files, but find it difficult to connect the job_id with the processes that are called.

Some details:

I am running on PBSPro. Currently there is no check_resource method for it, although the engines and controller still spin up. Could this lead to more resources being used than requested?

(I altered the bcbio_nextgen.py -s option to allow a pbspro specification.)

stderr from the engine:

-bash-3.2$ tail bcbio-e.e765755
Processing reference #75 (GL000224.1)
Processing reference #76 (GL000223.1)
Processing reference #77 (GL000195.1)
Processing reference #78 (GL000212.1)
Processing reference #79 (GL000222.1)
Processing reference #81 (GL000193.1)
Processing reference #82 (GL000194.1)
Processing reference #83 (GL000225.1)
Processing reference #84 (GL000192.1)
=>> PBS: job killed: mem 2098076kb exceeded limit 2097152kb

Last ipython log entries that mention that job id:

log/ipython/log/ipcluster-d509a30c-cf8d-4bf9-b1f2-135eadc59c9f-23931.log <==
2015-01-27 08:37:40.024 [IPClusterStart] Job submitted with job id: u'765754'
2015-01-27 08:37:40.024 [IPClusterStart] Process 'qsub' started: u'765754'
2015-01-27 08:37:49.632 [IPClusterStart] Starting 1 Engines with cluster_helper.cluster.BcbioPBSPROEngineSetLauncher
2015-01-27 08:37:49.632 [IPClusterStart] Starting BcbioPBSPROEngineSetLauncher: ['qsub', u'./pbspro_engines']
2015-01-27 08:37:49.633 [IPClusterStart] adding PBS queue settings to batch script
2015-01-27 08:37:49.633 [IPClusterStart] adding job array settings to batch script
2015-01-27 08:37:49.633 [IPClusterStart] Writing batch script: ./pbspro_engines
2015-01-27 08:37:49.713 [IPClusterStart] Job submitted with job id: u'765755'
2015-01-27 08:37:49.713 [IPClusterStart] Process 'qsub' started: u'765755'
2015-01-27 08:41:49.714 [IPClusterStart] Engines appear to have started successfully

Output from a different ipython log file, suggesting the process was killed:

log/ipython/log/ipcontroller-e2c3a05d-4462-4a26-99bd-77955445cf89-15806.log <==
2015-01-27 08:35:47.485 [VMFixIPControllerApp] registration::finished registering engine 0:5613ba65-0102-4a9b-b4d1-3cf59b714501
2015-01-27 08:35:47.485 [VMFixIPControllerApp] engine::Engine Connected: 0
2015-01-27 08:35:51.671 [VMFixIPControllerApp] client::client '\x00\xaa\xf7\nZ' requested 'connection_request'
2015-01-27 08:35:51.671 [VMFixIPControllerApp] client::client ['\x00\xaa\xf7\nZ'] connected
2015-01-27 08:36:21.682 [VMFixIPControllerApp] client::client '\x00\xaa\xf7\n[' requested 'connection_request'
2015-01-27 08:36:21.683 [VMFixIPControllerApp] client::client ['\x00\xaa\xf7\n['] connected
2015-01-27 08:36:21.734 [VMFixIPControllerApp] task::task '9372bec4-0b3d-485f-919d-bea3ec103220' arrived on 0
2015-01-27 08:37:34.593 [VMFixIPControllerApp] task::task '9372bec4-0b3d-485f-919d-bea3ec103220' finished on 0
2015-01-27 08:37:36.357 [VMFixIPControllerApp] CRITICAL | Received signal 15, shutting down
2015-01-27 08:37:36.358 [VMFixIPControllerApp] CRITICAL | terminating children...

pbspro_engine submit script:

#!/bin/sh
#PBS -q workq
#PBS -V
#PBS -N bcbio-e
#PBS -l select=1:ncpus=1:mem=2048mb
cd $PBS_O_WORKDIR
/lustre/SCRATCH5/users/pjones/bcbio/anaconda/bin/python -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance()' --timeout=960 --IPEngineApp.wait_for_url_file=960 --EngineFactory.max_heartbeat_misses=120 --profile-dir="/lustre/SCRATCH5/users/pjones/data_files/... --cluster-id="d509a30c-cf8d-4bf9-b1f2-135eadc59c9f"

Last output produced on the main stdout:

[2015-01-27T08:28Z] cnode-7-44: Resource requests: ; memory: 1.00; cores: 1
[2015-01-27T08:28Z] cnode-7-44: Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2015-01-27T08:28Z] cnode-7-44: multiprocessing: calc_callable_loci
[2015-01-27T08:28Z] cnode-7-44: bedtools genomecov: 1 : 10b_GGACTC_trimmed_paired_downsampled_sorted_markdup-reorder-fixrgs-gatkfilter-dedup.bam
[2015-01-27T08:29Z] cnode-7-44: bedtools genomecov: 2 : 10b_GGACTC_trimmed_paired_downsampled_sorted_markdup-reorder-fixrgs-gatkfilter-dedup.bam

Any help would be much appreciated. I was hoping to write the check_resource function for pbspro, but I find it incredibly difficult to find appropriate documentation on the qstat/pbsnodes commands for pbspro. It is obviously based on torque, but I have not been able to determine the differences between the two in terms of those commands, nor what information the torque-based check_resource function needs to extract from them in order to work.
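For reference, a rough, untested sketch of the parsing side on PBSPro. The function names here are made up and none of this is taken from the cluster_helper code; the resources_available.ncpus / resources_available.mem attribute names are what PBSPro's pbsnodes -a appears to print per node (torque's pbsnodes -a instead reports np and a physmem field inside the status line):

import re
import subprocess

def _mem_to_gb(mem_str):
    # Convert a PBSPro memory string like "264523016kb" or "256gb" to GB.
    match = re.match(r"(\d+)\s*([kmgt]?b)", mem_str.strip().lower())
    if not match:
        return 0.0
    value, unit = int(match.group(1)), match.group(2)
    factor = {"b": 1.0 / 1024 ** 3, "kb": 1.0 / 1024 ** 2,
              "mb": 1.0 / 1024, "gb": 1.0, "tb": 1024.0}
    return value * factor[unit]

def pbspro_max_resources():
    # Largest core count and memory (in GB) reported for any single node.
    out = subprocess.check_output(["pbsnodes", "-a"]).decode()
    max_cores, max_mem = 0, 0.0
    cores, mem = 0, 0.0
    for line in out.splitlines() + [""]:   # trailing "" flushes the last record
        line = line.strip()
        if not line:                        # blank line ends a node record
            max_cores, max_mem = max(max_cores, cores), max(max_mem, mem)
            cores, mem = 0, 0.0
        elif line.startswith("resources_available.ncpus"):
            cores = int(line.split("=", 1)[1])
        elif line.startswith("resources_available.mem"):
            mem = _mem_to_gb(line.split("=", 1)[1])
    return max_cores, max_mem

Presumably a pbspro check_resource would then just compare the cores and memory requested per engine against these per-node maxima and fail early, rather than letting PBS kill the job mid-run.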

Hope that made sense.

P

chapmanb (Member) commented Jan 27, 2015

Thanks for the detailed report and sorry about the issue. It looks like 2Gb might not be enough memory to calculate callable regions and detect high depth issues in your input dataset. This is in the initial alignment step, so you could increase memory allocated by bumping up the specification in the samtools section of your bcbio_system.yaml file:

https://github.com/chapmanb/bcbio-nextgen/blob/master/config/bcbio_system.yaml#L27

Out of curiosity, are you only allowing 1 core for alignment and all of these steps? Normally this is a multicore process, so you get enough memory through the multiple cores x 2Gb/core specification. An alternative fix is to enable the use of more cores here so your engine will have more allocated memory.
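For example, something along these lines in bcbio_system.yaml (illustrative values only; match the exact keys to the samtools entry in the file linked above):

resources:
  samtools:
    memory: 4G
    cores: 4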

Hope this fixes the issue for you.

zeneofa (Author) commented Jan 27, 2015

Hi Brad,

These are already aligned bam files, and I am not realigning. I am trying to test lofreq in the context of ensemble calling, as I have not yet managed to get it working with multiple cores. Freebayes also seems to require only one core.

Will try your suggestion once I have access to the cluster again.

Thanks,
Piet

roryk (Collaborator) commented Oct 3, 2015

Thanks Piet, hope this fixed it for you, feel free to reopen if it didn't.

roryk closed this as completed Oct 3, 2015