ipython variant call signal 15, using more resources than requested, controller does not die #737
Comments
Thanks for the detailed report and sorry about the issue. It looks like 2Gb might not be enough memory to calculate callable regions and detect high-depth issues in your input dataset. This happens in the initial alignment step, so you could increase the allocated memory by bumping up the specification at https://github.com/chapmanb/bcbio-nextgen/blob/master/config/bcbio_system.yaml#L27. Out of curiosity, are you only allowing 1 core for alignment and all these steps? Normally this is a multicore process, so you get enough memory through the multiple cores x 2Gb/core specification. An alternative fix is to enable use of more cores here so your engine will have more allocated memory. Hope this fixes the issue for you.
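For illustration, the relevant resources section of bcbio_system.yaml looks roughly like this (exact entries vary by bcbio version, so treat these values as placeholders); raising the per-core memory, or allowing more cores, gives each engine job more total memory:

resources:
  default:
    cores: 16     # cores available to multicore steps
    memory: 2G    # memory per core; total per job = cores x memory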
Hi Brad, These are already aligned BAM files, and I am not realigning. I will try your suggestion once I have access to the cluster again. Thanks,
Thanks Piet, hope this fixed it for you; feel free to reopen if it didn't.
Hi,
I am running a variant calling ensemble pipeline and having a spot of trouble with the ipython cluster: it appears that one of the engines gets killed for using more memory than requested. The controller and main program just hang and do not die; they do not appear to detect the failure. I am unsure which part of the pipeline caused the error or how to proceed.
I have tried to look at the log files, but find it difficult to connect the job_id with the processes that are called.
Some details:
I am running on PBSPro; currently there is no check_resource method for it, though the engines and controller still spin up. Could this cause a problem that leads to more resources being used than requested?
(I altered the bcbio_nextgen.py -s option to allow a pbspro specification.)
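For context, the launch command is along these lines (the project config name and core count here are just placeholders):

bcbio_nextgen.py ../config/project.yaml -t ipython -s pbspro -q workq -n 1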
stderr from the engine:
-bash-3.2$ tail bcbio-e.e765755
Processing reference #75 (GL000224.1)
Processing reference #76 (GL000223.1)
Processing reference #77 (GL000195.1)
Processing reference #78 (GL000212.1)
Processing reference #79 (GL000222.1)
Processing reference #81 (GL000193.1)
Processing reference #82 (GL000194.1)
Processing reference #83 (GL000225.1)
Processing reference #84 (GL000192.1)
=>> PBS: job killed: mem 2098076kb exceeded limit 2097152kb
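(The 2048mb requested for the engine corresponds to the 2,097,152 kB limit above, i.e. 2048 × 1024, so at 2,098,076 kB the engine was only about 924 kB over when PBS killed it.)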
Last log entries from ipython that mention that job number (765755, i.e. the bcbio-e.e765755 engine above):
log/ipython/log/ipcluster-d509a30c-cf8d-4bf9-b1f2-135eadc59c9f-23931.log <==
2015-01-27 08:37:40.024 [IPClusterStart] Job submitted with job id: u'765754'
2015-01-27 08:37:40.024 [IPClusterStart] Process 'qsub' started: u'765754'
2015-01-27 08:37:49.632 [IPClusterStart] Starting 1 Engines with cluster_helper.cluster.BcbioPBSPROEngineSetLauncher
2015-01-27 08:37:49.632 [IPClusterStart] Starting BcbioPBSPROEngineSetLauncher: ['qsub', u'./pbspro_engines']
2015-01-27 08:37:49.633 [IPClusterStart] adding PBS queue settings to batch script
2015-01-27 08:37:49.633 [IPClusterStart] adding job array settings to batch script
2015-01-27 08:37:49.633 [IPClusterStart] Writing batch script: ./pbspro_engines
2015-01-27 08:37:49.713 [IPClusterStart] Job submitted with job id: u'765755'
2015-01-27 08:37:49.713 [IPClusterStart] Process 'qsub' started: u'765755'
2015-01-27 08:41:49.714 [IPClusterStart] Engines appear to have started successfully
Different output, from an ipython controller log file, suggesting that process was killed (signal 15 is SIGTERM):
log/ipython/log/ipcontroller-e2c3a05d-4462-4a26-99bd-77955445cf89-15806.log <==
2015-01-27 08:35:47.485 [VMFixIPControllerApp] registration::finished registering engine 0:5613ba65-0102-4a9b-b4d1-3cf59b714501
2015-01-27 08:35:47.485 [VMFixIPControllerApp] engine::Engine Connected: 0
2015-01-27 08:35:51.671 [VMFixIPControllerApp] client::client '\x00\xaa\xf7\nZ' requested 'connection_request'
2015-01-27 08:35:51.671 [VMFixIPControllerApp] client::client ['\x00\xaa\xf7\nZ'] connected
2015-01-27 08:36:21.682 [VMFixIPControllerApp] client::client '\x00\xaa\xf7\n[' requested 'connection_request'
2015-01-27 08:36:21.683 [VMFixIPControllerApp] client::client ['\x00\xaa\xf7\n['] connected
2015-01-27 08:36:21.734 [VMFixIPControllerApp] task::task '9372bec4-0b3d-485f-919d-bea3ec103220' arrived on 0
2015-01-27 08:37:34.593 [VMFixIPControllerApp] task::task '9372bec4-0b3d-485f-919d-bea3ec103220' finished on 0
2015-01-27 08:37:36.357 [VMFixIPControllerApp] CRITICAL | Received signal 15, shutting down
2015-01-27 08:37:36.358 [VMFixIPControllerApp] CRITICAL | terminating children...
pbspro_engines submit script:
#!/bin/sh
#PBS -q workq
#PBS -V
#PBS -N bcbio-e
#PBS -l select=1:ncpus=1:mem=2048mb
cd $PBS_O_WORKDIR
/lustre/SCRATCH5/users/pjones/bcbio/anaconda/bin/python -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance()' --timeout=960 --IPEngineApp.wait_for_url_file=960 --EngineFactory.max_heartbeat_misses=120 --profile-dir="/lustre/SCRATCH5/users/pjones/data_files/... --cluster-id="d509a30c-cf8d-4bf9-b1f2-135eadc59c9f"
Last output produced on the main program's stdout:
[2015-01-27T08:28Z] cnode-7-44: Resource requests: ; memory: 1.00; cores: 1
[2015-01-27T08:28Z] cnode-7-44: Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2015-01-27T08:28Z] cnode-7-44: multiprocessing: calc_callable_loci
[2015-01-27T08:28Z] cnode-7-44: bedtools genomecov: 1 : 10b_GGACTC_trimmed_paired_downsampled_sorted_markdup-reorder-fixrgs-gatkfilter-dedup.bam
[2015-01-27T08:29Z] cnode-7-44: bedtools genomecov: 2 : 10b_GGACTC_trimmed_paired_downsampled_sorted_markdup-reorder-fixrgs-gatkfilter-dedup.bam
Any help would be much appreciated. I was hoping to write the check_resource function for PBSPro, but am finding it incredibly difficult to find appropriate documentation on the qstat/pbsnodes commands for PBSPro. It is obviously based on Torque, but I have not been able to determine the differences between the two in terms of those commands, nor what information the Torque-based check_resource function needs to extract from them in order to work.
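As a starting point, this is roughly the direction I was thinking of, assuming PBSPro's pbsnodes -a prints per-node blocks containing resources_available.ncpus and resources_available.mem lines (the function name and output format here are my assumptions, completely untested):

import re
import subprocess

def pbspro_node_resources():
    """Rough sketch: parse `pbsnodes -a` into {node: (cores, mem_gb)}.

    Assumes blocks like:
        node01
             resources_available.ncpus = 24
             resources_available.mem = 64gb
    which may differ between PBSPro versions.
    """
    out = subprocess.check_output(["pbsnodes", "-a"]).decode()
    nodes = {}
    name = None
    cores = mem_gb = None
    for line in out.splitlines():
        if line and not line.startswith((" ", "\t")):
            # A non-indented line starts a new node block; flush the previous one.
            if name and cores is not None and mem_gb is not None:
                nodes[name] = (cores, mem_gb)
            name, cores, mem_gb = line.strip(), None, None
        elif "resources_available.ncpus" in line:
            cores = int(line.split("=")[1].strip())
        elif "resources_available.mem" in line:
            # Memory is reported with a unit suffix, e.g. 64gb or 16525160kb.
            m = re.match(r"(\d+)\s*(\w+)", line.split("=")[1].strip())
            val, unit = int(m.group(1)), m.group(2).lower()
            scale = {"kb": 1.0 / (1024 * 1024), "mb": 1.0 / 1024, "gb": 1.0, "tb": 1024.0}
            mem_gb = val * scale.get(unit, 1.0 / (1024 * 1024))
    if name and cores is not None and mem_gb is not None:
        nodes[name] = (cores, mem_gb)
    return nodes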
Hope that made sense.
P