Memory in peakcalling checkBams #78

Open
IanSudbery opened this issue Nov 5, 2018 · 8 comments
@IanSudbery
Contributor

The function checkBams, called by cgatpipelines.tasks.filterBams, is seg-faulting or raising MemoryError. One solution is obviously to increase the amount of RAM, but this seems like it will be a recurring problem as files get bigger.

The problem appears to be that checkBams builds a dictionary of all the reads. Since the in-memory structure holds decompressed BAM records, it cannot take less space than the decompressed BAM file would. With the average compressed BAM file in the tens of GB, this is never going to work with real data of any size.

checkBams seems to check that the filtering has worked ... in what situation would this not be true? Surely we have to trust that, for example, MarkDuplicates does what it says it will do?
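
Roughly, the pattern being described boils down to something like the following (an illustrative sketch only, not the exact pipeline code; the file path is a placeholder):

import collections
import pysam

infile = "example_filtered.bam"  # placeholder path, purely illustrative

d = collections.defaultdict(list)
with pysam.AlignmentFile(infile, "rb") as samfile:
    for read in samfile.fetch():
        # every decoded AlignedSegment stays alive in the dict, so memory
        # grows with the decompressed size of the whole BAM file
        d[read.query_name].append(read)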

@Acribbs
Contributor

Acribbs commented Dec 17, 2018

So you suggest that we remove checkBams?

I actually quite like the functionality of this code though because it generates an output table showing where the majority of your reads are filtered.

Since peakcalling is your baby, what's your opinion @Charlie-George?

@IanSudbery
Contributor Author

I personally think that any algorithm that tries to hold a whole BAM file in memory is a non-starter. I agree that it's useful to know at which stages reads are being filtered, but there needs to be a better way than this.

@Acribbs
Contributor

Acribbs commented Dec 17, 2018

Ah, I see your point. It seems like this is the offending line:

samfile = pysam.AlignmentFile(infile, 'rb')

How about a workaround:

with pysam.AlignmentFile(infile, 'rb') as samfile:
    # ... continue with the rest of the code

Then the iterator works on the object and not over the whole file; is this correct? Please tell me if I'm talking crap.

@IanSudbery
Contributor Author

The offending lines are these:

for read in samfile.fetch():
    d[read.query_name].append(read)
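
For comparison, here is a minimal sketch of how the same kind of filtering summary could be produced in a single streaming pass without keeping any reads in memory. This is illustrative only, not the current checkBams; the category names and the MAPQ threshold are assumptions:

import collections
import pysam

def count_filter_categories(infile, min_mapq=10):
    """Stream over a BAM once and tally reads per filter category."""
    counts = collections.Counter()
    with pysam.AlignmentFile(infile, "rb") as samfile:
        for read in samfile.fetch(until_eof=True):
            counts["total"] += 1
            if read.is_unmapped:
                counts["unmapped"] += 1
            if read.is_secondary:
                counts["secondary"] += 1
            if read.is_duplicate:
                counts["duplicate"] += 1
            if read.mapping_quality < min_mapq:
                counts["low_mapq"] += 1
    return counts

Memory stays constant regardless of BAM size, since only the counters are kept, and the counts could be written out as the summary table.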

@Charlie-George
Contributor

Yep, I totally agree, it's far too memory-consuming.

I think the function's useful though, especially when we were checking we had the flags correct etc.; however, now it's stable we could take it out...

As a quick workaround we added in a memory option. My longer-term plan was to rewrite it to be more memory efficient, which hasn't been a top priority and won't be until after Christmas, when I was planning to spend some time on this and actually split the bam-filtering step out of the peakcalling pipeline into a bamprocessing pipeline for filtering, merging and downsampling bams, in which case I think it would be useful to have the checkBams function there.

@IanSudbery
Contributor Author

Thought I'd do a quick check to see quite how memory-consuming it is. I tried to load a 2 GB BAM file (about 15 million reads) the same way on my desktop. This has now consumed over 20 GB of RAM, and is still going (very, very slowly, as the machine is now thrashing the disk).

So if we are talking more than 10x the size of the BAM file, and say a simple ChIP-seq experiment has 9 samples, you'd need to find a lot of memory for what should be a fairly simple operation.
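
As rough arithmetic on those numbers (assuming the 10x factor holds for larger files): 2 GB x 10 per sample is over 20 GB, so nine samples checked in parallel would need on the order of 180 GB of RAM.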

@Acribbs
Contributor

Acribbs commented Mar 12, 2019

What was the consensus on checkBams?

@Charlie-George
Contributor

It's on the TODO list to change.
