Memory in peakcalling checkBams #78

Open
IanSudbery opened this issue Nov 5, 2018 · 8 comments
@IanSudbery
Contributor

The function checkBams, called by cgatpipelines.tasks.filterBams, is seg-faulting or raising MemoryError. One solution is obviously to increase the amount of RAM, but this seems like it will be a recurring problem as files get bigger.

The problem appears to be that checkBams builds a dictionary of all the reads. Since the in-memory structure holds decompressed BAM records, it cannot take less space than the decompressed BAM file would. With the average compressed BAM file in the tens of GB, this is never going to work with real data of any size.

checkBams seems to check that the filtering has worked ... in what situation would this not be true? Surely we have to trust that, for example, MarkDuplicates does what it says it will do?
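
Roughly, the pattern being described boils down to something like the following (an illustrative sketch only, not the exact pipeline code; the file path is a placeholder):

import collections
import pysam

infile = "example_filtered.bam"  # placeholder path, purely illustrative

d = collections.defaultdict(list)
with pysam.AlignmentFile(infile, "rb") as samfile:
    for read in samfile.fetch():
        # every decoded AlignedSegment stays alive in the dict, so memory
        # grows with the decompressed size of the whole BAM file
        d[read.query_name].append(read)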

@Acribbs
Contributor

Acribbs commented Dec 17, 2018

So you suggest that we remove checkBams?

I actually quite like the functionality of this code though because it generates an output table showing where the majority of your reads are filtered.

Since peakcalling is your baby, what's your opinion @Charlie-George?

@IanSudbery
Contributor Author

I personally think that any algorithm that tries to hold a whole BAM file in memory is a non-starter. I agree that it's useful to know at which stages reads are being filtered, but there needs to be a better way than this.

@Acribbs
Contributor

Acribbs commented Dec 17, 2018

Ah, I see your point. It seems like this is the offending line:

samfile = pysam.AlignmentFile(infile, 'rb')

How about a workaround:

with pysam.AlignmentFile(infile, 'rb') as samfile:
    # ... continue with the rest of the code

Then the iterator works on the object and not over the whole file; is this correct? Please tell me if I'm talking crap.

@IanSudbery
Contributor Author

The offending lines are these:

for read in samfile.fetch():
    d[read.query_name].append(read)
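
For comparison, here is a minimal sketch of how the same kind of filtering summary could be produced in a single streaming pass without keeping any reads in memory. This is illustrative only, not the current checkBams; the category names and the MAPQ threshold are assumptions:

import collections
import pysam

def count_filter_categories(infile, min_mapq=10):
    """Stream over a BAM once and tally reads per filter category."""
    counts = collections.Counter()
    with pysam.AlignmentFile(infile, "rb") as samfile:
        for read in samfile.fetch(until_eof=True):
            counts["total"] += 1
            if read.is_unmapped:
                counts["unmapped"] += 1
            if read.is_secondary:
                counts["secondary"] += 1
            if read.is_duplicate:
                counts["duplicate"] += 1
            if read.mapping_quality < min_mapq:
                counts["low_mapq"] += 1
    return counts

Memory stays constant regardless of BAM size, since only the counters are kept, and the counts could be written out as the summary table.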

@Charlie-George
Contributor

Yep, I totally agree, it's far too memory-consuming.

I think the function's useful though, especially when we were checking we had the flags correct etc.; however, now it's stable we could take it out...

As a quick workaround we added in a memory option. My longer-term plan was to rewrite it to be more memory efficient, which hasn't been a top priority and won't be until after Christmas, when I was planning to spend some time on this and actually split the bam-filtering step out of the peakcalling pipeline into a bamprocessing pipeline for filtering, merging and downsampling bams, in which case I think it would be useful to have the checkBams function there.

@IanSudbery
Contributor Author

Thought I'd do a quick check to see quite how memory-consuming it is. I tried to load a 2 GB BAM file (about 15 million reads) the same way on my desktop. This has now consumed over 20 GB of RAM, and is still going (very, very slowly, as the machine is now thrashing the disk).

So if we are talking more than 10x the size of the BAM file, and say a simple ChIP-seq experiment has 9 samples, you'd need to find a lot of memory for what should be a fairly simple operation.
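
As rough arithmetic on those numbers (assuming the 10x factor holds for larger files): 2 GB x 10 per sample is over 20 GB, so nine samples checked in parallel would need on the order of 180 GB of RAM.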

@Acribbs
Contributor

Acribbs commented Mar 12, 2019

What was the consensus on checkBams?

@Charlie-George
Contributor

It's on the TODO list to change.
