High memory usage #15

Open
mbhall88 opened this issue Aug 10, 2020 · 6 comments
Comments

@mbhall88 (Member)

Varifier's memory usage seems quite excessive. For example, I had a ~350 MB VCF that took 13 GB of RAM to complete (the most I've seen so far is 21 GB).

Here is an idea of where the problem lies (produced by memory_profiler):

```
134    238.9 MiB      0.0 MiB       recall_stats = {
135    238.9 MiB      0.0 MiB           "ALL": recall_stats_all["ALL"],
136    238.9 MiB      0.0 MiB           "FILT": recall_stats_filtered["ALL"],
137                                 }
139  12258.4 MiB  12019.5 MiB       per_record_precision = vcf_stats.per_record_stats_from_vcf_file(vcf_for_precision)
140  12258.4 MiB      0.0 MiB       precision_stats = vcf_stats.summary_stats_from_per_record_stats(
141  12258.4 MiB      0.0 MiB           per_record_precision
142                                 )
```
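
(For anyone wanting to reproduce this kind of trace: it comes from memory_profiler's line-by-line mode. A minimal sketch, with a made-up function and file path standing in for the varifier code path being profiled:)

```python
from memory_profiler import profile

@profile
def run_stats(vcf_path):
    # Stand-in for the real varifier function; when this runs, memory_profiler
    # prints the per-line memory usage and increment, as in the trace above.
    records = [line for line in open(vcf_path) if not line.startswith("#")]
    return len(records)

if __name__ == "__main__":
    run_stats("test.vcf")  # hypothetical input file
```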

Is there a way we could be more efficient with the way we get per-record stats?

@mbhall88 (Member, Author)

It seems like we load the whole VCF into memory here

header_lines, vcf_records = vcf_file_read.vcf_file_to_list(infile)

and then we create a nested dictionary for each record.

@martinghunt (Member)

Yes, could rewrite that file to not load the VCF into memory. Wasn't expecting such big VCF files.

@leoisl commented Aug 10, 2020

I am thinking about the simplest way to deal with this memory issue. Could we add a new function that returns header_lines and vcf_records as generators (the original function would then just cast the generators to lists and return those)? Or add a parameter (defaulting to False) to vcf_file_read.vcf_file_to_list() that specifies whether we want a generator instead of a list? Either way, I think we can solve the memory issue without breaking backwards compatibility (as I know many of Martin's (and others') tools rely on cluster_vcf_records). Something like the sketch below.
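
A minimal sketch of what the generator-returning variant could look like. The function names are illustrative, not the real cluster_vcf_records API, and it yields raw record lines where the real version would presumably yield parsed record objects:

```python
import gzip

def _open_vcf(path):
    # Handle plain and gzipped VCFs.
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)

def vcf_file_to_header_and_record_iter(infile):
    """Hypothetical helper: return (header_lines, record_iterator).

    The header lines are read eagerly (they are small); the data lines are
    yielded lazily, so only one record is held in memory at a time.
    """
    header_lines = []
    with _open_vcf(infile) as f:
        for line in f:
            if line.startswith("#"):
                header_lines.append(line.rstrip())
            else:
                break  # headers come first; stop the eager scan here

    def records():
        # Re-open the file so the generator is self-contained and the handle
        # is closed when iteration finishes.
        with _open_vcf(infile) as f:
            for line in f:
                if line.startswith("#"):
                    continue
                yield line.rstrip()  # real version: wrap in a VCF record object

    return header_lines, records()
```

The existing vcf_file_to_list() could then become a thin wrapper that materialises the iterator into a list, which would keep minos and the other downstream tools untouched.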

@martinghunt (Member)

Sounds good to me @leoisl. I was thinking something similar. We definitely can't break backwards compatibility; that would break minos (and maybe gramtools), which needs things in memory because of all the VCF record merging it does.

@mbhall88 (Member, Author) commented Aug 11, 2020

I don't think that fully solves the memory issue though. It's not just reading the VCF into memory that causes the memory explosion; I think creating a nested dictionary for each record also contributes.
That said, not reading the whole VCF into memory will certainly reduce the footprint.
I'll have a play today and see.

@martinghunt (Member)

Yes, I was thinking we iterate over the VCF and update a final dict of stats as we go, without storing any of the intermediate dicts.
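
A rough sketch of that streaming idea, consuming the record iterator from the earlier sketch. The stat names and the FILTER-column check are made up for illustration; varifier's real per-record stats are more involved:

```python
def summary_stats_streaming(record_iter):
    """Hypothetical streaming aggregation: update one running stats dict per
    record instead of materialising a per-record dict for every record."""
    stats = {"total_records": 0, "pass": 0, "fail": 0}
    for record in record_iter:
        stats["total_records"] += 1
        fields = record.split("\t")
        filter_field = fields[6] if len(fields) > 6 else "."
        if filter_field in (".", "PASS"):
            stats["pass"] += 1
        else:
            stats["fail"] += 1
    return stats
```

Peak memory then stays at roughly one record plus the final stats dict, regardless of how big the VCF is.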
