High memory usage #15

Open
mbhall88 opened this issue Aug 10, 2020 · 6 comments
Comments

@mbhall88 (Member)

Varifier's memory usage seems quite excessive. For example, I had a ~350 MB VCF that took 13 GB of RAM to complete (the most I've seen so far is 21 GB).

Here is an idea of where the problem lies (produced by memory_profiler):

```
134    238.9 MiB      0.0 MiB       recall_stats = {
135    238.9 MiB      0.0 MiB           "ALL": recall_stats_all["ALL"],
136    238.9 MiB      0.0 MiB           "FILT": recall_stats_filtered["ALL"],
137                                 }
139  12258.4 MiB  12019.5 MiB       per_record_precision = vcf_stats.per_record_stats_from_vcf_file(vcf_for_precision)
140  12258.4 MiB      0.0 MiB       precision_stats = vcf_stats.summary_stats_from_per_record_stats(
141  12258.4 MiB      0.0 MiB           per_record_precision
142                                 )
```
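
(For anyone wanting to reproduce this kind of trace: it comes from memory_profiler's line-by-line mode. A minimal sketch, with a made-up function and file path standing in for the varifier code path being profiled:)

```python
from memory_profiler import profile

@profile
def run_stats(vcf_path):
    # Stand-in for the real varifier function; when this runs, memory_profiler
    # prints the per-line memory usage and increment, as in the trace above.
    records = [line for line in open(vcf_path) if not line.startswith("#")]
    return len(records)

if __name__ == "__main__":
    run_stats("test.vcf")  # hypothetical input file
```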

Is there a way we could be more efficient with the way we get per-record stats?

@mbhall88 (Member, Author)

It seems like we load the whole VCF into memory here

header_lines, vcf_records = vcf_file_read.vcf_file_to_list(infile)

and then we create a nested dictionary for each record.

@martinghunt (Member)

Yes, could rewrite that file to not load the VCF into memory. Wasn't expecting such big VCF files.

@leoisl commented Aug 10, 2020

I am thinking about the simplest way to deal with this memory issue. Could we add a new function that returns header_lines and vcf_records as generators (the original function would then just cast the generators to lists and return those)? Or add a parameter (defaulting to False) to vcf_file_read.vcf_file_to_list() that specifies whether we want a generator instead of a list? Either way, I think we can solve the memory issue without breaking backwards compatibility (as I know many of Martin's (and others') tools rely on cluster_vcf_records). Something like the sketch below.
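
A minimal sketch of what the generator-returning variant could look like. The function names are illustrative, not the real cluster_vcf_records API, and it yields raw record lines where the real version would presumably yield parsed record objects:

```python
import gzip

def _open_vcf(path):
    # Handle plain and gzipped VCFs.
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)

def vcf_file_to_header_and_record_iter(infile):
    """Hypothetical helper: return (header_lines, record_iterator).

    The header lines are read eagerly (they are small); the data lines are
    yielded lazily, so only one record is held in memory at a time.
    """
    header_lines = []
    with _open_vcf(infile) as f:
        for line in f:
            if line.startswith("#"):
                header_lines.append(line.rstrip())
            else:
                break  # headers come first; stop the eager scan here

    def records():
        # Re-open the file so the generator is self-contained and the handle
        # is closed when iteration finishes.
        with _open_vcf(infile) as f:
            for line in f:
                if line.startswith("#"):
                    continue
                yield line.rstrip()  # real version: wrap in a VCF record object

    return header_lines, records()
```

The existing vcf_file_to_list() could then become a thin wrapper that materialises the iterator into a list, which would keep minos and the other downstream tools untouched.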

@martinghunt (Member)

Sounds good to me @leoisl. I was thinking something similar. We definitely can't break backwards compatibility; that would break minos (and maybe gramtools), which needs things in memory because of all the VCF record merging it does.

@mbhall88 (Member, Author) commented Aug 11, 2020

I don't think that fully solves the memory issue though. It's not just reading the VCF into memory that causes the memory explosion; I think creating a nested dictionary for each record also contributes.
That said, not reading the whole VCF into memory will certainly reduce the footprint.
I'll have a play today and see.

@martinghunt (Member)

Yes, I was thinking we iterate over the VCF and update a final dict of stats as we go, without storing any of the intermediate dicts.
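
A rough sketch of that streaming idea, consuming the record iterator from the earlier sketch. The stat names and the FILTER-column check are made up for illustration; varifier's real per-record stats are more involved:

```python
def summary_stats_streaming(record_iter):
    """Hypothetical streaming aggregation: update one running stats dict per
    record instead of materialising a per-record dict for every record."""
    stats = {"total_records": 0, "pass": 0, "fail": 0}
    for record in record_iter:
        stats["total_records"] += 1
        fields = record.split("\t")
        filter_field = fields[6] if len(fields) > 6 else "."
        if filter_field in (".", "PASS"):
            stats["pass"] += 1
        else:
            stats["fail"] += 1
    return stats
```

Peak memory then stays at roughly one record plus the final stats dict, regardless of how big the VCF is.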
