Fix broken gCNV Annotation pipeline #960
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##             main     #960   +/-  ##
=======================================
  Coverage   78.42%   78.42%
=======================================
  Files          10       10
  Lines        1794     1794
=======================================
  Hits         1407     1407
  Misses        387      387
=======================================
☔ View full report in Codecov by Sentry.
Co-authored-by: EddieLF <34049565+EddieLF@users.noreply.github.com>
Approving! The solution to the size issue sounds really good, and the code changes look fine. I'm not deeply familiar with this pipeline, but if it's currently broken then it's only up from here, right? Thanks Matt
Problem!
Reason!
See https://batch.hail.populationgenomics.org.au/batches/518770/jobs/358 as an example - nothing but a miserable Hail dump exception.
@EddieLF tracked the issue to the worker batch, e.g. https://batch.hail.populationgenomics.org.au/batches/518781/jobs/4
I did some poking around and tracked the problem down to here. In the gCNV and GATK-SV workflows we load up a GTF file from Gencode, parse it into a dictionary, then use that dict to update the gene symbols to ENSG IDs. I guess this dictionary grows over time, but it now contains upwards of 40k entries, each a key-value mapping of long strings. This exceeds Spark's 20MB cap when creating and evaluating expressions.
This change covers a few things:
- `chunk_size`: instead of returning a single dictionary, this gives more granular control to break that monolith up into smaller dicts
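The `chunk_size` idea can be sketched as a small generator (a hypothetical helper under assumed names, not the PR's actual implementation): split the monolithic dict into sub-dicts of bounded size, so each one stays comfortably under the expression-size cap when handed to the engine:

```python
from itertools import islice
from typing import Iterator

def chunk_dict(mapping: dict[str, str], chunk_size: int) -> Iterator[dict[str, str]]:
    """Yield the mapping as a series of dicts of at most chunk_size entries.

    Hypothetical sketch of the chunk_size approach: each smaller dict can
    then be turned into its own literal and applied in turn, instead of
    shipping one oversized dictionary in a single expression.
    """
    items = iter(mapping.items())
    # islice pulls the next chunk_size items; an empty dict ends the loop
    while chunk := dict(islice(items, chunk_size)):
        yield chunk
```

Applying the sub-dicts one at a time keeps the union of updates identical to the single-dict version, while every individual literal stays small.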