Fix broken gCNV Annotation pipeline #960

Merged
merged 8 commits into main from get_gencode_annotation_file_once on Oct 28, 2024

Conversation

@MattWellie (Contributor) commented on Oct 28, 2024

Problem!

  • We can't complete gCNV runs anymore; they're broken

Reason!

See https://batch.hail.populationgenomics.org.au/batches/518770/jobs/358 as an example: nothing but a miserable Hail exception dump.

@EddieLF tracked the issue to the worker batch, e.g. https://batch.hail.populationgenomics.org.au/batches/518781/jobs/4

Caused by: com.fasterxml.jackson.core.exc.StreamConstraintsException: 
String length (20051112) exceeds the maximum length (20000000)

I did some poking around and tracked the problem down to here. In the gCNV and GATK-SV workflows we load a GTF file from Gencode, parse it into a dictionary, then use that dict to update gene symbols to ENSG IDs. This dictionary grows over time, and it now contains upwards of 40k entries, each a key: value mapping of long strings. Serialised into a single expression, this exceeds Spark's 20MB string cap when creating and evaluating expressions.
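For illustration only (not the repo's actual code), this is roughly the failing pattern, with a toy MatrixTable and a one-entry dict standing in for the real inputs:

```python
import hail as hl

# Toy reproduction of the failure mode: the whole symbol -> ENSG dict is
# inlined as a single Hail literal. At ~40k long-string entries the
# serialised expression passes 20MB, triggering the jackson
# StreamConstraintsException above.
gene_map = {'DDX11L1': 'ENSG00000223972'}  # imagine ~40k of these entries
mt = hl.balding_nichols_model(1, 5, 5)     # stand-in MatrixTable
mt = mt.annotate_rows(gene_symbol='DDX11L1')
mapping = hl.literal(gene_map)
mt = mt.annotate_rows(gene_id=mapping.get(mt.gene_symbol, 'unknown'))
```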

This change covers a few things:

  1. We currently fetch the Gencode GTF file each time AnnotateCohort runs in the gCNV or GATK-SV pipelines, copy it to a local file, then throw it away. This PR adds a script we can run to fetch a specific version of the file and copy it into the common bucket (sketched after this list). Its location is added as a config entry for the gCNV and GATK-SV pipelines, which should save a few minutes each run.
  2. The GTF parser gains an optional chunk_size argument. Instead of returning a single dictionary, it can break that monolith up into smaller dicts (see the parser sketch below).
  3. Using this collection of smaller dicts in the gCNV pipeline, we annotate with each fragment in turn, then write a checkpoint (see the loop sketch below). Each checkpoint forces Hail's lazy evaluation, so the expression being passed around stays tiny. The MTs are generally only a few MB in size, so writing multiple checkpoints isn't an issue. GATK-SV hasn't failed this way yet, so I've not made the same edits there.
  4. The AnnotateCohort/Dataset script is currently run using query_command, and I really hate that, so I've written a command-line ArgParse entrypoint (sketched below); it now works as a standalone script.
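A minimal sketch of what the one-off fetch script could look like, assuming the standard Gencode download URL and a gsutil copy into the common bucket (names and paths here are illustrative, not the PR's exact code):

```python
import argparse
import subprocess
import urllib.request

# Assumed Gencode FTP layout; the real script may build the URL differently.
GENCODE_URL = (
    'https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/'
    'release_{v}/gencode.v{v}.annotation.gtf.gz'
)


def main() -> None:
    parser = argparse.ArgumentParser(
        description='Fetch a pinned Gencode GTF into the common bucket'
    )
    parser.add_argument('--version', required=True, help='Gencode release, e.g. 44')
    parser.add_argument('--dest', required=True, help='gs:// path in the common bucket')
    args = parser.parse_args()

    local = f'gencode.v{args.version}.annotation.gtf.gz'
    urllib.request.urlretrieve(GENCODE_URL.format(v=args.version), local)
    # one copy into the bucket; pipeline runs then read it from config
    subprocess.run(['gsutil', 'cp', local, args.dest], check=True)


if __name__ == '__main__':
    main()
```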
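The chunked parser could look something like this sketch (function and argument names are assumptions; chunk_size=None preserves the old single-dict behaviour, just wrapped in a list):

```python
import gzip
import re

# pulls key "value" pairs out of the GTF attributes column
ATTR_RE = re.compile(r'(\w+) "([^"]+)"')


def parse_gtf_to_dicts(gtf_path: str, chunk_size: int | None = None) -> list[dict[str, str]]:
    """Map gene symbol -> ENSG ID, split into dicts of at most chunk_size entries."""
    chunks: list[dict[str, str]] = [{}]
    with gzip.open(gtf_path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):
                continue
            fields = line.split('\t')
            if fields[2] != 'gene':
                continue
            attributes = dict(ATTR_RE.findall(fields[8]))
            if chunk_size and len(chunks[-1]) >= chunk_size:
                chunks.append({})  # start a new fragment once the current one is full
            # strip the ENSG version suffix, e.g. ENSG00000223972.5 -> ENSG00000223972
            chunks[-1][attributes['gene_name']] = attributes['gene_id'].split('.')[0]
    return chunks
```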
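And the per-fragment annotation loop, roughly as described in point 3: annotate with one small literal at a time and checkpoint to force evaluation (field names and paths are illustrative):

```python
import hail as hl


def annotate_in_fragments(
    mt: hl.MatrixTable, chunks: list[dict[str, str]], tmp_prefix: str
) -> hl.MatrixTable:
    for i, chunk in enumerate(chunks):
        mapping = hl.literal(chunk)  # each literal is now well under the 20MB cap
        mt = mt.annotate_rows(gene_id=mapping.get(mt.gene_symbol, mt.gene_id))
        # checkpointing forces evaluation, so the pending expression stays tiny;
        # the MTs are only a few MB, so repeated writes are cheap
        mt = mt.checkpoint(f'{tmp_prefix}/gene_id_chunk_{i}.mt', overwrite=True)
    return mt
```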
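Finally, a sketch of the standalone entrypoint replacing the query_command wrapper (the module and argument names are assumptions about the layout, not the PR's actual interface):

```python
import argparse

# hypothetical import; the real annotate_cohort lives wherever the script does
from annotation import annotate_cohort


def cli_main() -> None:
    parser = argparse.ArgumentParser(description='Standalone AnnotateCohort runner')
    parser.add_argument('--mt', required=True, help='input MatrixTable path')
    parser.add_argument('--out', required=True, help='output path for the annotated MT')
    parser.add_argument(
        '--gencode-gtf', required=True, help='pinned GTF in the common bucket (from config)'
    )
    args = parser.parse_args()
    annotate_cohort(mt_path=args.mt, out_path=args.out, gtf_path=args.gencode_gtf)


if __name__ == '__main__':
    cli_main()
```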

@codecov-commenter commented on Oct 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.42%. Comparing base (e679675) to head (44d02de).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #960   +/-   ##
=======================================
  Coverage   78.42%   78.42%           
=======================================
  Files          10       10           
  Lines        1794     1794           
=======================================
  Hits         1407     1407           
  Misses        387      387           


setup.py: review comment resolved (outdated)
Co-authored-by: EddieLF <34049565+EddieLF@users.noreply.github.com>
@EddieLF (Contributor) left a comment
Approving! The solution to the size issue sounds really good, and the code changes look fine. I'm not deeply familiar with this pipeline, but if it's currently broken then it's only up from here, right? Thanks Matt

@MattWellie MattWellie merged commit 8240817 into main Oct 28, 2024
4 checks passed
@MattWellie MattWellie deleted the get_gencode_annotation_file_once branch on Oct 28, 2024 at 11:42