Fix broken gCNV Annotation pipeline #960

Merged
merged 8 commits into main from get_gencode_annotation_file_once on Oct 28, 2024

Conversation

@MattWellie (Contributor) commented on Oct 28, 2024

Problem!

  • We can't complete gCNV runs anymore; they're broken

Reason!

See https://batch.hail.populationgenomics.org.au/batches/518770/jobs/358 as an example: nothing but a miserable Hail exception dump.

@EddieLF tracked the issue to the worker batch, e.g. https://batch.hail.populationgenomics.org.au/batches/518781/jobs/4

Caused by: com.fasterxml.jackson.core.exc.StreamConstraintsException: 
String length (20051112) exceeds the maximum length (20000000)

I did some poking around and tracked the problem down to here. In the gCNV and GATK-SV workflows we load a GTF file from Gencode, parse it into a dictionary, then use that dict to update gene symbols to ENSG IDs. This dictionary grows over time, and it now contains upwards of 40k entries, each a key: value mapping of long strings. Serialised into a single expression, this exceeds Spark's 20MB string cap when creating and evaluating expressions.
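For illustration only (not the repo's actual code), this is roughly the failing pattern, with a toy MatrixTable and a one-entry dict standing in for the real inputs:

```python
import hail as hl

# Toy reproduction of the failure mode: the whole symbol -> ENSG dict is
# inlined as a single Hail literal. At ~40k long-string entries the
# serialised expression passes 20MB, triggering the jackson
# StreamConstraintsException above.
gene_map = {'DDX11L1': 'ENSG00000223972'}  # imagine ~40k of these entries
mt = hl.balding_nichols_model(1, 5, 5)     # stand-in MatrixTable
mt = mt.annotate_rows(gene_symbol='DDX11L1')
mapping = hl.literal(gene_map)
mt = mt.annotate_rows(gene_id=mapping.get(mt.gene_symbol, 'unknown'))
```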

This change covers a few things:

  1. We currently fetch the Gencode GTF file each time AnnotateCohort runs in the gCNV or GATK-SV pipelines, copy it to a local file, then throw it away. This PR adds a script we can run to fetch a specific version of the file and copy it into the common bucket (sketched after this list). Its location is added as a config entry for the gCNV and GATK-SV pipelines, which should save a few minutes each run.
  2. The GTF parser gains an optional chunk_size argument. Instead of returning a single dictionary, it can break that monolith up into smaller dicts (see the parser sketch below).
  3. Using this collection of smaller dicts in the gCNV pipeline, we annotate with each fragment in turn, then write a checkpoint (see the loop sketch below). Each checkpoint forces Hail's lazy evaluation, so the expression being passed around stays tiny. The MTs are generally only a few MB in size, so writing multiple checkpoints isn't an issue. GATK-SV hasn't failed this way yet, so I've not made the same edits there.
  4. The AnnotateCohort/Dataset script is currently run using query_command, and I really hate that, so I've written a command-line ArgParse entrypoint (sketched below); it now works as a standalone script.
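A minimal sketch of what the one-off fetch script could look like, assuming the standard Gencode download URL and a gsutil copy into the common bucket (names and paths here are illustrative, not the PR's exact code):

```python
import argparse
import subprocess
import urllib.request

# Assumed Gencode FTP layout; the real script may build the URL differently.
GENCODE_URL = (
    'https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/'
    'release_{v}/gencode.v{v}.annotation.gtf.gz'
)


def main() -> None:
    parser = argparse.ArgumentParser(
        description='Fetch a pinned Gencode GTF into the common bucket'
    )
    parser.add_argument('--version', required=True, help='Gencode release, e.g. 44')
    parser.add_argument('--dest', required=True, help='gs:// path in the common bucket')
    args = parser.parse_args()

    local = f'gencode.v{args.version}.annotation.gtf.gz'
    urllib.request.urlretrieve(GENCODE_URL.format(v=args.version), local)
    # one copy into the bucket; pipeline runs then read it from config
    subprocess.run(['gsutil', 'cp', local, args.dest], check=True)


if __name__ == '__main__':
    main()
```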
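The chunked parser could look something like this sketch (function and argument names are assumptions; chunk_size=None preserves the old single-dict behaviour, just wrapped in a list):

```python
import gzip
import re

# pulls key "value" pairs out of the GTF attributes column
ATTR_RE = re.compile(r'(\w+) "([^"]+)"')


def parse_gtf_to_dicts(gtf_path: str, chunk_size: int | None = None) -> list[dict[str, str]]:
    """Map gene symbol -> ENSG ID, split into dicts of at most chunk_size entries."""
    chunks: list[dict[str, str]] = [{}]
    with gzip.open(gtf_path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):
                continue
            fields = line.split('\t')
            if fields[2] != 'gene':
                continue
            attributes = dict(ATTR_RE.findall(fields[8]))
            if chunk_size and len(chunks[-1]) >= chunk_size:
                chunks.append({})  # start a new fragment once the current one is full
            # strip the ENSG version suffix, e.g. ENSG00000223972.5 -> ENSG00000223972
            chunks[-1][attributes['gene_name']] = attributes['gene_id'].split('.')[0]
    return chunks
```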
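And the per-fragment annotation loop, roughly as described in point 3: annotate with one small literal at a time and checkpoint to force evaluation (field names and paths are illustrative):

```python
import hail as hl


def annotate_in_fragments(
    mt: hl.MatrixTable, chunks: list[dict[str, str]], tmp_prefix: str
) -> hl.MatrixTable:
    for i, chunk in enumerate(chunks):
        mapping = hl.literal(chunk)  # each literal is now well under the 20MB cap
        mt = mt.annotate_rows(gene_id=mapping.get(mt.gene_symbol, mt.gene_id))
        # checkpointing forces evaluation, so the pending expression stays tiny;
        # the MTs are only a few MB, so repeated writes are cheap
        mt = mt.checkpoint(f'{tmp_prefix}/gene_id_chunk_{i}.mt', overwrite=True)
    return mt
```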
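Finally, a sketch of the standalone entrypoint replacing the query_command wrapper (the module and argument names are assumptions about the layout, not the PR's actual interface):

```python
import argparse

# hypothetical import; the real annotate_cohort lives wherever the script does
from annotation import annotate_cohort


def cli_main() -> None:
    parser = argparse.ArgumentParser(description='Standalone AnnotateCohort runner')
    parser.add_argument('--mt', required=True, help='input MatrixTable path')
    parser.add_argument('--out', required=True, help='output path for the annotated MT')
    parser.add_argument(
        '--gencode-gtf', required=True, help='pinned GTF in the common bucket (from config)'
    )
    args = parser.parse_args()
    annotate_cohort(mt_path=args.mt, out_path=args.out, gtf_path=args.gencode_gtf)


if __name__ == '__main__':
    cli_main()
```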

@codecov-commenter commented on Oct 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.42%. Comparing base (e679675) to head (44d02de).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #960   +/-   ##
=======================================
  Coverage   78.42%   78.42%           
=======================================
  Files          10       10           
  Lines        1794     1794           
=======================================
  Hits         1407     1407           
  Misses        387      387           


setup.py: review comment resolved (outdated)
Co-authored-by: EddieLF <34049565+EddieLF@users.noreply.github.com>
@EddieLF (Contributor) left a comment
Approving! The solution to the size issue sounds really good, and the code changes look fine. I'm not deeply familiar with this pipeline, but if it's currently broken then it's only up from here, right? Thanks Matt

@MattWellie MattWellie merged commit 8240817 into main Oct 28, 2024
4 checks passed
@MattWellie MattWellie deleted the get_gencode_annotation_file_once branch on Oct 28, 2024 at 11:42