
Add gnomAD v4 short variants #1178

Merged 106 commits on Nov 1, 2023
Conversation

@mattsolo1 (Contributor) commented Sep 15, 2023

Addition of gnomAD v4 short variants

This is a major update adding support for v4 short variants to the browser.

Pipeline changes

This PR includes updates to the gnomAD pipeline toolchain and tests, modifications to the pipeline configuration, and the start of the v4 variant/coverage pipeline.

This is still a draft and hasn't been checked for breaking changes to other pipelines, but feel free to take a look and provide initial feedback.

Toolchain changes

Dependency management with Poetry

Previously we had only a requirements.txt.

Dependencies are now installed with Poetry and declared in pyproject.toml.

Poetry creates a lock file (poetry.lock) with exact versions.

Then generate requirements.txt with:

poetry export --with dev > requirements.txt

or by running update-requirements.sh.

UPDATE: you don't have to use Poetry if you don't touch dependencies; just use requirements.txt.
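
For reference, a typical day-to-day workflow under this setup (standard Poetry commands; update-requirements.sh is assumed to just wrap the export above):

poetry install              # install pinned versions from poetry.lock
poetry add <package>        # add or update a dependency in pyproject.toml
./update-requirements.sh    # regenerate requirements.txt for non-Poetry users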

Linting

Pylint -> Ruff

  • Faster
  • Multiple tools in one (e.g., flake8 + isort)
  • Used by many open-source projects
  • Works better with editors

https://docs.astral.sh/ruff/
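
For reference, the whole lint pass then runs as a single command (Ruff's standard CLI):

ruff check .        # lint
ruff check . --fix  # lint and auto-fix what it can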

Black

I haven't done this yet, but I'd suggest decreasing the line length from 120 to 88 (Black's default)?

Type checking & input validation

Classes are defined with @attr.define

Example:

from typing import List

import attr

@attr.define
class Post:
    text: str

@attr.define
class User:
    user_id: int
    username: str
    email: str
    age: int
    posts: List[Post]


new_user = User(
    user_id=1,
    username="Alice",
    email="alice@example.com",
    age=25,
    status="active")  # pyright will raise a type error here: "status" is not a field

Advantages:

Nice editor experience!

More readable (compared to using dictionaries or having no types).

Use a create() factory method to instantiate classes instead of __init__ (a minimal sketch follows).
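
A minimal sketch of what such a factory might look like (the create() signatures in the actual pipeline code may differ):

import attr

@attr.define
class Post:
    text: str

    @classmethod
    def create(cls, text: str) -> "Post":
        # a factory keeps normalization and defaults in one place,
        # instead of scattering them across call sites
        return cls(text=text.strip())

post = Post.create("  hello  ")  # Post(text='hello')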

Use the structure and unstructure utility functions from cattrs to (see the example after this list):

  • deserialize/serialize
  • validate inputs
  • build nested types
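
For example (a minimal sketch using the Post/User classes above; cattrs.structure and cattrs.unstructure are the real cattrs entry points):

import cattrs

raw = {
    "user_id": 1,
    "username": "Alice",
    "email": "alice@example.com",
    "age": 25,
    "posts": [{"text": "hello"}],
}

# structure builds the nested Post objects and converts/validates fields
user = cattrs.structure(raw, User)

# unstructure turns it back into plain dicts, e.g. for JSON output
assert cattrs.unstructure(user) == raw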

Logging

Suggestion: use Loguru instead of the standard library's logging (see their website for the advantages).
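
A taste of why (real Loguru API; the messages here are just illustrative):

from loguru import logger

logger.info("Annotating {} variants", 1_000_000)  # brace-style formatting built in
logger.add("pipeline.log", rotation="100 MB")     # one line to add a rotating file sink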

Input/output validation

Using ChatGPT, Hail schemas, and the attrs/cattrs libraries, it is possible to quickly create classes for validating pipeline I/O.

e.g.

@attr.define
class Step3Variant:
    locus: Locus
    alleles: list[str]
    grpmax: list[Grpmax]
    rsids: set[str] | None
    rf: Rf
    in_silico_predictors: InSilicoPredictors
    variant_id: str
    colocated_variants: ColocatedVariants
    gnomad: Gnomad  # gnomAD-specific data
    subsets: set[str]
    flags: set[str]
    transcript_consequences: list[TranscriptConsequence] | None

Then validate:

import hail as hl
from cattrs import structure_attrs_fromdict

# gnomad_v4_variant_pipeline and ht_to_json are defined elsewhere in the pipeline code

def test_validate_step3_output():
    output_path = gnomad_v4_variant_pipeline.get_task(
        "annotate_gnomad_v4_exome_transcript_consequences"
    ).get_output_path()
    ht = hl.read_table(output_path)
    result = ht_to_json(ht)
    [structure_attrs_fromdict(variant, Step3Variant) for variant in result]

This checks for missing data too.
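
For example (a self-contained sketch; the class here is illustrative rather than taken from the pipeline):

import attr
from cattrs import structure_attrs_fromdict

@attr.define
class Example:
    variant_id: str
    flags: set[str]

# flags is missing, so structuring raises instead of silently
# producing a partial record
structure_attrs_fromdict({"variant_id": "1-55516888-G-GA"}, Example)  # raises TypeError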

Committing Hail schemas to source control

write_schemas.py

Iterates through all Hail inputs/outputs for a given pipeline and writes their schemas to the schemas directory.

e.g.

/schemas/gnomad_v4_variants/annotate_gnomad_v4_exome_transcript_consequences/output/gnomad_v4_variants_annotated_2.ht.schema

----------------------------------------
Global fields:
    'freq_meta': array<dict<str, str>> 
    'freq_index_dict': dict<str, int32> 
    'faf_meta': array<dict<str, str>> 
    'faf_index_dict': dict<str, int32> 
    'freq_sample_count': array<int32> 
    'filtering_model': struct {
        model_name: str, 
        score_name: str, 
        feature_medians: dict<tuple (
            str
        ), struct {
etc

This is analogous to our Jest snapshots. If anything in the pipeline changes, the schema change will be tracked in source control. And it's nice to have as a reference.
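
A minimal sketch of the idea (the real write_schemas.py may differ; Table.describe's handler argument is real Hail API):

import hail as hl

def write_schema(table_path: str, schema_path: str) -> None:
    ht = hl.read_table(table_path)
    with open(schema_path, "w") as f:
        # redirect the schema text into a file instead of stdout
        ht.describe(handler=f.write)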

https://github.com/broadinstitute/gnomad-browser/pull/1178/files#diff-0811404f3bc0e96d8cd34bae975ef717b89943056daace9d6a5c280515412762R41-R54

Modifications to the core Pipeline class

Improved IO config

Previously, one would run a pipeline with --output-root, a required argument. Data inputs were hardcoded, or relied on other pipelines that might not exist. This made it hard to develop pipelines in isolation.

The Pipeline class has been changed to make this argument optional; it now accepts a config object that allows specifying both input and output directories.

A new PipelineMock class makes it possible to stub other pipelines' outputs.

Together, these changes make it easy to switch between small local dev datasets and full production datasets.
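
A hypothetical sketch of how this fits together (the class and argument names here are illustrative; the actual config object and PipelineMock API in the pipeline code may differ):

# hypothetical config: small local dataset in, local output dir out
config = {
    "input_root": "data/dev",
    "output_root": "out/dev",  # no longer required on the CLI
}

pipeline = Pipeline(name="gnomad_v4_variants", config=config)

# stub an upstream pipeline's outputs so this one can run in isolation
clinvar_pipeline = PipelineMock.create(
    "clinvar_grch38",
    {"annotated_variants": "data/dev/clinvar_annotated.ht"},
)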

@mattsolo1 force-pushed the gnomad-v4 branch 2 times, most recently from 604f868 to ab01ef3, on September 22, 2023.
@mattsolo1 changed the title from "Add gnomAD v4 variant and coverage pipeline" to "Pipeline tool changes for gnomAD v4" on Sep 28, 2023.
@rileyhgrant (Contributor) left a comment:

This is looking quite cool, lots of great stuff that I'm excited to get acquainted with.

In general I'm pretty excited about the typing you've been adding, in addition to making it far easier to create tests for pipelines locally with the dataset config.

In regards to the tooling:

  • Poetry: I've had a few difficulties with this so far. Likely a significant amount of this is user error and/or some quirk of my machine, but Poetry does not consistently activate a virtual environment with poetry shell for me, even to the point of telling me it's active when running which python shows that it most certainly is not. So that's just to say I'm glad that, at least if I'm not messing with dependencies, it can still output a generic requirements.txt file. I'm curious to hear Phil's thoughts on Poetry.
  • Ruff: I'm sold.
  • Black: I'm sold on shortening the line length. It appears 88 is their default value, so I'm on board with using their choice, as they're in the business of having opinions.
  • Type checking with pyright + attrs: this syntax does seem nice for quickly defining types.
  • Loguru: I'm sold.
  • I/O validation with types: yet again, I'm sold. Seems like a one-time big cost, then a bit of upkeep, for huge benefits.
  • Committing Hail schemas to source control: this seems like it could have some nice benefits. What is the intended workflow? Is it to locally run the Python script to generate the schemas, then check for a diff between the old one and the new one, before running a pipeline? In that case, do we remember to manually run this on our machines and check the new schemas into source control, with some CI check for a difference à la automated Jest tests? Or is this more to always be able to reference the old schema? I know I've spent enough time writing a tiny script to run describe on a dataset because I didn't have the schema handy.
  • Dataset configs: I see how this is useful for being able to create tests for entire pipelines. Would a small test for a given pipeline require replicating the entire pipeline in a test file (running all the same tasks in the same order) with a different dataset config that dictates different input and output? Or is the benefit here being able to test the pipeline code at all, rather than what a particular pipeline does in its own logic?

In general, I haven't gotten my hands too dirty with these proposed changes in a firsthand way yet, but I'm excited about the suggestions and to learn from the knowledge you've gained from other projects.

@mattsolo1 (Contributor, Author) commented:

Thank you for the feedback, Riley!

> Poetry, I've had a few difficulties with this so far

Interesting. Poetry isn't a hard requirement for me, but I do find it makes it significantly easier to keep dependencies up to date and to know exactly what versions are being used. I'm happy to remove this if people would like, in favor of a simpler requirements.txt without pinned versions.

> Committing Hail schemas to source control, this seems like it could have some nice benefits, what is the intended workflow with this?
>
> do we remember to manually run this on our machines and check the new schemas into source control, with some CI check for a difference à la automated Jest tests?

Basically, yes. I would see us having a shared "test dataset" in GCS somewhere that is downloaded locally with a pipeline step; the pipeline is executed, and then a diff is checked in CI. Ideally this is not strictly manual; the idea is to mimic Jest-style snapshots. And yes, having a browsable schema in source control is a pretty significant bonus, IMHO.

> Would a small test for a given pipeline require replicating the entire pipeline in a test file (running all the same tasks in the same order) with a different dataset config that dictates different input and output? Or is the benefit here being able to test the pipeline code at all, rather than what a particular pipeline does in its own logic?

Not quite sure what is being asked here, but happy to chat in person to clarify. I think it's the latter: having a systematic way to run a small dataset through the pipeline is useful for prototyping and making changes.

@phildarnowsky-broad (Contributor) left a comment:

🎉
