Add gnomAD v4 short variants #1178

Merged Nov 1, 2023 (106 commits). Changes shown from 104 commits.

Commits
400144e
Add v4 variants file copied from v3
mattsolo1 Sep 7, 2023
3c573bb
Add v4 variant dataset script
mattsolo1 Sep 11, 2023
c485a7b
Add v4 coverage to data-pipeline
mattsolo1 Sep 11, 2023
387f8f7
Add poetry toml, lockfile, .gitignore
mattsolo1 Sep 18, 2023
4a1f830
Add input ht types and basic validation test
mattsolo1 Sep 18, 2023
b9129b9
Add data_pipline types and helpers
mattsolo1 Sep 19, 2023
71d2b3b
Add ht schema generation workflow
mattsolo1 Sep 19, 2023
7b59a9e
Add input validation for step 1 output
mattsolo1 Sep 19, 2023
729804b
Add output validation/schemas for step 2
mattsolo1 Sep 19, 2023
7dab98f
Add io validation for step 3
mattsolo1 Sep 21, 2023
19d02a5
Use ruff for linting instead of pylint
mattsolo1 Sep 22, 2023
7bbb418
Add pyright/pytest config
mattsolo1 Sep 26, 2023
87d27cf
Develop pipeline config
mattsolo1 Sep 26, 2023
c385c33
Add working pipeline test
mattsolo1 Sep 26, 2023
eeb2353
Add PipelineMock for faking outputs from other pipelines
mattsolo1 Sep 28, 2023
c2115e9
Simplify pipeline config
mattsolo1 Sep 28, 2023
219e5da
Add mock_data test configuration
mattsolo1 Sep 29, 2023
cf7f5d2
Update dependencies
mattsolo1 Sep 29, 2023
c12c101
Move input validation out of tests, into pipeline code itself
mattsolo1 Oct 2, 2023
3586d6b
Update global input schema for new v4 mock variant dataset
mattsolo1 Oct 16, 2023
2555861
Update variant input schema for input schema for new v4 mock variant …
mattsolo1 Oct 17, 2023
0dbd8e3
Update to latest mock table 2023-10-16
mattsolo1 Oct 17, 2023
9986918
Add missing region flags
mattsolo1 Oct 17, 2023
be45026
Make pipeline config optional
mattsolo1 Oct 18, 2023
dd249f7
Update v4 variants validation and scripts
mattsolo1 Oct 18, 2023
349385f
Add development dependencies to requirements
mattsolo1 Oct 19, 2023
74d131b
Export variants and coverage to elasticsearch
mattsolo1 Oct 20, 2023
d26e828
Add combined exomes/genomes to v4 pipeline
mattsolo1 Oct 25, 2023
128a5b0
Add v4 elasticsearch queries
mattsolo1 Oct 25, 2023
fa5efdf
Update v4 short variant pipeline
mattsolo1 Oct 27, 2023
c437a03
Add v4 metadata
mattsolo1 Oct 27, 2023
c91fe78
Add graphql queries and types for gnomad v4
mattsolo1 Oct 27, 2023
e04ff5c
Add v4 to browser, gene page and variant pages working
mattsolo1 Oct 27, 2023
573770d
Configure starting selectedMetric based on dataset type
mattsolo1 Oct 28, 2023
226ea58
Make v4 the default search dataset
mattsolo1 Oct 28, 2023
0011603
Set v3 coverage as the v4 genome coverage index
mattsolo1 Oct 28, 2023
cd5bea2
Update pipeline tables to current state
mattsolo1 Oct 28, 2023
ad77f5e
Add new in silico predictor thresholds
mattsolo1 Oct 28, 2023
a34c27f
Add AC0 filter to exome variants, get transcript query working
mattsolo1 Oct 28, 2023
de45b89
Add datasetId prop to TranscriptCoverageTrack
mattsolo1 Oct 28, 2023
ac849da
Make v4 the default dataset in the router
mattsolo1 Oct 28, 2023
6553d88
Add v4 read data
mattsolo1 Oct 28, 2023
32d712b
Update read dockerfiles for building on Mx macs
mattsolo1 Oct 28, 2023
ae8256a
Update homepage examples
mattsolo1 Oct 28, 2023
93e7eb2
Add v4 to dataset selector dropdown
mattsolo1 Oct 28, 2023
7b441ee
Add redis deployment instructions to readme
mattsolo1 Oct 29, 2023
9785ba1
Add meta prefix to reads dataset config
mattsolo1 Oct 29, 2023
4224756
Fix v3 variant metrics
mattsolo1 Oct 29, 2023
8006710
Add HGDP table support for both v3 and v4 pop labels
mattsolo1 Oct 29, 2023
f4c4374
Remove console logs
mattsolo1 Oct 29, 2023
99e87c1
Add v4 gene constraint
mattsolo1 Oct 29, 2023
1b01079
Check for null in site quality metrics component
mattsolo1 Oct 30, 2023
8c1ebc5
Update liftover to v4
mattsolo1 Oct 30, 2023
ed8e58b
Fix constraint annotation issue
mattsolo1 Oct 30, 2023
f1b29b6
Fix weird runtime bugs
mattsolo1 Oct 30, 2023
23111ce
Fix home page examples
mattsolo1 Oct 30, 2023
5b19ea4
Combine variant page populations properly
mattsolo1 Oct 30, 2023
3654c00
Handle constraint flags with no description
mattsolo1 Oct 30, 2023
3e95b3c
All transcript_version as optional TranscriptConsequence field for va…
mattsolo1 Oct 30, 2023
3350774
Capitalize remaining population
mattsolo1 Oct 30, 2023
77443f1
Enable local ancestry for v4
mattsolo1 Oct 30, 2023
42c56af
Add RMC example to home page
mattsolo1 Oct 30, 2023
11e4ec6
Handle missing transcript_version on variant page
mattsolo1 Oct 30, 2023
3dc0603
Add empty array default to exome local_ancestry_populations in API
mattsolo1 Oct 30, 2023
6c8cc09
Make v4 the default short variant for v4 SVs
mattsolo1 Oct 30, 2023
f3cd8f4
use Always pull policy to help development
sjahl Oct 30, 2023
3bccd32
Add forum link to navbar
mattsolo1 Oct 30, 2023
93172fd
Add CNV example
mattsolo1 Oct 30, 2023
3cf9332
SVs/CNVs should have constraint
mattsolo1 Oct 30, 2023
3e54edd
Make ES timeout 60 temporarily
mattsolo1 Oct 30, 2023
38ff7b3
Make metadata tests pass
mattsolo1 Oct 31, 2023
38b94c0
Update paths for v4 variant datasets
mattsolo1 Oct 31, 2023
1e231bf
Rename schemas to enable diff for incoming schema
mattsolo1 Oct 31, 2023
c1a65da
Update component snapshots
mattsolo1 Oct 31, 2023
75ff4ba
Run linter and clean up pipeline files
mattsolo1 Oct 31, 2023
fb07ab2
Make Searchbox test pass by handling default v2 dataset correctly
mattsolo1 Oct 31, 2023
190e815
Make variant page tests pass, removing falsy values from variant factory
mattsolo1 Oct 31, 2023
415490e
Run formatter
mattsolo1 Oct 31, 2023
663a988
Appease linter
mattsolo1 Oct 31, 2023
5e384ca
Add faf95_joint to variant factory
mattsolo1 Oct 31, 2023
5f7756d
Component variant page component test
mattsolo1 Oct 31, 2023
bc27f30
modify configs for using a clustered redis
sjahl Jun 9, 2023
3cc6eff
document redis installation
sjahl Oct 27, 2023
c08d240
update to ioredis for rate limit db
sjahl Oct 27, 2023
8a4c959
Support mulpitle connection modes
sjahl Oct 30, 2023
f8b265e
Update Downloads v4, with a few placeholders
rileyhgrant Oct 30, 2023
5b56e91
deploy/manifests/elasticsearch/elasticsearch.yaml.jinja2
sjahl Oct 18, 2023
2708f23
Handle missing age data and allele balance on variant page
mattsolo1 Oct 31, 2023
6a24c20
Fix in silico scores based on Qin's feedback
mattsolo1 Oct 31, 2023
ad47b82
Update read gencode preparation & disk preparation instructions
mattsolo1 Oct 31, 2023
5143353
Update readviz frontend config
mattsolo1 Oct 31, 2023
42a461e
Update GRCh38 ClinVar pipeline
rileyhgrant Nov 1, 2023
19e563d
Address review
mattsolo1 Nov 1, 2023
fc46cd7
Update v4 downloads with release links
rileyhgrant Nov 1, 2023
ec6aaec
Remove v4 constraint section on Downloads page
rileyhgrant Nov 1, 2023
bdbff7c
Update v4 variant queries to include all variants
rileyhgrant Nov 1, 2023
7dba414
Update Stats page per comments
rileyhgrant Oct 31, 2023
b8a26d7
Update About page per comments
rileyhgrant Oct 31, 2023
5185cf0
Update Help/FAQ page per comments
rileyhgrant Oct 31, 2023
ade1105
Update Home page per comments
rileyhgrant Oct 31, 2023
581a63c
Update stylings of Gene Page button
rileyhgrant Nov 1, 2023
cb9700b
Remove v4 from hasConstraint metadata
rileyhgrant Nov 1, 2023
0429073
Remove references to v4 constraint from Help page
rileyhgrant Nov 1, 2023
6084fdf
Add link to v2 if constraint is missing
mattsolo1 Nov 1, 2023
84605c6
switch to a readwritemany volume for cache
sjahl Nov 1, 2023
4d411a2
bump rate limits for cache warming
sjahl Nov 1, 2023
10 changes: 5 additions & 5 deletions .github/workflows/data-pipeline-ci.yml
@@ -18,7 +18,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: 3.7
python-version: 3.9
phildarnowsky-broad marked this conversation as resolved.
- name: Use pip cache
uses: actions/cache@v2
with:
@@ -29,10 +29,10 @@ jobs:
- name: Install dependencies
run: |
pip install wheel
pip install -r requirements-dev.txt
pip install hail
pip install -r data-pipeline/requirements.txt
- name: Check formatting
run: black --check data-pipeline/src/data_pipeline
- name: Run Pylint
run: pylint --disable=fixme data-pipeline/src/data_pipeline
- name: Run Ruff
run: ruff data-pipeline/src/data_pipeline
- name: Run Pyright
run: pyright --project data-pipeline
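
The checks in this workflow can also be run locally before pushing. The following is a minimal sketch, not part of the PR. It assumes the Black, Ruff, and Pyright steps are the ones that remain after this change (the +/- markers are not visible in this view) and that the three tools are installed and on `PATH`. The names `CHECKS`, `build_commands`, and `run_checks` are hypothetical.

```python
import shlex
import subprocess

# Command strings copied from the workflow steps shown above.
# Assumption: these are the steps kept after the change.
CHECKS = {
    "black": "black --check data-pipeline/src/data_pipeline",
    "ruff": "ruff data-pipeline/src/data_pipeline",
    "pyright": "pyright --project data-pipeline",
}


def build_commands(names=None):
    """Return one argv list per selected check (all checks by default)."""
    selected = list(CHECKS) if names is None else list(names)
    return [shlex.split(CHECKS[name]) for name in selected]


def run_checks(names=None):
    """Run each check from the repository root; return the tools that failed."""
    failed = []
    for argv in build_commands(names):
        try:
            result = subprocess.run(argv)
        except FileNotFoundError:
            # The tool is not installed; report it as a failure.
            failed.append(argv[0])
            continue
        if result.returncode != 0:
            failed.append(argv[0])
    return failed
```

Calling `run_checks()` from the repository root returns an empty list when every check passes; `run_checks(["ruff"])` runs only the linter.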
4 changes: 2 additions & 2 deletions browser/about/about.md
@@ -13,7 +13,7 @@ The aggregation and release of summary data from the exomes and genomes collecte
## Stats

- v4 release is composed of 730,947 exomes and 76,215 genomes (GRCh38)
- gnomAD v4 structural variants (SV) represent 63,057 genomes (GRCh38)
- gnomAD v4 copy number variants (CNV) represent variants in less than 1% of 464,566 exomes (GRCh38)
- gnomAD v4 structural variants (SV) represent 63,046 genomes (GRCh38)
- gnomAD v4 copy number variants (CNV) represent variants in less than 1% of 464,297 exomes (GRCh38)

For more Stats on gnomAD v4 please see our [stats page](/stats)
@@ -2,4 +2,4 @@
question: 'How was the expected number of variants determined?'
---

We used a mutational model that accounts for local sequence context, CpG methylation, and sequencing depth to predict the number of expected single nucleotide variants per functional class per gene. More details can be found in the help section on [gene constraint](/help/constraint) and in [Karczewski _et al._ Nature 2020](https://doi.org/10.1038/s41586-020-2308-7). Note that the expected variant counts for bases with a median depth <1 were removed from the totals. In v4, we applied our mutational model only to sites with a median depth in the exomes ≥30.
We used a mutational model that accounts for local sequence context, CpG methylation, and sequencing depth to predict the number of expected single nucleotide variants per functional class per gene. More details can be found in the help section on [gene constraint](/help/constraint) and in [Karczewski _et al._ Nature 2020](https://doi.org/10.1038/s41586-020-2308-7). Note that the expected variant counts for bases with a median depth <1 were removed from the totals.
@@ -2,4 +2,4 @@
question: 'What are the fields included in constraint files?'
---

Descriptions of the fields in these files can be found in the [README file](/downloads#v4-variants) supplied with the download.
Descriptions of the fields in these files can be found in the Supplementary Dataset 11 section on pages 74-77 of the [Supplementary Information](https://www.nature.com/articles/s41586-020-2308-7#Sec12) of [_The mutational constraint spectrum quantified from variation in 141,456 humans._ Nature 581, 434–443 (2020)](https://doi.org/10.1038/s41586-020-2308-7).
@@ -2,16 +2,14 @@
question: 'Why are constraint metrics missing for this gene or annotated with a note?'
---

Genes that were outliers in certain assessments will not have constraint metrics or will be flagged with a note warning of various error modes. Please note that these assessments were applied to the canonical transcripts of the genes. If a gene was not annotated as a protein-coding gene in GENCODE v19, we did not calculate constraint. The following list describes the reason names given in the constraint_flag column of the [constraint files](/downloads#v4-constraint):
Genes that were outliers in certain assessments will not have constraint metrics or will be flagged with a note warning of various error modes. Please note that these assessments were applied to the canonical transcripts of the genes. If a gene was not annotated as a protein-coding gene in GENCODE v19, we did not calculate constraint. The following list describes the reason names given in the constraint_flag column of the [constraint files](/downloads#v2-constraint):

- `no_variants`: Zero observed synonymous, missense, pLoF variants
- `no_exp_lof`: Zero expected pLoF variants
- `outlier_lof`: Number of pLoF variants is significantly different than expectation
- `no_exp_mis`: Zero expected missense variants
- `outlier_mis`: Number of missense variants is significantly different than expectation
- `no_exp_syn`: Zero expected synonymous variants
- `outlier_syn`: Number of synonymous variants is significantly different than expectation
- no_variants: Zero observed synonymous, missense, pLoF variants
- no_exp_lof: Zero expected pLoF variants
- lof_too_many: More pLoF variants than expected
- no_exp_mis: Zero expected missense variants
- mis_too_many: More missense variants than expected
- no_exp_syn: Zero expected synonymous variants
- syn_outlier: More or fewer synonymous variants than expected

Possible reasons that one might observe the deviations listed above include mismapped reads due to homologous regions or poor quality sequencing data.

Currently, constraint scores are only available for autosomes. We will release scores for chromosomes X in the near future.
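
For readers consuming the `constraint_flag` column programmatically, the flag lists in this hunk can be expressed as a lookup table. This is a minimal sketch, not code from the PR. It includes the flag names from both sides of the diff, and the names `CONSTRAINT_FLAG_DESCRIPTIONS` and `describe_flags` are hypothetical. It also assumes the flags for a gene have already been split into a list, since the on-disk delimiter is not shown here.

```python
# Descriptions copied from the two flag lists in the hunk above.
CONSTRAINT_FLAG_DESCRIPTIONS = {
    "no_variants": "Zero observed synonymous, missense, pLoF variants",
    "no_exp_lof": "Zero expected pLoF variants",
    "no_exp_mis": "Zero expected missense variants",
    "no_exp_syn": "Zero expected synonymous variants",
    # Names from the backticked list in the hunk.
    "outlier_lof": "Number of pLoF variants is significantly different than expectation",
    "outlier_mis": "Number of missense variants is significantly different than expectation",
    "outlier_syn": "Number of synonymous variants is significantly different than expectation",
    # Names from the plain list in the hunk.
    "lof_too_many": "More pLoF variants than expected",
    "mis_too_many": "More missense variants than expected",
    "syn_outlier": "More or fewer synonymous variants than expected",
}


def describe_flags(flags):
    """Map each flag to its description, falling back for unknown flags."""
    return [
        CONSTRAINT_FLAG_DESCRIPTIONS.get(flag, f"Unrecognized flag: {flag}")
        for flag in flags
    ]
```

The fallback branch keeps a flag with no known description from raising an error, so new or unexpected flags still produce a displayable string.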
@@ -2,4 +2,4 @@
question: 'Why are there fewer variants in the constraint table than displayed on the gene page?'
---

We only included single nucleotide variants that were found in the MANE Select (v3 and v4 on GRCh38) or canonical (ExAC and v2 on GRCh37/hg19) transcript of the gene. On the gene page, variants found in all transcripts are displayed. Additionally, both observed and expected variant counts were removed for sites with a median depth < 30.
We only included single nucleotide variants that were found in the canonical (ExAC and v2 on GRCh37/hg19) transcript of the gene. On the gene page, variants found in all transcripts are displayed.
@@ -2,14 +2,15 @@
question: 'What features are not yet in v4 and where can I find them?'
---

The v4.0 release is a minimum viable product (MVP) release, which allows us to get the most critical piece of the gnomAD database, high quality aggregate allele frequencies and updated constraint metrics, to our users as soon as possible. It also means that a few of the existing features found in v2 or v3 are not yet included in v4 but **will be coming soon**.
The v4.0 release is a minimum viable product (MVP) release, which allows us to get the most critical piece of the gnomAD database, high quality aggregate allele frequencies, to our users as soon as possible. It also means that a few of the existing features found in v2 or v3 are not yet included in v4 but **will be coming soon**.

Below is a list of all features not included in the v4 MVP and where to find them in our past datasets until we are able to add them to v4:

<br />

| Non MVP feature | Past versions with this data |
| ----------------------------------------------- | ----------------------------------------------------- |
| Gene constraint | v2 gene page |
| Pext score | v2 gene page |
| Sub-genetic ancestry groups (prevously subpops) | v2 variant page |
| Multi Nucleotide (MNV) calls | v2 variant table and variant page |