diff --git a/browser/help/helpPageTableOfContents.ts b/browser/help/helpPageTableOfContents.ts index 0cf8ccb54..43fb05cc6 100644 --- a/browser/help/helpPageTableOfContents.ts +++ b/browser/help/helpPageTableOfContents.ts @@ -19,6 +19,7 @@ const helpPageTableOfContents: { topics: string[]; faq: FaqTopic[] } = { 'variant-cooccurrence', 'hgdp-1kg-annotations', 'v4-hts', + 'v4-browser-hts', 'exome-capture-tech', 'combined-freq-stats', 'allele-count-zero', diff --git a/browser/help/topics/v4-browser-hts.md b/browser/help/topics/v4-browser-hts.md new file mode 100644 index 000000000..61dd76253 --- /dev/null +++ b/browser/help/topics/v4-browser-hts.md @@ -0,0 +1,249 @@ +--- +id: v4-browser-hts +title: 'gnomAD v4 Browser Hail Tables' +--- + +In addition to our [variants tables](/downloads#v4-variants), we release two data tables underlying the gnomAD browser. These tables enable our users to more easily incorporate gnomAD data into external pipelines and analyses in a manner consistent with what they see in the browser. + +## gnomAD v4.1 exome/genome/joint variant table + +To convert the standard gnomAD variant release tables into a format more suitable for browser display, we join the exome, genome, and joint tables on locus/allele to create a single table. This process ensures that they share the same site-level annotations, thus saving space and optimizing database/API queries. Additionally, allele counts and frequencies are structured in a JSON-like format that is more easily consumable by web applications. The table may also include subset data not visible in the browser. + +Each row (i.e., variant) in this table will have distinct allele frequency information and quality metrics depending on whether it was present in the exome or genome callsets, but will share common annotations such as [VEP annotations](https://useast.ensembl.org/info/docs/tools/vep/index.html) and _in silico_ predictors. 
+ +The script for how this table is created can be found [here](https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/pipelines/gnomad_v4_variants.py). + +## gnomAD v4.1/v2.1 genes tables + +These tables underlie the gene models data seen in the browser, which contains detailed information on exon-coding regions, transcripts, identifiers, gene constraint, and co-occurrence data. The data from these tables are derived from [GENCODE](https://www.gencodegenes.org/human/release_39.html), the [HUGO Gene Nomenclature Committee (HGNC)](https://www.genenames.org/), [MANE transcripts](https://www.ncbi.nlm.nih.gov/refseq/MANE/), [GTEx](https://gtexportal.org/home/) (coming soon), and from gnomAD secondary analyses. + +The script for how this table is created can be found [here](https://github.com/broadinstitute/gnomad-browser/blob/main/data-pipeline/src/data_pipeline/pipelines/genes.py). + +# Browser Hail Table Field Descriptions + +#### gnomAD v4.1 browser variant Hail Table annotations + +Global fields: + +- `mane_select_version`: MANE Select version used to annotate variants. + +Row fields: + +- `locus`: Variant locus. Contains contig and position information. +- `alleles`: Variant alleles. +- `exome`: Struct containing information about variant from exome data. + - `colocated_variants`: Struct containing array of variants located at the same Locus as this variant, e.g. for the variant `1-55051215-G-GA`, the variants `1-55051215-G-A` and `1-55051215-G-T` are colocated. + - `all`: An array containing colocated variants that are present in the entire exome dataset. + - `non_ukb`: An array containing colocated variants that are present in the non-UK Biobank (UKB) subset of the dataset. + - `subsets`: A set containing the subsets this variant is seen in. + - `flags`: A set containing the flags about the region this variant falls in. See `region_flags` description on the v4 Hail Tables [help page](v4-hts#region-flags). 
+ - `freq`: A struct containing variant frequency information for each subset. + - `all`: Struct containing variant frequency information calculated across all samples. + - `ac`: The alternate allele count for this variant calculated across high-quality genotypes (genotypes with depth >= 10, genotype quality >= 20 and minor allele balance > 0.2 for heterozygous genotypes). This is the allele count displayed in the gnomAD browser (not `ac_raw` below). + - `ac_raw`: The alternate allele count for this variant calculated across unadjusted genotypes. + - `an`: Total number of alleles for this locus. + - `hemizygote_count`: Number of hemizygous alternate individuals. + - `homozygote_count`: Number of homozygous alternate individuals. + - `ancestry_groups`: Array containing variant frequency information stratified per genetic ancestry group. + - `id`: Three-letter identifier for this genetic ancestry group, e.g. `amr` or `sas`. + - `ac`: Alternate allele count for this variant in this genetic ancestry group. + - `an`: Total number of alleles for this locus for this genetic ancestry group. + - `hemizygote_count`: Number of hemizygous alternate individuals in this genetic ancestry group. + - `homozygote_count`: Number of homozygous alternate individuals in this genetic ancestry group. + - `non_ukb`: Struct containing variant frequency information from the non-UKB subset. Includes same fields as above struct (`all`). + - `fafmax`: Struct containing information about the maximum FAF. + - `gnomad`: Struct containing information about the fafmax for all of gnomAD for the exome data. + - `faf95_max`: Max FAF value (95% CI). + - `faf95_max_gen_anc`: Genetic ancestry group associated with the max FAF (95% CI). + - `faf99_max`: Max FAF value (99% CI). + - `faf99_max_gen_anc`: Genetic ancestry group associated with the max FAF (99% CI). + - `non_ukb`: Struct containing fafmax information for non-UKB subset (exome data only). 
+ - `age_distribution`: Struct containing age distribution information for variant. + - `het`: Struct containing age distribution information for individuals heterozygous for this variant. Structured to allow easy histogram creation. + - `bin_edges`: Array containing the edges of each bin of the histogram. + - `bin_freq`: Array containing the frequency of individuals in this bin. + - `n_smaller`: Number of individuals with lower age than the lowest bin. + - `n_larger`: Number of individuals with a higher age than the highest bin. + - `hom`: Struct containing age distribution information for individuals homozygous for this variant. Structured to allow easy histogram creation. Contains same fields as `het` above. + - `filters`: Set containing variant QC filters. See `filters` description on the v4 Hail Tables [help page](v4-hts#filters). + - `quality_metrics`: Struct containing variant quality metric histograms information. + - `allele_balance`: Struct containing variant allele balance histograms information. + - `alt_adj`: Struct containing variant allele balance information calculated across high-quality genotypes. Contains same fields as other histogram structs. This data is displayed in the "Allele balance for heterozygotes" histogram in the browser's variant page. + - `alt_raw`: Struct containing variant allele balance information calculated across unadjusted genotypes. Contains same fields as other histogram structs. + - `genotype_depth`: Struct containing information used to display genotype depth (DP) histograms. + - `all_adj`: Struct containing DP information calculated using high-quality genotypes. Contains same fields as other histogram structs. + - `all_raw`: Struct containing DP information calculated across unadjusted genotypes. Contains same fields as other histogram structs. + - `alt_adj`: Struct containing DP information calculated using high-quality genotypes (variant carriers only). Contains same fields as other histogram structs. 
+ - `alt_raw`: Struct containing DP information calculated across unadjusted genotypes (variant carriers only). Contains same fields as other histogram structs. + - `genotype_quality`: Struct containing information used to display genotype quality (GQ) histograms. + - `all_adj`: Struct containing GQ information calculated using high-quality genotypes. Contains same fields as other histogram structs. + - `all_raw`: Struct containing GQ information calculated across unadjusted genotypes. Contains same fields as other histogram structs. + - `alt_adj`: Struct containing GQ information calculated using high-quality genotypes (variant carriers only). Contains same fields as other histogram structs. + - `alt_raw`: Struct containing GQ information calculated across unadjusted genotypes (variant carriers only). Contains same fields as other histogram structs. + - `site_quality_metrics`: Array containing site quality metric information. + - `metric`: Metric name (e.g., `inbreeding_coeff`). + - `value`: Metric value. +- `genome`: Struct containing information about this variant from genome data. Contains all the same fields as the exome data, with the exception that the subsets are (`all`, `hgdp`, `tgp`) instead of (`all`, `non_ukb`). +- `joint`: Struct containing information about this variant for the joint exome and genome data. + - `freq`: A struct containing variant frequency information. + - `all`: Struct containing variant frequency information calculated across the combined (joint) gnomAD exomes and genomes. Contains the same fields as exomes `freq.all` struct. + - `faf`: Array of combined exomes and genomes filtering allele frequency information. See `faf` description on the v4 Hail Tables [help page](/v4-hts#joint-faf). + - `fafmax`: Struct containing information about the maximum FAF. Contains same fields as exomes `fafmax.gnomad` struct. + - `grpmax`: Allele frequency information for the non-bottlenecked genetic ancestry group with the maximum allele frequency. 
See `grpmax` description on the v4 Hail Tables [help page](/v4-hts#joint-grpmax). + - `histograms`: Variant information histograms from the joint gnomAD exomes and genomes. See `histograms` description on the v4 Hail Tables [help page](v4-hts#joint-histograms). + - `qual_hists`: Genotype quality metric histograms for high quality genotypes. See v4 Hail Tables [help page](v4-hts#joint-histograms). + - `raw_qual_hists`: Genotype quality metric histograms for all genotypes as opposed to high quality genotypes. See v4 Hail Tables [help page](v4-hts#joint-histograms). + - `age_hists`: Histograms containing age information for release samples. See v4 Hail Tables [help page](v4-hts#joint-age-histograms) + - `flags`: Set containing flags about joint exome and genome data, possible values are [`discrepant_frequencies`, `not_called_in_exomes`, and `not_called_in_genomes`]. + - `freq_comparison_stats`: Struct containing results from contingency table and Cochran-Mantel-Haenszel tests comparing allele frequencies between the gnomAD exomes and genomes. See `freq_comparison_stats` description on the v4 Hail Tables [help page](/v4-hts#joint-freq-comparison-stats). +- `rsids`: dbSNP reference SNP identification (rsID) numbers. +- `in_silico_predictors`: Variant prediction annotations. Struct contains prediction scores from multiple in silico predictors. See `in_silico_predictors` description on the v4 Hail Tables [help page](v4-hts#in-silico-predictors). +- `variant_id`: gnomAD variant ID. +- `faf95_joint`: A struct containing joint (exome + genome) FAF information (95% CI). + - `grpmax`: Groupmax FAF value for all genetic ancestry groups across exomes + genomes. + - `grpmax_gen_anc`: Genetic ancestry group associated with the value `grpmax` above. +- `faf99_joint`: A struct containing joint (exome + genome) FAF (99% CI). Contains same fields as `faf95_joint`. 
+- `colocated_variants`: Array containing all variants (exome + genome) that are located at the same locus as this variant. +- `coverage`: Struct containing coverage information for locus. + - `exome`: Struct containing exome coverage information. + - `mean`: Mean depth of coverage at this locus. + - `median`: Median depth of coverage at this locus. + - `over_1`: Percentage of samples with a coverage greater than 1 at this locus. + - `over_5`: Percentage of samples with a coverage greater than 5 at this locus. + - `over_10`: Percentage of samples with a coverage greater than 10 at this locus. + - `over_15`: Percentage of samples with a coverage greater than 15 at this locus. + - `over_20`: Percentage of samples with a coverage greater than 20 at this locus. + - `over_25`: Percentage of samples with a coverage greater than 25 at this locus. + - `over_30`: Percentage of samples with a coverage greater than 30 at this locus. + - `over_50`: Percentage of samples with a coverage greater than 50 at this locus. + - `over_100`: Percentage of samples with a coverage greater than 100 at this locus. + - `genome`: Struct containing genome coverage information. Contains the same fields as `exome` above. +- `transcript_consequences`: Array containing variant transcript consequence information. + - `biotype`: Transcript biotype. + - `consequence_terms`: Array of predicted functional consequences. + - `domains`: Set containing protein domains affected by variant. + - `gene_id`: Unique ID of gene associated with transcript. + - `hgvsc`: HGVS coding sequence notation for variant. + - `hgvsp`: HGVS protein notation for variant. + - `is_canonical`: Whether transcript is the canonical transcript. + - `lof_filter`: Variant LoF filters (from [LOFTEE](https://github.com/konradjk/loftee)). + - `lof_flags`: LOFTEE flags. + - `lof`: Variant LOFTEE status (high confidence `HC` or low confidence `LC`). + - `major_consequence`: Primary consequence associated with transcript. 
+ - `transcript_id`: Unique transcript ID. + - `transcript_version`: Transcript version. + - `gene_version`: Gene version. + - `is_mane_select`: Whether transcript is the MANE select transcript. + - `is_mane_select_version`: MANE Select version; has a value if this transcript is the MANE select transcript. + - `refseq_id`: RefSeq ID associated with transcript. + - `refseq_version`: RefSeq version. +- `caid`: The ClinGen Allele ID associated with this variant. +- `vrs`: Struct containing information about this variant in accordance with the [Variation Representation Specification (VRS)](https://vrs.ga4gh.org/en/stable/) standard. + - `ref`: Struct containing information about the reference allele. + - `allele_id`: The unique Allele ID. + - `start`: The start position of the Allele. + - `end`: The end position of the Allele. + - `state`: A VRS Sequence Expression that corresponds to the nucleotide or amino acid sequence of the Allele. + +#### gnomAD v4.1 browser gene models Hail Table annotations + +Global fields: + +- `mane_select_version`: MANE Select version used to annotate variants (only present on GRCh38 Gene Models Hail Table). + +Row fields: + +- `interval`: Struct representing start and end positions of gene. +- `gene_id`: Unique Ensembl gene ID. +- `gene_version`: Gene version. +- `gencode_symbol`: GENCODE gene symbol. +- `chrom`: Chromosome in which gene is located. +- `strand`: Gene strand. +- `start`: Gene genomic start position (position only). +- `stop`: Gene genomic stop position (position only). +- `xstart`: Gene genomic start position (format: chromosomeposition). xstart can be calculated with ((chrom \* 10^9) + pos); note that chrX is encoded as 23, chrY as 24, and chrM as 25. For example, `1-55051215` becomes `1055051215`, and `X:9786429` becomes `23009786429`. +- `xstop`: Gene genomic stop position (format: chromosomeposition). +- `exons`: Array containing exon information for gene. + - `feature_type`: Exon type (e.g., CDS). 
+ - `start`: Exon genomic start position (position only). + - `stop`: Exon genomic stop position (position only). + - `xstart`: Exon genomic start position (format: chromosomeposition). + - `xstop`: Exon genomic stop position (format: chromosomeposition). +- `transcripts`: Array containing information about transcripts associated with the gene. + - `interval`: Struct representing the start and end positions of transcript. + - `transcript_id`: Unique transcript ID. + - `transcript_version`: Transcript version. + - `gene_id`: Unique gene ID. + - `gene_version`: Gene version. + - `chrom`: Chromosome in which transcript is located. + - `strand`: Transcript strand. + - `start`: Transcript genomic start position (position only). + - `stop`: Transcript genomic stop position (position only). + - `xstart`: Transcript genomic start position (format: chromosomeposition). + - `xstop`: Transcript genomic stop position (format: chromosomeposition). + - `exons`: Array containing transcript exon information. + - `feature_type`: Exon type (e.g., CDS). + - `start`: Exon genomic start position (position only). + - `stop`: Exon genomic stop position (position only). + - `xstart`: Exon genomic start position (format: chromosomeposition). + - `xstop`: Exon genomic stop position (format: chromosomeposition). + - `reference_genome`: Reference genome associated with this transcript. + - `refseq_id`: Transcript RefSeq ID. + - `refseq_version`: RefSeq version. +- `hgnc_id`: HGNC gene ID. +- `symbol`: Gene symbol. +- `name`: Gene name. +- `previous_symbols`: Set containing previous gene symbols. +- `alias_symbols`: Set containing alternate gene symbols. +- `omim_id`: Gene OMIM ID. +- `ncbi_id`: Gene NCBI ID. +- `symbol_upper_case`: All-caps gene symbol. +- `search_terms`: Set containing search terms associated with gene. +- `reference_genome`: Reference genome build associated with this gene. +- `flags`: Set containing gene flags for this gene. 
+- `canonical_transcript_id`: Canonical transcript ID. +- `mane_select_transcript`: Struct containing MANE Select transcript information. + - `matched_gene_version`: Version of the matched gene. + - `ensembl_id`: Transcript Ensembl ID. + - `ensembl_version`: Ensembl version. + - `refseq_id`: Transcript RefSeq ID. + - `refseq_version`: RefSeq version. +- `preferred_transcript_id`: Transcript shown on the gene page by default. Field contains MANE Select transcript ID if it exists, otherwise contains Ensembl canonical transcript ID. +- `preferred_transcript_source`: Source of transcript ID used for `preferred_transcript_id` field; either "`mane_select`" or "`ensembl_canonical`". +- `gnomad_constraint`: Struct containing gnomAD constraint information for gene. Struct is only present on the GRCh37 Hail Table. + - `gene`: Gene name. + - `transcript`: Transcript ID. + - `gene_id`: Unique gene ID. + - `exp_lof`: Expected number of rare (AF <= 0.1%) loss-of-function (LoF) variants. + - `exp_mis`: Expected number of rare missense variants. + - `exp_syn`: Expected number of rare synonymous variants. + - `obs_lof`: Observed number of rare loss-of-function variants. + - `obs_mis`: Observed number of rare missense variants. + - `obs_syn`: Observed number of rare synonymous variants. + - `oe_lof`: Observed/expected (OE) ratio for rare loss-of-function variants. + - `oe_lof_lower`: Lower bound of the OE ratio for rare loss-of-function variants. + - `oe_lof_upper`: Upper bound of the OE ratio (LOEUF) for rare loss-of-function variants. + - `oe_mis`: OE ratio for rare missense variants. + - `oe_mis_lower`: Lower bound of the OE ratio for rare missense variants. + - `oe_mis_upper`: Upper bound of the OE ratio for rare missense variants. + - `oe_syn`: OE ratio for rare synonymous variants. + - `oe_syn_lower`: Lower bound of the OE ratio for rare synonymous variants. + - `oe_syn_upper`: Upper bound of the OE ratio for rare synonymous variants. + - `lof_z`: Z-score for rare loss-of-function variants. 
+ - `mis_z`: Z-score for rare missense variants. + - `syn_z`: Z-score for rare synonymous variants. + - `pli`: Probability of being loss-of-function intolerant (pLI) score. + - `flags`: Set containing constraint flags for transcript. +- `heterozygous_variant_cooccurrence_counts`: Array containing information about heterozygous variant co-occurrence counts. Struct is only present on GRCh37 Hail Table. + - `csq`: Variant consequence. + - `af_cutoff`: Allele frequency cutoff. + - `data`: Struct containing variant co-occurrence data. + - `in_cis`: Count of variants in cis. + - `in_trans`: Count of variants in trans. + - `unphased`: Count of unphased variants. + - `two_het_total`: Total count of two heterozygous variants. +- `homozygous_variant_cooccurrence_counts`: Array containing information about homozygous variant co-occurrence counts. Struct is only present on GRCh37 Hail Table. + - `csq`: Variant consequence. + - `af_cutoff`: Allele frequency cutoff. + - `data`: Struct containing variant co-occurrence data. + - `hom_total`: Total count of homozygous variants. diff --git a/browser/help/topics/v4-hts.md b/browser/help/topics/v4-hts.md index 4384df73b..3cd54a347 100644 --- a/browser/help/topics/v4-hts.md +++ b/browser/help/topics/v4-hts.md @@ -25,15 +25,15 @@ The '`freq`' annotation is an array, and each element of the array is a struct t Use the '`freq_index_dict`' global annotation to retrieve frequency information for a specific group of samples from the '`freq`' array. This global annotation is a dictionary keyed by sample grouping combinations whose values are the combination's index in the '`freq`' array. The groupings and their available options by version are listed in the table below. 
-| Category | Definition | Exome Options | Genome Options | Joint (combined exome + genome) Options | -| ---------------------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `group` | Genotype's filter | adj1, raw | adj1, raw | adj1, raw | -| `sex` | Inferred sex/sex karyotype | XX, XY | XX, XY | XX, XY | -| `subset` | Sample subsets within release | non-UK Biobank (Download only) | HGDP, 1KG (Download Hail Table only) | N/A | -| `gen_anc` | gnomAD inferred genetic ancestry group | `afr`, `amr`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, `sas` | `afr`, `ami`, `amr`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, | `afr`, `amr`, `ami`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, `sas` | -| `gen_anc` (1KG subset only)2 | The 1KG project's ancestry | N/A | `acb`, `asw`, `beb`, `cdx`, `ceu`, `chb`, `chs`, `clm`, `esn`, `fin`, `gbr`, `gih`, `gwd`, `ibs`, `itu`, `jpt`, `khv`, `lwk`, `msl`, `mxl`, `pel`, `pjl`, `pur`, `stu`, `tsi`, `yri` | N/A | -| `gen_anc` (HGDP subset only)2 | 
The HGDP's ancestry labels | N/A | adygei, balochi, bantukenya, bantusafrica, basque, bedouin, biakapygmy, brahui, burusho, cambodian, colombian, dai, daur, druze, french, han, hazara, hezhen, italian, japanese, kalash, karitiana, lahu, makrani, mandenka, maya, mbutipygmy, melanesian, miaozu, mongola, mozabite, naxi, orcadian, oroqen, palestinian, papuan, pathan, pima, russian, san, sardinian, she, sindhi, surui, tu, tujia, tuscan, uygur, xibo, yakut, yizu, yoruba | N/A | -| `downsampling`3 | Downsampled sample counts | gnomAD: 10, 100, 500, 1000, 2000, 2884, 5000, 10000, 13068, 16740, 19850, 20000, 22362, 26710, 30198, 43129, 50000, 100000, 200000, 500000, 556006, non-UKB: 10, 100, 500, 1000, 2000, 2074, 5000, 8847, 10000, 10492, 16549, 18035, 20000, 21870, 26572, 34899, 50000, 100000, 175054, 200000 | The genomes release Hail Table does not contain downsampling information. | The joint frequencies Hail Table does not contain downsampling information.| +| Category | Definition | Exome Options | Genome Options | Joint (combined exome + genome) Options | +| ---------------------------------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- | +| `group` | Genotype's filter | adj1, 
raw | adj1, raw | adj1, raw | +| `sex` | Inferred sex/sex karyotype | XX, XY | XX, XY | XX, XY | +| `subset` | Sample subsets within release | non-UK Biobank (Download only) | HGDP, 1KG (Download Hail Table only) | N/A | +| `gen_anc` | gnomAD inferred genetic ancestry group | `afr`, `amr`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, `sas` | `afr`, `ami`, `amr`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, | `afr`, `amr`, `ami`, `asj`, `eas`, `fin`, `mid`, `nfe`, `rmi`, `sas` | +| `gen_anc` (1KG subset only)2 | The 1KG project's ancestry | N/A | `acb`, `asw`, `beb`, `cdx`, `ceu`, `chb`, `chs`, `clm`, `esn`, `fin`, `gbr`, `gih`, `gwd`, `ibs`, `itu`, `jpt`, `khv`, `lwk`, `msl`, `mxl`, `pel`, `pjl`, `pur`, `stu`, `tsi`, `yri` | N/A | +| `gen_anc` (HGDP subset only)2 | The HGDP's ancestry labels | N/A | adygei, balochi, bantukenya, bantusafrica, basque, bedouin, biakapygmy, brahui, burusho, cambodian, colombian, dai, daur, druze, french, han, hazara, hezhen, italian, japanese, kalash, karitiana, lahu, makrani, mandenka, maya, mbutipygmy, melanesian, miaozu, mongola, mozabite, naxi, orcadian, oroqen, palestinian, papuan, pathan, pima, russian, san, sardinian, she, sindhi, surui, tu, tujia, tuscan, uygur, xibo, yakut, yizu, yoruba | N/A | +| `downsampling`3 | Downsampled sample counts | gnomAD: 10, 100, 500, 1000, 2000, 2884, 5000, 10000, 13068, 16740, 19850, 20000, 22362, 26710, 30198, 43129, 50000, 100000, 200000, 500000, 556006, non-UKB: 10, 100, 500, 1000, 2000, 2074, 5000, 8847, 10000, 10492, 16549, 18035, 20000, 21870, 26572, 34899, 50000, 100000, 175054, 200000 | The genomes release Hail Table does not contain downsampling information. | The joint frequencies Hail Table does not contain downsampling information. 
| #### Version 4.1 sample grouping combinations and '`freq`' array access @@ -171,7 +171,7 @@ Row fields: - `a_index`: The original index of this alternate allele in the multiallelic representation (1 is the first alternate allele or the only alternate allele in a biallelic variant). - `was_split`: True if this variant was originally multiallelic, otherwise False. - `rsid`: dbSNP reference SNP identification (rsID) numbers. -- `filters`: Variant filters; AC0: Allele count is zero after filtering out low-confidence genotypes (GQ < 20; DP < 10; and AB < 0.2 for het calls), AS_VQSR: Failed allele-specific VQSR filtering thresholds of -4.0598 for SNPs and 0.1078 for indels, InbreedingCoeff: GATK InbreedingCoeff < -0.3. An empty set in this field indicates that the variant passed all variant filters. +- `filters`: Variant filters; AC0: Allele count is zero after filtering out low-confidence genotypes (GQ < 20; DP < 10; and AB < 0.2 for het calls), AS_VQSR: Failed allele-specific VQSR filtering thresholds of -4.0598 for SNPs and 0.1078 for indels, InbreedingCoeff: GATK InbreedingCoeff < -0.3. An empty set in this field indicates that the variant passed all variant filters. - `info`: Struct containing typical GATK allele-specific (AS) info fields and additional variant QC fields. - `FS`: Phred-scaled p-value of Fisher's exact test for strand bias. - `MQ`: Root mean square of the mapping quality of reads across all samples. @@ -219,7 +219,7 @@ Row fields: - `has_star`: Variant type included an upstream deletion. - `allele_type`: Allele type (snv, insertion, deletion, or mixed). - `was_mixed`: Variant type was mixed. -- `region_flags`: Struct containing flags about regions. +- `region_flags`: Struct containing flags about regions. - `non_par`: Variant falls within a non-pseudoautosomal region. - `lcr`: Variant falls within a low complexity region. - `segdup`: Variant falls within a segmental duplication region. 
@@ -289,7 +289,7 @@ Row fields: - `bin_freq`: Bin frequencies for the age histogram. This is the number of records found in each bin. - `n_smaller`: Count of age values falling below lowest histogram bin edge. - `n_larger`: Count of age values falling above highest histogram bin edge. -- `in_silico_predictors`: Variant prediction annotations. Struct contains prediction scores from multiple in silico predictors for variants that are predicted to be missense, impacting protein function, evolutionarily conserved, or splice-altering. We chose scores for either MANE Select or canonical transcripts if a prediction score was available for multiple transcripts. +- `in_silico_predictors`: Variant prediction annotations. Struct contains prediction scores from multiple in silico predictors for variants that are predicted to be missense, impacting protein function, evolutionarily conserved, or splice-altering. We chose scores for either MANE Select or canonical transcripts if a prediction score was available for multiple transcripts. - `cadd`: [Score](https://academic.oup.com/nar/article/47/D1/D886/5146191) used to predict deleteriousness of SNVs and indels. - `phred`: CADD Phred-like scaled C-scores ranging from 1 to 99 based on the rank of each variant relative to all possible 8.6 billion substitutions in the human reference genome. Larger values indicate increased predicted deleteriousness. - `raw_score`: Unscaled CADD scores indicating whether a variant is likely to be "observed" (negative values) vs "simulated" (positive values). Larger values indicate increased predicted deleteriousness. @@ -321,6 +321,7 @@ Row fields #### gnomAD v4.1 joint frequency Hail Table annotations The v4.1 joint (combined exomes + genomes) frequency Hail Table only contains frequencies for the following groupings: + - `group` - `sex` (`adj`1 only) - `gen_anc` (`adj`1 only) @@ -332,7 +333,7 @@ Global fields - `exomes_globals`: Global fields from the gnomAD exomes. 
- `freq_meta`: Allele frequency metadata for the gnomAD exomes. An ordered list containing the frequency aggregation group for each element of the `exomes.freq` array row annotation. - - `freq_index_dict`: Dictionary keyed by specified label grouping combinations (group: adj/raw, gen_anc: gnomAD inferred genetic ancestry group [adj only], sex: sex karyotype [adj only]), with values describing the corresponding index of each grouping entry in the `exomes.freq` array row annotation. + - `freq_index_dict`: Dictionary keyed by specified label grouping combinations (group: adj/raw, gen_anc: gnomAD inferred genetic ancestry group [adj only], sex: sex karyotype [adj only]), with values describing the corresponding index of each grouping entry in the `exomes.freq` array row annotation. - `freq_meta_sample_count`: A sample count per sample grouping defined in the exomes `exomes.freq_meta` global annotation. - `faf_meta`: Filtering allele frequency metadata for the gnomAD exomes. An ordered list containing the frequency aggregation group for each element of the `exomes.faf` array row annotation. - `faf_index_dict`: Dictionary keyed by specified label grouping combinations (group: adj/raw, gen_anc: gnomAD inferred genetic ancestry group, sex: sex karyotype), with values describing the corresponding index of each grouping entry in the filtering allele frequency (`exomes.faf`) row annotation. @@ -550,16 +551,16 @@ Row fields - `AF`: Combined (exomes + genomes) alternate allele frequency, (AC/AN), in release. - `AN`: Total number of alleles across exomes and genomes in release. - `homozygote_count`: Count of homozygous alternate individuals across exomes and genomes in release. - - `grpmax`: Allele frequency information (AC, AN, AF, homozygote count) for the non-bottlenecked genetic ancestry group with maximum allele frequency across both exomes and genomes. 
Excludes Amish (`ami`), Ashkenazi Jewish (`asj`), European Finnish (`fin`), and "Remaining individuals" (`remaining`) groups. + - `grpmax`: Allele frequency information (AC, AN, AF, homozygote count) for the non-bottlenecked genetic ancestry group with maximum allele frequency across both exomes and genomes. Excludes Amish (`ami`), Ashkenazi Jewish (`asj`), European Finnish (`fin`), and "Remaining individuals" (`remaining`) groups. - `AC`: Alternate allele count in the group with the maximum allele frequency. - `AF`: Maximum alternate allele frequency, (AC/AN), across groups in gnomAD. - `AN`: Total number of alleles in the group with the maximum allele frequency. - `homozygote_count`: Count of homozygous individuals in the group with the maximum allele frequency. - `gen_anc`: Genetic ancestry group with maximum allele frequency. - `faf`: Array of combined exomes and genomes filtering allele frequency information (AC, AN, AF, homozygote count). Note that the values in the array will correspond to the joint or combined value if the variant had a defined filtering allele frequency in both data types, otherwise this array will contain filtering allele frequencies only for the data type associated with the Hail Table (in this case, exomes). + - `faf`: Array of combined exomes and genomes filtering allele frequency information (AC, AN, AF, homozygote count). Note that the values in the array will correspond to the joint or combined value if the variant had a defined filtering allele frequency in both data types, otherwise this array will contain filtering allele frequencies only for the data type associated with the Hail Table (in this case, exomes). - `faf95`: Combined exomes and genomes filtering allele frequency (using Poisson 95% CI). - `faf99`: Combined exomes and genomes filtering allele frequency (using Poisson 99% CI). - `histograms`: Variant information histograms of the combined (joint) gnomAD exomes and genomes.
+ - `histograms`: Variant information histograms of the combined (joint) gnomAD exomes and genomes. - `qual_hists`: Genotype quality metric histograms for high quality genotypes. - `gq_hist_all`: Histogram for GQ calculated on high quality genotypes. - `bin_edges`: Bin edges for the GQ histogram calculated on high quality genotypes are: 0|5|10|15|20|25|30|35|40|45|50|55|60|65|70|75|80|85|90|95|100. @@ -612,7 +613,7 @@ Row fields - `bin_freq`: Bin frequencies for the histogram of AB in heterozygous individuals calculated on all genotypes. The number of records found in each bin. - `n_smaller`: Count of AB values in heterozygous individuals falling below lowest histogram bin edge, calculated on all genotypes. - `n_larger`: Count of AB values in heterozygous individuals falling above highest histogram bin edge, calculated on all genotypes. - - `age_hists`: Histograms containing age information for release samples. + - `age_hists`: Histograms containing age information for release samples. - `age_hist_het`: Histogram for age in all heterozygous release samples calculated on high quality genotypes. - `bin_edges`: Bin edges for the age histogram. - `bin_freq`: Bin frequencies for the age histogram. This is the number of records found in each bin. @@ -623,14 +624,14 @@ Row fields - `bin_freq`: Bin frequencies for the age histogram. This is the number of records found in each bin. - `n_smaller`: Count of age values falling below lowest histogram bin edge. - `n_larger`: Count of age values falling above highest histogram bin edge. -- `freq_comparison_stats`: Struct containing results from contingency table and Cochran-Mantel-Haenszel tests comparing allele frequencies between the gnomAD exomes and genomes. +- `freq_comparison_stats`: Struct containing results from contingency table and Cochran-Mantel-Haenszel tests comparing allele frequencies between the gnomAD exomes and genomes. 
- `contingency_table_test`: Array of results from Hail's [`contingency_table_test`](https://hail.is/docs/0.2/functions/stats.html#hail.expr.functions.contingency_table_test) with `min_cell_count=100` comparing allele frequencies between exomes and genomes. Each element in the array corresponds to the comparison of a specific frequency aggregation group defined by the `joint.freq_meta` global field. - `odds_ratio`: Odds ratio from the contingency table test. - `p_value`: P-value from the contingency table test. - `cochran_mantel_haenszel_test`: Results from Hail's [`cochran_mantel_haenszel_test`](https://hail.is/docs/0.2/functions/stats.html#hail.expr.functions.cochran_mantel_haenszel_test) comparing allele frequencies between exomes and genomes stratified by genetic ancestry group `gen_anc`, excluding Amish (`ami`), Ashkenazi Jewish (`asj`), European Finnish (`fin`), and "Remaining individuals" (`remaining`) groups. The test is performed using the Cochran-Mantel-Haenszel test, a stratified test of independence for 2x2xK contingency tables. - `chisq`: Chi-squared test statistic from the Cochran-Mantel-Haenszel test. - `p_value`: P-value from the Cochran-Mantel-Haenszel test. - - `stat_union`: Struct containing the selected results from the contingency table and Cochran-Mantel-Haenszel tests comparing allele frequencies between exomes and genomes. When the variant is observed in only one inferred genetic ancestry group, the results from `contingency_table_test` are used. When there are multiple genetic ancestry groups, the results from `cochran_mantel_haenszel_test` are used. Excludes Amish (`ami`), Ashkenazi Jewish (`asj`), European Finnish (`fin`), and "Remaining individuals" (`remaining`) groups. If `stat_test_name` in the `stat_union` struct is `contingency_table_test`, the value of `p_value` in the `stat_union` struct is equal to `freq_comparison_stats.contingency_table_test`[`joint_globals.freq_meta`.index(`gen_ancs`[0])].`p_value`.
If `stat_test_name` is `cochran_mantel_haenszel_test`, the value of `p_value` in the `stat_union` struct is equal to `freq_comparison_stats.cochran_mantel_haenszel_test`. + - `stat_union`: Struct containing the selected results from the contingency table and Cochran-Mantel-Haenszel tests comparing allele frequencies between exomes and genomes. When the variant is observed in only one inferred genetic ancestry group, the results from `contingency_table_test` are used. When there are multiple genetic ancestry groups, the results from `cochran_mantel_haenszel_test` are used. Excludes Amish (`ami`), Ashkenazi Jewish (`asj`), European Finnish (`fin`), and "Remaining individuals" (`remaining`) groups. If `stat_test_name` in the `stat_union` struct is `contingency_table_test`, the value of `p_value` in the `stat_union` struct is equal to `freq_comparison_stats.contingency_table_test`[`joint_globals.freq_meta`.index(`gen_ancs`[0])].`p_value`. If `stat_test_name` is `cochran_mantel_haenszel_test`, the value of `p_value` in the `stat_union` struct is equal to `freq_comparison_stats.cochran_mantel_haenszel_test`. - `p_value`: p-value from the contingency table or Cochran-Mantel-Haenszel tests. - `stat_test_name`: Name of the test used to compare allele frequencies between exomes and genomes. Options are `contingency_table_test` and `cochran_mantel_haenszel_test`. - `gen_ancs`: List of genetic ancestry groups included in the test. If `stat_test_name` is `contingency_table_test`, the length of `gen_ancs` is one and if `stat_test_name` is `cochran_mantel_haenszel_test`, the length of `gen_ancs` is greater than one. diff --git a/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap b/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap index 4ea2c08c1..4c27a5cdd 100644 --- a/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap +++ b/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap @@ -760,6 +760,15 @@ exports[`Help Page has no unexpected changes 1`] = ` v4-hts +