Merge branch 'jg/update_combined_freq_stats_with_union' of https://github.com/broadinstitute/gnomad-browser into jg/update_combined_freq_stats_with_union
ch-kr committed Apr 17, 2024
2 parents 92c8bfd + bb9f58c commit 5e0fa28
Showing 1 changed file with 8 additions and 7 deletions.
15 changes: 8 additions & 7 deletions browser/help/topics/combined-freq-stats.md
@@ -3,23 +3,24 @@ id: combined-freq-stats
title: 'Combined genomes and exomes frequency statistics'
---

**Contingency Table Test (Chi-squared or Fisher's exact test):**
### <a id="contingency_table_test"></a> Contingency Table Test (Chi-squared or Fisher's exact test)

We applied Hail's [contingency_table_test](https://hail.is/docs/0.2/functions/stats.html#hail.expr.functions.contingency_table_test) to a 2x2 table representing allele counts (AC) and allele numbers (AN) for both exomes and genomes. The minimum cell count in the contingency table determines whether a chi-squared or Fisher's exact test is used. We used a threshold of `min_cell_count=100`, meaning that if all cell counts in the 2x2 table were at least 100, we used a chi-squared test; otherwise, we used Fisher's exact test. We generated odds ratios and p-values for all variants in both the gnomAD exomes and genomes.
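
The test selection can be sketched in plain Python. This is only an illustration, not the Hail implementation: it follows Hail's documented rule (chi-squared when every cell is at least `min_cell_count`, Fisher's exact test otherwise). The helper names, the table layout of AC versus AN − AC per data type, and the absence of a continuity correction in the chi-squared branch are all assumptions made for this example.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def chi_squared_p(a, b, c, d):
    """Pearson chi-squared p-value for a 2x2 table (1 df, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of all tables as or less likely than observed."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def log_pmf(x):
        # Hypergeometric log-probability that the top-left cell equals x.
        return _log_comb(col1, x) + _log_comb(n - col1, row1 - x) - _log_comb(n, row1)

    observed = log_pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p = sum(math.exp(lp) for lp in map(log_pmf, range(lo, hi + 1)) if lp <= observed + 1e-7)
    return min(p, 1.0)


def contingency_table_p_value(a, b, c, d, min_cell_count=100):
    """Return (test_name, p_value), choosing the test from the smallest cell count."""
    if min(a, b, c, d) >= min_cell_count:
        return "chi_squared", chi_squared_p(a, b, c, d)
    return "fisher_exact", fisher_exact_p(a, b, c, d)


def exome_genome_test(ac_exomes, an_exomes, ac_genomes, an_genomes):
    """Hypothetical wrapper: build the 2x2 table as AC vs. AN - AC for each data type."""
    return contingency_table_p_value(
        ac_exomes, an_exomes - ac_exomes, ac_genomes, an_genomes - ac_genomes
    )
```

With this layout, a rare variant (for example AC = 3 in one data type) falls below the threshold of 100 in at least one cell and is routed to Fisher's exact test, while common variants use the chi-squared approximation.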

**Cochran–Mantel–Haenszel (CMH) Test:**
### <a id="cmh_test"></a> Cochran–Mantel–Haenszel (CMH) Test

This stratified test of independence is applied to 2x2xK contingency tables, where K represents the number of strata (in this case, inferred genetic ancestry groups). The CMH test provides a way to assess variant frequency differences between exomes and genomes while controlling for population structure, offering a more nuanced understanding of the discrepancies observed. The CMH test is computed using the [stats.contingency_tables.StratifiedTable](https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.StratifiedTable.html) function from [statsmodels](https://www.statsmodels.org/stable/index.html), which outputs a chi-squared test statistic and corresponding p-value. **Note** that we use results from the CMH test and not the contingency table test to flag variants with highly discordant frequencies between the exomes and genomes.
This stratified [test](https://en.wikipedia.org/wiki/Cochran%E2%80%93Mantel%E2%80%93Haenszel_statistics) of independence is applied to 2x2xK contingency tables, where K represents the number of strata (in this case, inferred genetic ancestry groups). The CMH test provides a way to assess variant frequency differences between exomes and genomes while controlling for population structure, offering a more nuanced understanding of the discrepancies observed. The CMH test is computed using Hail's [cochran_mantel_haenszel_test](https://hail.is/docs/0.2/functions/stats.html#hail.expr.functions.cochran_mantel_haenszel_test), which outputs a chi-squared test statistic and corresponding p-value.
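
As a rough sketch of what the CMH statistic computes, the pure-Python function below pools the observed-minus-expected counts of the top-left cell across strata and divides the squared sum by the pooled variance. It is an illustration rather than Hail's code: the function name and argument layout (one list per cell, one entry per genetic ancestry group) are chosen for this example, and the 0.5 continuity correction is an assumption.

```python
import math


def cmh_test(a, b, c, d):
    """Cochran-Mantel-Haenszel test on K 2x2 tables.

    a, b, c, d are equal-length sequences; stratum k has the table
    [[a[k], b[k]], [c[k], d[k]]]. Returns (test_statistic, p_value).
    """
    delta = 0.0     # sum over strata of (observed - expected) for the top-left cell
    variance = 0.0  # sum over strata of the hypergeometric variance of that cell
    for ak, bk, ck, dk in zip(a, b, c, d):
        n = ak + bk + ck + dk
        if n < 2:
            continue  # a stratum with fewer than two observations carries no information
        row1, row2 = ak + bk, ck + dk
        col1, col2 = ak + ck, bk + dk
        delta += ak - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return float("nan"), float("nan")
    # Continuity correction of 0.5 (an assumption), clamped so it never goes negative.
    corrected = max(abs(delta) - 0.5, 0.0)
    stat = corrected ** 2 / variance
    # Chi-squared survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Illustrative counts for four strata (not gnomAD data).
stat, p_value = cmh_test(
    a=[56, 61, 73, 71],
    b=[69, 257, 65, 48],
    c=[40, 57, 71, 55],
    d=[77, 301, 79, 48],
)
print(f"CMH statistic = {stat:.4f}, p-value = {p_value:.4g}")
```

Because each stratum is compared against its own expected count, a frequency difference that only reflects different ancestry group proportions in the exomes and genomes does not inflate the statistic.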

**Variant Warnings Based on CMH Test Results:**
### Variant Warnings Based on Contingency Table and CMH Test Results

In gnomAD v4.1, we add a warning to variants exhibiting highly discordant frequencies between the exomes and genomes. By leveraging CMH test statistics, we've pinpointed variants where the CMH p-value is less than 10<sup>-4</sup> and have flagged these variants for users' attention. About 4% of variants (2,486,726 out of 57,553,936) exhibit statistically different frequencies between the two data types at this threshold.
In gnomAD v4.1, we add a warning to variants exhibiting highly discordant frequencies between the exomes and genomes. For variants observed in a single inferred genetic ancestry group, we use the [contingency table test](/help/combined-freq-stats#contingency_table_test) on allele counts in that genetic ancestry group. Otherwise, in cases where the variant is present in multiple genetic ancestry groups, we use the [CMH test](/help/combined-freq-stats#cmh_test) to compare variant frequencies while accounting for differences driven by inferred genetic ancestry group structure in the datasets. By leveraging these test statistics, we've pinpointed variants where the contingency table test or CMH p-value is less than 10<sup>-4</sup> and have flagged these variants for users' attention. About 2.4% of variants (2,230,151 out of 91,177,483) exhibit statistically different frequencies between the two data types at this threshold.

The expected number of variants to reach this threshold by chance is 5,800 (out of 58 million total variants shared between the exomes and genomes). Observing approximately 430 times more variants than expected highlights the robustness of our approach. The CMH p-value distribution further supports the validity of our warnings, showing minimal baseline inflation and underscoring the significance of flagged variants.
The expected number of variants to reach this threshold by chance is 9,100 (out of 91 million total variants shared between the exomes and genomes). Observing approximately 245 times more variants than expected highlights the robustness of our approach. The p-value distribution further supports the validity of our warnings, showing minimal baseline inflation and underscoring the significance of flagged variants.
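
The arithmetic behind these figures can be checked in a few lines, using the variant counts and p-value threshold quoted above. The variable names are chosen for this example, and the calculation assumes that p-values are uniformly distributed under the null hypothesis.

```python
P_VALUE_THRESHOLD = 1e-4       # flagging threshold quoted in the text
total_variants = 91_177_483    # variants tested, from the text
flagged_variants = 2_230_151   # variants below the threshold, from the text

# Under the null, a fraction P_VALUE_THRESHOLD of tests falls below the threshold by chance.
expected_by_chance = total_variants * P_VALUE_THRESHOLD
flagged_fraction = flagged_variants / total_variants
fold_enrichment = flagged_variants / expected_by_chance

print(f"expected by chance: {expected_by_chance:,.0f}")
print(f"flagged fraction:   {flagged_fraction:.2%}")
print(f"fold enrichment:    {fold_enrichment:.0f}x")
```

Using the unrounded total gives an expectation slightly above the rounded 9,100 in the text, and the fold enrichment still rounds to about 245.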

# TODO: Change image
<img src="cmh-pval.png" alt= "CMH p-value distribution" width="50%" height="50%">

**Why have we added these statistical tests**
### Why have we added these statistical tests

For the first time, in gnomAD v4.0, we released a combined filtering allele frequency (FAF), integrating variant frequencies across the 734,947 exomes and 76,215 genomes. This integration brings the advantage of a larger, more diverse sample set but also introduces challenges. These challenges stem from differences in sequencing and processing methodologies, as well as variations in sample composition due to ascertainment biases. Addressing these challenges is crucial for providing accurate and reliable genetic insights.
