Summary stats JSON file

martinghunt edited this page Oct 9, 2020 · 2 revisions

A summary of the results is written to the file summary_stats.json. This page describes the contents of that file.

Methods

To understand the contents of the file, it is necessary to understand a little about the methods used by varifier. These are briefly described below, but will be documented fully in the future.

Precision

Varifier decides whether or not a single variant is correct using "probe mapping" to a truth genome (TO DO: describe elsewhere). It uses three different definitions of correct:

  1. Binary measure: the allele must be completely correct to be a true positive, otherwise it is a false positive.
  2. Fractional measure: using the proportion of the allele that is correct. This results in a value from 0 (completely wrong) to 1 (completely correct).
  3. Edit distance measure: using edit distances between the called allele and the truth, between the called allele and the reference, and between the reference allele and the truth. This also results in a value from 0 to 1, but only takes into account the parts of the allele that are different, unlike method 2. (TO DO: describe elsewhere).

Recall

Varifier generates a "truth set" of variants by comparing the truth and reference genomes (TO DO: describe elsewhere). It then applies the input variants being evaluated to the reference genome to make a new "mutated" genome, which represents all the changes in the input VCF file. To calculate recall, probe mapping is run using the same methods as for precision, but this time the truth set variants are tested and the mutated genome plays the role of the truth genome. Perfect recall requires every truth variant to be identified as a true positive when probe mapped against the mutated genome.

As for precision, varifier uses three different measures (binary, fractional, and edit distance) to calculate the recall.

VCF filtering and ignored records

Varifier calculates precision and recall for all records in the input VCF file that pass filtering. This filtering is controlled by the options --filter_pass and --use_ref_calls. By default, all non-ref genotype calls are used. Regardless of the options, calls with no genotype (GT not present) and heterozygous calls are never used: they are treated as if they were never in the input file when calculating precision and recall. Records can also be removed for other reasons that mean they cannot be evaluated, such as the REF column not matching the reference genome. The number of removed records is reported in the output file.
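The exclusion rules above can be sketched in Python as follows. This is only an illustration, not varifier's implementation: the function name, the record representation, the order in which the checks are applied, and the way the two options are modelled (filter_pass as a set of accepted FILTER values, use_ref_calls as a boolean) are all assumptions. The "other" category, such as a REF mismatch, is not modelled.

```python
def exclusion_reason(filters, genotype, filter_pass=None, use_ref_calls=False):
    """Return a reason for excluding a record, or None if it would be evaluated.

    filters:       set of FILTER column entries for the record
    genotype:      tuple of allele indexes from GT, or None if GT is absent
    filter_pass:   set of accepted FILTER values; None skips the FILTER check
                   (assumed model of --filter_pass)
    use_ref_calls: keep ref calls when True (assumed model of --use_ref_calls)
    """
    # Never used, regardless of the options.
    if genotype is None:
        return "no_genotype"
    if len(set(genotype)) > 1:
        return "heterozygous"
    # Option-dependent checks; their order here is an assumption.
    if filter_pass is not None and not (filters & filter_pass):
        return "filter_fail"
    if not use_ref_calls and set(genotype) == {0}:
        return "ref_call"
    return None


print(exclusion_reason({"PASS"}, (1, 1)))  # homozygous non-ref call
print(exclusion_reason({"PASS"}, (0, 1)))  # heterozygous call
print(exclusion_reason({"PASS"}, (0, 0)))  # ref call with default options
```

The returned strings match the keys of the Excluded_record_counts dictionary in the output JSON, so a tally of these reasons would have the same shape as that dictionary.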

Output JSON file

The output JSON contains the dictionaries Excluded_record_counts, Precision, and Recall.

Example Excluded_record_counts dictionary, which should be self-explanatory:

"Excluded_record_counts": {
    "filter_fail": 42,
    "heterozygous": 1,
    "no_genotype": 3,
    "other": 1,
    "ref_call": 300
}
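
As a quick sanity check, the counts can be totalled to see how many records were left out of the evaluation. The following is a minimal Python sketch that parses the example values from an inline string rather than reading summary_stats.json from disk. The helper name total_excluded is hypothetical and is not part of varifier.

```python
import json

# Example values copied from the Excluded_record_counts dictionary above.
EXAMPLE = """
{
  "Excluded_record_counts": {
    "filter_fail": 42,
    "heterozygous": 1,
    "no_genotype": 3,
    "other": 1,
    "ref_call": 300
  }
}
"""


def total_excluded(stats):
    """Sum the per-reason counts of excluded VCF records."""
    return sum(stats["Excluded_record_counts"].values())


stats = json.loads(EXAMPLE)
print(total_excluded(stats))
```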

Example precision dictionary:

"Precision": 0.96862745,
"Precision_edit_dist": 0.9629981,
"Precision_frac": 0.97070057,
"TP": {
  "Count": 988,
  "SUM_ALLELE_MATCH_FRAC": 988.0,
  "SUM_EDIT_DIST": 1015
},
"FP": {
  "Count": 32,
  "SUM_ALLELE_MATCH_FRAC": 2.11458,
  "SUM_EDIT_DIST": 39
},
"EDIT_DIST_COUNTS": {
  "denominator": 1054,
  "numerator": 1015
}

The overall precision using each of the three methods is given by Precision, Precision_frac, and Precision_edit_dist, as a value between 0 and 1.

The TP and FP sections have the following keys/values:

  • Count: the number of variants that were a TP or FP. Here, a variant counts as a TP only if its allele is completely correct; otherwise it is an FP.
  • SUM_ALLELE_MATCH_FRAC: the sum of the fraction of each allele that was correct (i.e. the sum of the "fractional measures").
  • SUM_EDIT_DIST: the sum of edit distances between each called allele and the reference genome. (This is not used in any precision/recall calculations and may be removed in the future).
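
To illustrate how Count and SUM_ALLELE_MATCH_FRAC relate to per-variant results, here is a small Python sketch. It assumes each variant has already been given a fractional measure and that a value of exactly 1 means the allele is completely correct. The function name tally and the input scores are made up for illustration, and SUM_EDIT_DIST is left out.

```python
def tally(fractions):
    """Accumulate TP/FP entries from per-variant fractional measures."""
    result = {
        "TP": {"Count": 0, "SUM_ALLELE_MATCH_FRAC": 0.0},
        "FP": {"Count": 0, "SUM_ALLELE_MATCH_FRAC": 0.0},
    }
    for frac in fractions:
        # Assumption: a fractional measure of exactly 1 is a completely
        # correct allele (TP); anything lower is an FP.
        key = "TP" if frac == 1.0 else "FP"
        result[key]["Count"] += 1
        result[key]["SUM_ALLELE_MATCH_FRAC"] += frac
    return result


# Made-up fractional measures for four variants.
example = tally([1.0, 1.0, 0.5, 0.0])
print(example)
```

Under this assumption the TP sum always equals the TP count, which matches the example above (Count 988, SUM_ALLELE_MATCH_FRAC 988.0).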

The EDIT_DIST_COUNTS section contains the numerator and denominator for the edit distance method of calculating precision (e.g. 1015 / 1054 = 0.9629981).
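
The three overall values can be recomputed from the example dictionary, as in the Python sketch below. Only the edit distance calculation is stated on this page. The formulas used for Precision and Precision_frac are inferred from the example numbers, which they reproduce, and should be read as an assumption. The function name recompute_precision is hypothetical.

```python
# Values copied from the example Precision dictionary above.
precision_section = {
    "TP": {"Count": 988, "SUM_ALLELE_MATCH_FRAC": 988.0, "SUM_EDIT_DIST": 1015},
    "FP": {"Count": 32, "SUM_ALLELE_MATCH_FRAC": 2.11458, "SUM_EDIT_DIST": 39},
    "EDIT_DIST_COUNTS": {"denominator": 1054, "numerator": 1015},
}


def recompute_precision(section):
    """Return (binary, fractional, edit distance) precision for one section."""
    tp, fp = section["TP"], section["FP"]
    total = tp["Count"] + fp["Count"]
    # Binary measure: completely correct alleles over all evaluated alleles.
    binary = tp["Count"] / total
    # Fractional measure: mean per-allele fraction correct (inferred formula).
    frac = (tp["SUM_ALLELE_MATCH_FRAC"] + fp["SUM_ALLELE_MATCH_FRAC"]) / total
    # Edit distance measure: numerator over denominator, as described above.
    counts = section["EDIT_DIST_COUNTS"]
    edit_dist = counts["numerator"] / counts["denominator"]
    return binary, frac, edit_dist


binary, frac, edit_dist = recompute_precision(precision_section)
print(round(binary, 8), round(frac, 8), round(edit_dist, 7))
```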

The Recall section of the output JSON has the same format as the Precision section, except that it has a false-negatives entry FN instead of the false-positives entry FP.
