Skip to content

Output files

martinghunt edited this page Apr 11, 2022 · 7 revisions

Overview

There are three output files:

  1. results.tsv - this is the overall file of results for the run. It is most likely the file you want. It is a tab-delimited file of counts of the truth bases vs what was called in the consensus.
  2. results.json - this contains the same information as in results.tsv, but is in JSON format.
  3. per_position.tsv - a detailed tab-delimited file, containing one line per reference position. It shows the multiple alignment of the reference, truth (inferred from the truth VCF file), and the sequence being evaluated. At each position the assigned category of truth and called bases is shown, where the categories are the same as those used in results.tsv.

Categorising each base of the genome

We will describe the per_position.tsv file first, since it understanding it is necessary to understand results.tsv.

The overall method is to first make an inferred "truth genome" by applying all the variants in the input truth VCF file to the reference genome. Then a multiple sequence alignment (MSA) of the reference, inferred truth, and sequence to be evaluated is made. This MSA is inspected at each position to categorise the correctness of the sequence to be evaluated. The results of this per-position evaluation is written to per_position.tsv. It will look like this:

Ref_pos  Ref  Truth  Consensus  Truth_category  Consensus_category
31       A    A      A          True_ref        Called_ref
32       C    C      C          True_ref        Called_ref
33       C    C      C          True_ref        Called_ref
... etc for the rest of the aligned genomes ...

In this case, it starts at position 31 because this example is the start of output where the amplicon scheme was ARTIC version 3. The first amplicon starts at position 31, so we do not evaluate positions before 31.

There are many possibilities in the Truth_category and Consensus_category! In the above 3 rows, the reference, truth genome, and consensus sequence were in agreement. The truth category True_ref means that the "truth" call is that the consensus should match the reference. Since it does match, the consensus category is Called_ref because it did indeed call the reference base. Most lines of this file will look like this (unless you evaluate terrible consensus!), since most positions will have no variation.

The possible truth categories are:

  • True_ref: the reference and truth bases are the same
  • SNP_true_alt: the truth base is different from the reference, and is one of A, C, G, T.
  • SNP_true_mixed: the truth base is different from the reference, and is a heterozygous call - one of the ambiguous heterozygous IUPAC codes K, M, R, S, W, Y.
  • Unknown_truth: the truth is unknown, for which we use N.
  • True_indel: there should be an insertion or deletion, which means either the reference or truth base will be -.
  • Dropped_amplicon: we expect the amplicon to have no consensus called because it was missing from the sequencing reads. The truth base will be Z in this case.

The consensus categories are:

  • Called_ref: the consensus base is the same as the reference
  • Called_correct_alt: the truth is a homozygous SNP or indel and the consensus called it correctly
  • Called_wrong_alt: the consensus has an incorrect homozygous SNP
  • Called_N: the consensus called an N
  • Called_correct_IUPAC: the consensus correctly called a heterozygous SNP
  • Called_wrong_IUPAC: the consensus incorrectly called a heterozygous SNP
  • Called_wrong_indel: the consensus incorrectly called an indel
  • Dropped_amplicon: the consensus looks like a dropped amplicon. This is determined by inspecting all runs of Ns in the consensus sequence. If such a run covers an amplicon, then the Ns at the amplicon position are replaced with Z, which is used to denote a dropped amplicon.
  • Called_other: when the truth is not known (Unknown _truth), this consensus category is used (except when the category is Dropped_amplicon, No_call_genome_ends, or called N).
  • No_call_genome_ends: depending on the method used to generate the consensus, the consensus may not begin at the start of the first amplicon, or finish at the end of the last amplicon. Such positions are classified as No_call_genome_ends, instead of calling them a deletion. The character e is used in the output file.

Example per-position output

Below is an example per_position.tsv file, on a made up tiny genome. It has most of the examples you would expect to see in a real run on the full SARS-CoV-2 genome. Its amplicon scheme sequences from position 5 to 42, so those are the coordinates in the output file. It has a (very small!) dropped amplicon at 27-30.

Ref_pos  Ref  Truth  Consensus  Truth_category   Consensus_category
5        A    A      e          True_ref         No_call_genome_ends
6        C    C      e          True_ref         No_call_genome_ends
7        G    G      G          True_ref         Called_ref
8        G    G      G          True_ref         Called_ref
9        C    A      A          SNP_true_alt     Called_correct_alt
10       C    G      C          SNP_true_alt     Called_ref
11       A    T      G          SNP_true_alt     Called_wrong_alt
12       G    C      N          SNP_true_alt     Called_N
13       C    C      C          True_ref         Called_ref
14       C    C      Y          True_ref         Called_wrong_IUPAC
15       T    Y      T          SNP_true_mixed   Called_ref
16       T    K      N          SNP_true_mixed   Called_N
17       G    Y      Y          SNP_true_mixed   Called_correct_IUPAC
18       G    Y      R          SNP_true_mixed   Called_wrong_IUPAC
19       T    T      T          True_ref         Called_ref
20       A    A      A          True_ref         Called_ref
21       C    C      C          True_ref         Called_ref
22       A    N      A          Unknown_truth    Called_ref
23       A    N      N          Unknown_truth    Called_N
24       G    G      G          True_ref         Called_ref
25       A    A      A          True_ref         Called_ref
26       A    A      A          True_ref         Called_ref
27       A    Z      G          Dropped_amplicon Called_other
28       G    Z      Z          Dropped_amplicon Dropped_amplicon
29       C    Z      Z          Dropped_amplicon Dropped_amplicon
30       T    Z      Z          Dropped_amplicon Dropped_amplicon
31       A    A      A          True_ref         Called_ref
32       T    T      -          True_ref         Called_wrong_indel
33       C    C      C          True_ref         Called_ref
34       A    A      A          True_ref         Called_ref
35       -    -      C          True_ref         Called_wrong_indel
35       -    -      C          True_ref         Called_wrong_indel
35       C    C      C          True_ref         Called_ref
36       -    C      C          True_indel       Called_correct_alt
36       T    T      T          True_ref         Called_ref
37       A    -      -          True_indel       Called_correct_alt
38       A    -      -          True_indel       Called_correct_alt
39       T    T      T          True_ref         Called_ref
40       G    G      e          True_ref         No_call_genome_ends
41       C    C      e          True_ref         No_call_genome_ends
42       A    A      e          True_ref         No_call_genome_ends

Summary results table

A summary of the per_position.tsv output is written to the file results.tsv in tab-delimited format, and also in JSON format in results.json. It contains counts of the Truth_category/Consensus_category combinations from the per-position analysis in per_position.tsv. It also counts the combinations, but just for positions that are inside primers.

The above per-position toy example would generate something like (disclaimer: the counts may not match here because this is just a made up example! They will match for real data) this results.tsv:

Truth               Called_ref  Called_correct_alt  Called_wrong_alt  Called_N  Called_correct_IUPAC  Called_wrong_IUPAC  Called_wrong_indel  Dropped_amplicon  Called_other  No_call_genome_ends
True_ref            15          0                   0                 0         0                     1                   3                   0                 0             5
SNP_true_alt        1           1                   1                 1         0                     0                   0                   0                 0             0
SNP_true_mixed      1           0                   0                 1         1                     1                   0                   0                 0             0
Unknown_truth       1           0                   0                 1         0                     0                   0                   0                 0             0
True_indel          0           3                   0                 0         0                     0                   0                   0                 0             0
Dropped_amplicon    0           0                   0                 0         0                     0                   0                   4                 0             0
P_True_ref          10          0                   0                 0         0                     0                   0                   0                 0             0
P_SNP_true_alt      0           0                   0                 0         0                     0                   0                   0                 0             0
P_SNP_true_mixed    0           0                   0                 0         0                     0                   0                   0                 0             0
P_Unknown_truth     0           0                   0                 0         0                     0                   0                   0                 0             0
P_True_indel        0           0                   0                 0         0                     0                   0                   0                 0             0
P_Dropped_amplicon  0           0                   0                 0         0                     0                   0                   2                 0             0

The top half shows counts across the whole genome, and the bottom half - the rows called P_* - are counts for just the positions that are in primers.

The exact same information is written to results.json, containing two dictionaries: one for the whole genome (key="All") and the other for just primers (key="Primer"). The keys/values are the rows/columns of the table. The format is like this:

{
  "All": {
    "True_ref": {
      "Called_ref": 29080,
      "Called_correct_alt": 0,
      "Called_wrong_alt": 0,
      ... etc ..
    },
    "SNP_true_alt": {
      "Called_ref": 10,
      ... etc ...
    },
    ... etc ...
  },
  "Primer": {
  ... same format as the "All" entry ...
  }
}
Clone this wiki locally