-
Notifications
You must be signed in to change notification settings - Fork 1
Output files
There are three output files:
-
results.tsv
- this is the overall file of results for the run. It is most likely the file you want. It is a tab-delimited file of counts of the truth bases vs what was called in the consensus. -
results.json
- this contains the same information as inresults.tsv
, but is in JSON format. -
per_position.tsv
- a detailed tab-delimited file, containing one line per reference position. It shows the multiple alignment of the reference, truth (inferred from the truth VCF file), and the sequence being evaluated. At each position the assigned category of truth and called bases is shown, where the categories are the same as those used inresults.tsv
.
We will describe the per_position.tsv
file first, since it understanding it is
necessary to understand results.tsv
.
The overall method is to first make an inferred "truth genome" by applying all the
variants in the input truth VCF file to the reference genome. Then a multiple
sequence alignment (MSA) of the reference, inferred truth, and sequence to be
evaluated is made. This MSA is inspected at each position to categorise
the correctness of the sequence to be evaluated. The results of this
per-position evaluation is written to per_position.tsv
. It will look
like this:
Ref_pos Ref Truth Consensus Truth_category Consensus_category
31 A A A True_ref Called_ref
32 C C C True_ref Called_ref
33 C C C True_ref Called_ref
... etc for the rest of the aligned genomes ...
In this case, it starts at position 31 because this example is the start of output where the amplicon scheme was ARTIC version 3. The first amplicon starts at position 31, so we do not evaluate positions before 31.
There are many possibilities in the Truth_category
and Consensus_category
!
In the above 3 rows, the reference, truth genome, and consensus
sequence were in agreement. The truth category True_ref
means that
the "truth" call is that the consensus should match the reference.
Since it does match, the consensus category is Called_ref
because
it did indeed call the reference base. Most lines of this file
will look like this (unless you evaluate terrible consensus!), since
most positions will have no variation.
The possible truth categories are:
-
True_ref
: the reference and truth bases are the same -
SNP_true_alt
: the truth base is different from the reference, and is one ofA
,C
,G
,T
. -
SNP_true_mixed
: the truth base is different from the reference, and is a heterozygous call - one of the ambiguous heterozygous IUPAC codesK
,M
,R
,S
,W
,Y
. -
Unknown_truth
: the truth is unknown, for which we useN
. -
True_indel
: there should be an insertion or deletion, which means either the reference or truth base will be-
. -
Dropped_amplicon
: we expect the amplicon to have no consensus called because it was missing from the sequencing reads. The truth base will beZ
in this case.
The consensus categories are:
-
Called_ref
: the consensus base is the same as the reference -
Called_correct_alt
: the truth is a homozygous SNP or indel and the consensus called it correctly -
Called_wrong_alt
: the consensus has an incorrect homozygous SNP -
Called_N
: the consensus called anN
-
Called_correct_IUPAC
: the consensus correctly called a heterozygous SNP -
Called_wrong_IUPAC
: the consensus incorrectly called a heterozygous SNP -
Called_wrong_indel
: the consensus incorrectly called an indel -
Dropped_amplicon
: the consensus looks like a dropped amplicon. This is determined by inspecting all runs ofN
s in the consensus sequence. If such a run covers an amplicon, then theN
s at the amplicon position are replaced withZ
, which is used to denote a dropped amplicon. -
Called_other
: when the truth is not known (Unknown _truth
), this consensus category is used (except when the category isDropped_amplicon
,No_call_genome_ends
, or calledN
). -
No_call_genome_ends
: depending on the method used to generate the consensus, the consensus may not begin at the start of the first amplicon, or finish at the end of the last amplicon. Such positions are classified asNo_call_genome_ends
, instead of calling them a deletion. The charactere
is used in the output file.
Below is an example per_position.tsv
file, on a made up tiny genome.
It has most of the examples
you would expect to see in a real run on the full SARS-CoV-2 genome.
Its amplicon scheme sequences from position 5 to 42, so those are the
coordinates in the output file. It has a (very small!) dropped amplicon
at 27-30.
Ref_pos Ref Truth Consensus Truth_category Consensus_category
5 A A e True_ref No_call_genome_ends
6 C C e True_ref No_call_genome_ends
7 G G G True_ref Called_ref
8 G G G True_ref Called_ref
9 C A A SNP_true_alt Called_correct_alt
10 C G C SNP_true_alt Called_ref
11 A T G SNP_true_alt Called_wrong_alt
12 G C N SNP_true_alt Called_N
13 C C C True_ref Called_ref
14 C C Y True_ref Called_wrong_IUPAC
15 T Y T SNP_true_mixed Called_ref
16 T K N SNP_true_mixed Called_N
17 G Y Y SNP_true_mixed Called_correct_IUPAC
18 G Y R SNP_true_mixed Called_wrong_IUPAC
19 T T T True_ref Called_ref
20 A A A True_ref Called_ref
21 C C C True_ref Called_ref
22 A N A Unknown_truth Called_ref
23 A N N Unknown_truth Called_N
24 G G G True_ref Called_ref
25 A A A True_ref Called_ref
26 A A A True_ref Called_ref
27 A Z G Dropped_amplicon Called_other
28 G Z Z Dropped_amplicon Dropped_amplicon
29 C Z Z Dropped_amplicon Dropped_amplicon
30 T Z Z Dropped_amplicon Dropped_amplicon
31 A A A True_ref Called_ref
32 T T - True_ref Called_wrong_indel
33 C C C True_ref Called_ref
34 A A A True_ref Called_ref
35 - - C True_ref Called_wrong_indel
35 - - C True_ref Called_wrong_indel
35 C C C True_ref Called_ref
36 - C C True_indel Called_correct_alt
36 T T T True_ref Called_ref
37 A - - True_indel Called_correct_alt
38 A - - True_indel Called_correct_alt
39 T T T True_ref Called_ref
40 G G e True_ref No_call_genome_ends
41 C C e True_ref No_call_genome_ends
42 A A e True_ref No_call_genome_ends
A summary of the per_position.tsv
output is written to the file results.tsv
in tab-delimited format, and also in JSON format in results.json
.
It contains counts of the Truth_category
/Consensus_category
combinations
from the per-position analysis in per_position.tsv
.
It also counts the combinations, but just for positions that are inside
primers.
The above per-position toy example would generate
something like (disclaimer: the counts may not match here because
this is just a made up example! They will match for real data)
this results.tsv
:
Truth Called_ref Called_correct_alt Called_wrong_alt Called_N Called_correct_IUPAC Called_wrong_IUPAC Called_wrong_indel Dropped_amplicon Called_other No_call_genome_ends
True_ref 15 0 0 0 0 1 3 0 0 5
SNP_true_alt 1 1 1 1 0 0 0 0 0 0
SNP_true_mixed 1 0 0 1 1 1 0 0 0 0
Unknown_truth 1 0 0 1 0 0 0 0 0 0
True_indel 0 3 0 0 0 0 0 0 0 0
Dropped_amplicon 0 0 0 0 0 0 0 4 0 0
P_True_ref 10 0 0 0 0 0 0 0 0 0
P_SNP_true_alt 0 0 0 0 0 0 0 0 0 0
P_SNP_true_mixed 0 0 0 0 0 0 0 0 0 0
P_Unknown_truth 0 0 0 0 0 0 0 0 0 0
P_True_indel 0 0 0 0 0 0 0 0 0 0
P_Dropped_amplicon 0 0 0 0 0 0 0 2 0 0
The top half shows counts across the whole genome, and the bottom half - the
rows called P_*
- are counts for just the positions that are in primers.
The exact same information is written to results.json
, containing two
dictionaries: one for the whole genome (key="All") and the other for just
primers (key="Primer").
The keys/values are the rows/columns of the table. The format is like this:
{
"All": {
"True_ref": {
"Called_ref": 29080,
"Called_correct_alt": 0,
"Called_wrong_alt": 0,
... etc ..
},
"SNP_true_alt": {
"Called_ref": 10,
... etc ...
},
... etc ...
},
"Primer": {
... same format as the "All" entry ...
}
}