JSON output file
This page describes the contents of the file log.json, made when running viridian_workflow run_one_sample.
The main entries in the file are:
- run_summary - has high-level details on the run
- read_and_primer_stats - high-level read counts (and reads mapped etc), and amplicon scheme identification details
- amplicon_scheme_name - the name of the amplicon scheme that was used (more details are at the end of the section on read_and_primer_stats)
- read_sampling - read depths and related information for each amplicon
- viridian - details of the results of consensus calling using Viridian
- self_qc - a summary of the masking of low quality consensus base calls
Please read on for more details about the contents of each of those entries.
An example run_summary entry is:
"run_summary": {
  "last_stage_completed": "Finished",
  "command": "viridian_workflow run_one_sample --tech illumina --ref_fasta ref.fa --reads1 reads_1.fastq.gz --reads2 reads_2.fastq.gz --outdir OUT",
  "options": {
    "debug": false,
    ... etc listing all the command line options ...
  },
  "cwd": "/working/dir/when/pipeline/was/started",
  "version": "0.1.1",
  "finished_running": true,
  "start_time": "2021-12-13T10:51:03",
  "end_time": "2021-12-13T10:54:49",
  "hostname": "myhost",
  "result": "Success",
  "run_time": "0:03:46.060333"
}
This should be mostly self-explanatory.
The file is written at several stages during the pipeline. Initially, result will be Unknown.
The above example is how it looks at the end of a successful run - the key thing is that result says Success. If the pipeline detects something wrong during the run, then result will be a list of error messages. For example, if too many amplicons do not have enough reads to reliably call a consensus, the pipeline will stop and this will be in the output:
"result": ["Too many amplicons are too low depth. STOPPING"]
The value of finished_running is set to false at the start, and true when the pipeline stops. A value of true does not mean that a final consensus sequence was made (see result to find that out); it just means that the pipeline stopped gracefully. A value of false means something unknown went wrong, in which case you should have errors in your terminal.
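The rules above for result and finished_running can be combined into one status check. This is a minimal sketch, assuming log.json has already been loaded into a dictionary; the function name describe_run and the status labels it returns are hypothetical and not part of Viridian Workflow.

```python
import json


def describe_run(log):
    """Classify a run from the run_summary entry of a loaded log.json dict.

    Returns (status, messages). The status labels are made up for this sketch.
    """
    summary = log["run_summary"]
    result = summary["result"]
    if not summary["finished_running"]:
        # The pipeline did not stop gracefully: check the terminal for errors.
        return "crashed", []
    if result == "Success":
        return "success", []
    if isinstance(result, list):
        # Graceful stop, but the pipeline reported one or more problems.
        return "failed", list(result)
    # Any other value, for example "Unknown"
    return "unknown", []


# Trimmed-down run_summary, using the error message shown above
example = json.loads(
    '{"run_summary": {"finished_running": true, '
    '"result": ["Too many amplicons are too low depth. STOPPING"]}}'
)
print(describe_run(example))
```

For a real run, the dictionary would come from something like json.load(open("OUT/log.json")), where OUT is the output directory given to --outdir.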
The read_and_primer_stats section contains information on mapping all the original input reads to the reference genome, and attempting to allocate them to amplicon(s). Here is an example for paired Illumina reads:
"read_and_primer_stats": {
  "unpaired_reads": 0,
  "reads1": 489949,
  "reads2": 489949,
  "total_reads": 979898,
  "mapped": 971562,
  "match_any_amplicon": 486271,
  "read_lengths": {
    "149": 1023,
    "150": 692551,
    ... etc. key=length, value=number of reads ...
  },
  "amplicon_scheme_set_matches": {
    "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
    "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
    "COVID-ARTIC-V3": 84823,
    "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
    "COVID-MIDNIGHT-1200": 252,
    "COVID-ARTIC-V4": 1
  },
  "amplicon_scheme_simple_counts": {
    "COVID-ARTIC-V3": 486018,
    "COVID-ARTIC-V4": 102299,
    "COVID-MIDNIGHT-1200": 382793
  },
  "chosen_amplicon_scheme": "COVID-ARTIC-V3"
}
The first few entries contain the number of reads: these are paired reads, so we have counts for forward reads reads1 and reverse reads reads2, and total_reads = forward plus reverse reads. For unpaired nanopore reads, the read count would be in unpaired_reads, the reads1/reads2 values would be zero, and unpaired_reads = total_reads.
The mapped entry is simply the total number of mapped reads.
The read_lengths entry is a histogram of read length to number of reads (it includes all reads, whether mapped or not).
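The read counts can be sanity-checked and turned into a mapped fraction. This is a small sketch using the key names from the example above; the function name read_count_summary is hypothetical, and treating total_reads as the sum of the three read counts is an assumption that holds for the example.

```python
def read_count_summary(stats):
    """Return (counts_consistent, mapped_fraction) for read_and_primer_stats."""
    # Assumption: total_reads is the sum of the paired and unpaired counts.
    expected_total = stats["reads1"] + stats["reads2"] + stats["unpaired_reads"]
    counts_consistent = expected_total == stats["total_reads"]
    total = stats["total_reads"]
    # Guard against an empty input file.
    mapped_fraction = stats["mapped"] / total if total else 0.0
    return counts_consistent, mapped_fraction


# Values copied from the example read_and_primer_stats entry
example_stats = {
    "unpaired_reads": 0,
    "reads1": 489949,
    "reads2": 489949,
    "total_reads": 979898,
    "mapped": 971562,
}
consistent, fraction = read_count_summary(example_stats)
print(consistent, round(fraction, 4))
```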
The match_any_amplicon count is the number of unpaired reads, or the number of fragments (read pairs) if the reads are paired, that match any amplicon from any of the amplicon schemes under consideration. For read pairs, the entire fragment (ie start of left read and end of right read) is considered, and therefore the count is for read pairs, not individual reads.
Since amplicon positions can overlap between amplicon schemes, a read (pair) can be allocated to zero, one, or more than one amplicon. The entry amplicon_scheme_set_matches shows the number of reads matching different combinations of schemes. For example, a read could match amplicon 1 from ARTIC V3 and amplicon 1 from Midnight-1200, and in this case the counter for "COVID-ARTIC-V3;COVID-MIDNIGHT-1200" would be incremented.
The entry amplicon_scheme_simple_counts shows the number of reads allocated to each amplicon scheme, ignoring combinations. For example, a read matching both COVID-ARTIC-V3 and COVID-MIDNIGHT-1200 would result in the counters for both those schemes being incremented.
Finally, the entry chosen_amplicon_scheme shows the amplicon scheme that was chosen. Currently the naive method of taking the scheme with the most counts from amplicon_scheme_simple_counts is used. This may change in the future.
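The relationship between the three scheme entries can be reproduced from the example numbers. The sketch below is an illustration only, not the pipeline's own code: the function names are hypothetical, and how the pipeline breaks ties is not stated here (Python's max returns the first maximal key).

```python
from collections import Counter


def simple_counts_from_set_matches(set_matches):
    """Add each combination's count to every scheme named in that combination."""
    counts = Counter()
    for combined_names, n_reads in set_matches.items():
        for scheme in combined_names.split(";"):
            counts[scheme] += n_reads
    return dict(counts)


def choose_scheme(simple_counts):
    """Naive choice: the scheme with the largest simple count."""
    return max(simple_counts, key=simple_counts.get)


# Values copied from the example amplicon_scheme_set_matches entry
set_matches = {
    "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
    "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
    "COVID-ARTIC-V3": 84823,
    "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
    "COVID-MIDNIGHT-1200": 252,
    "COVID-ARTIC-V4": 1,
}
simple_counts = simple_counts_from_set_matches(set_matches)
print(simple_counts)
print(choose_scheme(simple_counts))
```

Run on the example, this gives the same totals as the amplicon_scheme_simple_counts entry above and selects COVID-ARTIC-V3.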
Note that there is a top-level entry in the JSON file called amplicon_scheme_name. This is the scheme that was actually used. It will usually be the same as chosen_amplicon_scheme. However, if the option to force the scheme was used (--force_amp_scheme) then amplicon_scheme_name will be that forced choice, regardless of the result in chosen_amplicon_scheme.
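A quick way to spot a run where the scheme that was used differs from the automatically chosen one is to compare the two entries. This is a sketch with a hypothetical function name. Note that it cannot detect a forced scheme that happens to match the automatic choice.

```python
def used_scheme_differs_from_chosen(log):
    """True if the scheme that was used is not the automatically chosen one."""
    used = log["amplicon_scheme_name"]
    chosen = log["read_and_primer_stats"]["chosen_amplicon_scheme"]
    return used != chosen


# Minimal made-up log: the scheme was forced to one that was not chosen
forced_log = {
    "amplicon_scheme_name": "COVID-ARTIC-V4",
    "read_and_primer_stats": {"chosen_amplicon_scheme": "COVID-ARTIC-V3"},
}
print(used_scheme_differs_from_chosen(forced_log))
```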
The read_sampling section contains details of mapped reads, depths, and sampled reads from each amplicon in the chosen amplicon scheme. It looks like this:
"read_sampling": {
  "nCoV-2019_1_pool1": {
    "start": 31,
    "end": 410,
    "total_mapped_bases": 2627260,
    "total_depth": 6913.84,
    "sampled_bases": 380387,
    "sampled_depth": 1001.02,
    "pass": true
  },
  "nCoV-2019_2_pool2": {
    ... details for this amplicon ...
  },
  ... remaining amplicons ...
}
Each key is an amplicon name, and each value has the results for that amplicon. The start and end coordinates of the amplicon in the reference genome are in start and end. These are 1-based inclusive coordinates. total_mapped_bases is the total number of bases from the reads mapped to that amplicon (ie we exclude trimmed bases). total_depth is the mean depth of the amplicon from all of the reads. sampled_bases is the total length of the sampled reads, and sampled_depth is the mean depth of the sampled reads.
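In the example above, both depth values are consistent with dividing the number of bases by the amplicon length. The sketch below shows that arithmetic. The formula is inferred from the example numbers rather than taken from the pipeline's code, and the function name mean_depth is hypothetical.

```python
def mean_depth(n_bases, start, end):
    """Mean depth over an amplicon with 1-based inclusive coordinates."""
    length = end - start + 1  # inclusive, so 31..410 is 380 positions
    return n_bases / length


# Values copied from the nCoV-2019_1_pool1 example
total_depth = round(mean_depth(2627260, 31, 410), 2)
sampled_depth = round(mean_depth(380387, 31, 410), 2)
print(total_depth, sampled_depth)
```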
After sampling, the pipeline checks if each amplicon has at least 10X mean read depth (ie sampled_depth at least 10). The entry pass is set to true for amplicons that do have enough depth, and false otherwise.
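To list the amplicons that did not reach the depth cutoff, the pass entries can be filtered directly. This is a minimal sketch with a hypothetical function name; the second amplicon in the example data is made up to show a failure.

```python
def low_depth_amplicons(read_sampling):
    """Return the names of amplicons whose pass entry is false."""
    return [name for name, info in read_sampling.items() if not info["pass"]]


# First amplicon is from the example above; the second is made-up data
read_sampling = {
    "nCoV-2019_1_pool1": {"sampled_depth": 1001.02, "pass": True},
    "nCoV-2019_2_pool2": {"sampled_depth": 3.5, "pass": False},
}
print(low_depth_amplicons(read_sampling))
```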
The viridian section is a direct copy of the JSON file made when running Viridian, as described in the Viridian wiki JSON output file page.
The entries likely to be of most interest can be found in the viridian -> run_summary section:
- consensus - this is the consensus sequence made by Viridian. It is the sequence that is subsequently masked to make the final output of Viridian Workflow. If you want the final masked sequence from Viridian Workflow, then do not get it here!
- successful_amplicons and total_amplicons - these give the number of successfully assembled amplicons, and the total amplicons in the scheme that the workflow used.
- amplicon_success - this is a dictionary where the amplicon names are the keys, and values are true or false, showing if a consensus sequence was generated or not.
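The entries listed above can be pulled out of a loaded log.json as follows. This is a sketch: the function name assembly_summary is hypothetical, the nesting follows the viridian -> run_summary path described above, and the example values are made up.

```python
def assembly_summary(log):
    """Return (successful, total, failed_names) from viridian -> run_summary."""
    viridian_summary = log["viridian"]["run_summary"]
    failed = sorted(
        name
        for name, ok in viridian_summary["amplicon_success"].items()
        if not ok
    )
    return (
        viridian_summary["successful_amplicons"],
        viridian_summary["total_amplicons"],
        failed,
    )


# Made-up example with two amplicons, one of which failed
example_log = {
    "viridian": {
        "run_summary": {
            "successful_amplicons": 1,
            "total_amplicons": 2,
            "amplicon_success": {
                "nCoV-2019_1_pool1": True,
                "nCoV-2019_2_pool2": False,
            },
        }
    }
}
print(assembly_summary(example_log))
```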
Viridian workflow remaps all reads to the consensus sequence to mask low quality consensus base calls with Ns. A summary of this step is in the self_qc block:
"self_qc": {
  "masking_summary": {
    "consensus_length": 29499,
    "low_frs": 66,
    "total_masked": 652,
    "already_masked": 586
  },
Many sites will have already been masked during the amplicon detection step. These positions are counted in the already_masked field. The total_masked field is the number of Ns that have been inserted into the consensus by all stages of the viridian workflow. consensus_length is the length of the viridian consensus including masked positions.
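The masking summary can be reduced to two simple figures: the fraction of the consensus that is masked, and the number of positions masked after those counted in already_masked. This is a sketch with a hypothetical function name. In the example the second figure equals low_frs (652 - 586 = 66), but that is an observation about this one example, not a documented guarantee.

```python
def masking_figures(masking_summary):
    """Return (masked_fraction, newly_masked) from a masking_summary dict."""
    masked_fraction = (
        masking_summary["total_masked"] / masking_summary["consensus_length"]
    )
    # Positions masked in addition to those counted in already_masked.
    newly_masked = (
        masking_summary["total_masked"] - masking_summary["already_masked"]
    )
    return masked_fraction, newly_masked


# Values copied from the example masking_summary
masking_summary = {
    "consensus_length": 29499,
    "low_frs": 66,
    "total_masked": 652,
    "already_masked": 586,
}
masked_fraction, newly_masked = masking_figures(masking_summary)
print(round(100 * masked_fraction, 2), newly_masked)
```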
There are also per-position records that explain why a consensus base has been masked:
"5617": [
  "Insufficient support of consensus base; 88 / 144 < 0.7. 144 including primer regions."
],
"13040": [
  "Insufficient depth to evaluate consensus; 58 < 100. 58 including primer regions."
],
For each of these positions there is a list of criteria which have failed. These positions are consensus sequence coordinates.
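Because JSON keys are strings, the positions need converting to integers before they can be sorted numerically. The sketch below takes a dictionary of per-position records like the ones shown above. The function name is hypothetical, and where exactly these records sit inside the JSON file is not assumed here.

```python
def sorted_mask_reasons(records):
    """Return (position, reasons) pairs sorted by integer position."""
    return sorted(
        ((int(position), reasons) for position, reasons in records.items()),
        key=lambda item: item[0],
    )


# Records copied from the example above
records = {
    "5617": [
        "Insufficient support of consensus base; 88 / 144 < 0.7. 144 including primer regions."
    ],
    "13040": [
        "Insufficient depth to evaluate consensus; 58 < 100. 58 including primer regions."
    ],
}
for position, reasons in sorted_mask_reasons(records):
    print(position, "; ".join(reasons))
```

Sorting the keys as strings would put "13040" before "5617", which is why the conversion to int matters.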