JSON output file
This page describes the contents of the gzipped JSON file log.json.gz, made when running viridian run_one_sample.
The file contains a dictionary of run details. The main entries (keys) are:
- run_summary - has high-level details on the run
- stages_completed - the progress of each main stage in the pipeline
- reads - high-level summary of read counts
- read_depth - genome coverage and read depth information
- amplicon_scheme_name - name of the identified amplicon scheme
- scheme_choice - details of the amplicon scheme scoring
- amplicons - details of the amplicon scheme that was used
- self_qc - details of read pileup information at each masked position
- sequences - consensus sequence and variations (for MSAs and tree building)
Please read on for more details about the contents of each of those entries.
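As a minimal sketch, the file can be read with Python's standard gzip and json modules. The function names load_viridian_log and missing_keys are illustrative and not part of Viridian. Whether a log from an interrupted run contains every main entry is an assumption, so the second helper simply reports which ones are absent.

```python
import gzip
import json

# Main entries (keys) listed on this page.
EXPECTED_KEYS = [
    "run_summary", "stages_completed", "reads", "read_depth",
    "amplicon_scheme_name", "scheme_choice", "amplicons",
    "self_qc", "sequences",
]


def load_viridian_log(path):
    """Read a gzipped JSON log file and return it as a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def missing_keys(log):
    """Return the main entries that are not present in the log."""
    return [key for key in EXPECTED_KEYS if key not in log]
```

For example, load_viridian_log("OUT/log.json.gz") would read the log for an output directory called OUT, assuming the log file is written inside that directory.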
The run_summary section is a dictionary with a basic summary of the run. Here is an example (most of the key/value pairs in the options dictionary are omitted for brevity):
"last_stage_completed": "Finished",
"command": "viridian run_one_sample ... full command line used",
"options": {
"debug": false,
"outdir": "OUT",
"force": false,
},
"cwd": "/foo/bar/",
"version": "1.1.0",
"finished_running": true,
"start_time": "2023-09-08T13:37:59+00:00",
"end_time": "2023-09-08T13:39:28+00:00",
"hostname": "thehoff",
"result": "Success",
"errors": [],
"temp_processing_dir": "/tmp/viridian.rxs2ttki",
"total_amplicons": 98,
"successful_amplicons": 98,
"consensus_length": 29836,
"consensus_N_count": 96,
"consensus_N_percent": 0.32,
"consensus_ACGT_count": 29740,
"consensus_ACGT_percent": 99.68,
"consensus_het_count": 0,
"consensus_het_percent": 0.0,
"run_time": "0:01:29.384867"
The most important thing to check is:
"result": "Success"
meaning that the run finished successfully.
If instead it says "Fail", then something went wrong and
the details will be in the stages_completed section.
The other entries should be self-explanatory.
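The check described above can be sketched as follows, assuming the log has already been loaded into a Python dictionary. The function name describe_run is hypothetical, and any result other than "Success" is treated as unsuccessful.

```python
def describe_run(log):
    """One-line status built from run_summary and stages_completed."""
    if log.get("run_summary", {}).get("result") == "Success":
        return "Success"
    # On failure, point at the last stage recorded in stages_completed.
    stages = log.get("stages_completed", [])
    last = stages[-1] if stages else "no stage recorded"
    return f"Not successful; last stage recorded: {last}"
```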
The stages_completed section is a list of the stages that were completed. The JSON file is rewritten each time a stage finishes, so if Viridian crashes or is killed, you can see the last stage that completed.
A successful run looks like this:
"stages_completed": [
"1/10 Start pipeline (0.0s)",
"2/10 Process amplicon scheme files (0.1s)",
"3/10 Map reads to reference (36.8s)",
"4/10 Detect amplicon scheme (2.7s)",
"5/10 Sample reads (23.3s)",
"6/10 Initial consensus sequence (6.1s)",
"7/10 Initial VCF and MSA of consensus/reference (0.4s)",
"8/10 QC using reads vs consensus sequence (17.9s)",
"9/10 Final QC checks (0.1s)",
"10/10 Tidy up final files and log (1.0s)",
"Finished"
]
The entries can vary depending on the command line options. For example, if a BAM file of mapped reads was provided, then the "Map reads to reference" stage would not be present. However, the final entry for a successful run is always "Finished".
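A small sketch for pulling stage names and timings out of this list is shown below. The regular expression is inferred from the example entries above and is an assumption about the format, not a documented guarantee; parse_stages is a hypothetical helper name.

```python
import re

# Matches entries such as "3/10 Map reads to reference (36.8s)".
STAGE_RE = re.compile(r"^(\d+)/(\d+) (.+) \(([0-9.]+)s\)$")


def parse_stages(stages_completed):
    """Return (finished, [(stage_name, seconds), ...])."""
    # A successful run always ends with the entry "Finished".
    finished = bool(stages_completed) and stages_completed[-1] == "Finished"
    timings = []
    for entry in stages_completed:
        match = STAGE_RE.match(entry)
        if match:  # "Finished" and unrecognised entries are skipped
            timings.append((match.group(3), float(match.group(4))))
    return finished, timings
```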
The "reads" section is a dictionary of summary statistics of the reads. Here is an example for paired Illumina reads:
"reads": {
"unpaired_reads": 0,
"reads1": 337637,
"reads2": 337637,
"total_reads": 675274,
"mapped": 667394,
"match_any_amplicon": 328507,
"read_lengths": {
"250": 20,
"251": 675254
}
}
The meaning of these should be clear, except for match_any_amplicon.
For unpaired reads, this is simply the number of reads that matched
to any amplicon in the chosen amplicon scheme (not all schemes
under consideration). For paired reads it is the number of read
pairs that matched, since both reads within a pair must be considered
together when matching to an amplicon: their order and orientation
are important.
The "read_lengths" dictionary is a count of the number of reads of each given read length. In that example, there were 20 reads of length 250, and the remaining 675254 reads all had length 251.
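Because JSON object keys are always strings, the read lengths need converting to integers before doing arithmetic on them. The following is an illustrative sketch (read_length_stats is a hypothetical name) that computes the total number of reads and the mean read length from the read_lengths dictionary.

```python
def read_length_stats(read_lengths):
    """Return (total_reads, mean_read_length) from a read_lengths dict."""
    # Keys arrive as strings from JSON, for example "251".
    counts = {int(length): n for length, n in read_lengths.items()}
    total = sum(counts.values())
    total_bases = sum(length * n for length, n in counts.items())
    mean = (total_bases / total) if total else 0.0
    return total, mean
```

With the example above, the total is 675274, which agrees with total_reads.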
The read_depth section has a summary of the read depth and genome coverage. Here is an example:
"read_depth": {
"depth_at_least": {
"1": 29865,
"2": 29862,
"5": 29862,
"10": 29836,
"15": 29836,
"20": 29794,
"50": 29600,
"100": 29600
},
"percent_at_least_x_depth": {
"1": 99.87,
"2": 99.86,
"5": 99.86,
"10": 99.78,
"15": 99.78,
"20": 99.64,
"50": 98.99,
"100": 98.99
},
"mean_depth": 5470.33,
"mode_depth": 7393,
"median_depth": 5051
}
These are all based on read mapping to the genome, without using
any information on amplicon schemes.
The mean, mode and median depths are calculated with respect to
the entire genome (amplicon schemes do not cover the whole
genome). In that example, 99.64% of the genome (29794bp)
had at least 20X read depth. This is the value used during QC
(the options --coverage_min_x and --coverage_min_pc),
where by default Viridian requires at least 50 percent of
the genome to have at least 20X read depth.
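The coverage requirement can be re-checked from the log with a sketch like the one below. This is not Viridian's own implementation; the function name is hypothetical, the default thresholds are the ones stated above, and only depth thresholds that appear as keys in percent_at_least_x_depth can be checked this way.

```python
def passes_coverage(read_depth, min_x=20, min_pc=50.0):
    """True if at least min_pc percent of the genome has >= min_x depth."""
    percents = read_depth["percent_at_least_x_depth"]
    key = str(min_x)  # JSON keys are strings
    if key not in percents:
        raise KeyError(f"depth threshold {min_x} is not recorded in the log")
    return percents[key] >= min_pc
```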
The amplicon_scheme_name entry is simply a key/value pair with the chosen amplicon scheme, for example:
"amplicon_scheme_name": "COVID-ARTIC-V3"
The scheme_choice section has details of the amplicon scheme scores, and which scheme was chosen as best matching the reads. Example:
"scheme_choice": {
"scores": {
"COVID-ARTIC-V3": 4902,
"COVID-ARTIC-V4.1": 808,
"COVID-ARTIC-V5.0-5.3.2_400": 293,
"COVID-ARTIC-V5.0-5.2.0_1200": 184,
"COVID-MIDNIGHT-1200": 320,
"COVID-AMPLISEQ-V1": -193,
"COVID-VARSKIP-V1a-2b": 59
},
"best_schemes": [
"COVID-ARTIC-V3"
],
"best_score": 4902,
"best_scheme": "COVID-ARTIC-V3",
"score_ratio": 0.16
}
In that example, the best scheme was COVID-ARTIC-V3, with
a score of 4902. The second-best scheme was COVID-ARTIC-V4.1
with a score of 808. The ratio of these (score_ratio) was
808/4902 = 0.16.
By default, the best score needs to be at least 250, and the
ratio no more than 0.5 (options --min_scheme_score and
--max_scheme_ratio).
The best_schemes entry is a list to allow for the extremely
unlikely (and never seen!) case that two schemes score
equally well. If this happened, then the score ratio would be 1
and the run would be halted.
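The arithmetic above can be reproduced from the scores dictionary. This is a sketch under stated assumptions rather than Viridian's code: check_scheme_choice is a hypothetical name, the defaults are the thresholds quoted above, and the handling of a single scheme or a non-positive best score is a guess made only to keep the function well defined.

```python
def check_scheme_choice(scores, min_score=250, max_ratio=0.5):
    """Return (best_score, score_ratio, passes) from a scores dict."""
    ranked = sorted(scores.values(), reverse=True)
    best = ranked[0]
    # Assumption: with only one scheme, the second-best score is taken as 0.
    second = ranked[1] if len(ranked) > 1 else 0
    # Assumption: a non-positive best score can never pass.
    ratio = second / best if best > 0 else float("inf")
    passes = best >= min_score and ratio <= max_ratio
    return best, round(ratio, 2), passes
```

A tie between the two top schemes gives a ratio of 1, which fails the default maximum ratio of 0.5.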
The amplicons section contains details of the amplicon scheme that was chosen. It is a list of dictionaries, where each dictionary contains the information for one amplicon.
This is an example amplicon dictionary:
{
"name": "amplicon_42",
"primers": {
"left": [
{"start": 1242, "end": 1263, "read_count": 940}
],
"right": [
{"start": 1623, "end": 1650, "read_count": 955}
]
},
"excluded_primers": {
"left": [
{"start": 1200, "end": 1224, "read_count": 0}
],
"right": []
},
"start": 1242,
"end": 1650,
"dropped": false,
"reads": {
"total_reads": 1040,
"total_reads_fwd_strand": 1020,
"total_reads_rev_strand": 1020,
"qc_reads": 2044,
"qc_bases": 409041,
"qc_depth": 1000.1,
"assemble_bases": 40920
}
}
The primers dictionary shows the primers that were
in the amplicon scheme, and supported by reads.
The excluded_primers dictionary has the primers that
were excluded because of lack of evidence from the
reads. The "start" position is the minimum of the left
primer start positions, and the "end" position is the
maximum of the right primer end positions.
The dropped entry is false if a consensus sequence
was successfully produced, otherwise it is true.
In the reads dictionary, the total_* values
are for all reads matching to the amplicon.
The qc_* values are the numbers used after randomly
sampling to (by default) 1000X depth.
The assemble_bases value is the total length of
the reads used to make the consensus sequence
for the amplicon.
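The start/end rule and the dropped flag can be illustrated with two small helpers. Both names are hypothetical, and amplicon_span assumes each amplicon has at least one supported primer on each side (it raises ValueError otherwise).

```python
def amplicon_span(amplicon):
    """Recompute (start, end) from the supported primers of one amplicon."""
    left = amplicon["primers"]["left"]
    right = amplicon["primers"]["right"]
    # start = minimum left primer start, end = maximum right primer end
    return min(p["start"] for p in left), max(p["end"] for p in right)


def dropped_amplicon_names(amplicons):
    """Names of amplicons for which no consensus sequence was produced."""
    return [a["name"] for a in amplicons if a["dropped"]]
```

For the example amplicon above, amplicon_span gives (1242, 1650), matching its "start" and "end" entries.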
The self_qc section is really intended for debugging. It contains the full
pileup information at each masked position of the consensus sequence.
Most of this information is also in the QC TSV file qc.tsv.gz.
The sequences section contains two versions of the consensus sequence:
- masked_consensus - this is probably the sequence you want. It is the consensus after masking using reads mapped back to the initial consensus sequence. It is the same sequence as that written to the file consensus.fa.gz.
- unmasked_consensus - the initial consensus sequence before masking.
It also contains various multiple sequence alignment (MSA) sequences:
- msa_ref / msa_unmasked_consensus / masked_consensus_msa - these all have the same length, and are an MSA of the reference genome, unmasked consensus, and masked consensus sequences. Any of these sequences can contain gaps (- characters).
- masked_consensus_msa_indel_as_N / masked_consensus_msa_indel_as_ref - these have the same length as the original reference genome, and may be useful for building trees. They are an MSA of the masked consensus and the reference, but gaps in the reference are not allowed (those columns of the MSA are deleted). Gaps in the consensus are replaced either with Ns (masked_consensus_msa_indel_as_N) or with the reference sequence (masked_consensus_msa_indel_as_ref).
The actual JSON entry looks like this, but in this example the sequences have been truncated from their full ~30kbp for readability:
"sequences": {
"unmasked_consensus": "ACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGC",
"msa_unmasked_consensus": "------------------------------ACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAA",
"msa_ref": "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC",
"masked_consensus": "NNNNNNNNNNNNNNNNNNNNNNNNAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTG",
"masked_consensus_msa": "------------------------------NNNNNNNNNNNNNNNNNNNNNNNNAGATCTGTTCTCTAAAC",
"masked_consensus_msa_indel_as_N": "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGATCT",
"masked_consensus_msa_indel_as_ref": "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGAT"
}
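Since msa_ref and masked_consensus_msa have the same length in a real log, they can be compared column by column. The sketch below uses a hypothetical helper name and reports 0-based MSA column indices, not reference coordinates. It skips masked (N) positions but does report gaps, so uncovered ends of the genome will appear in the output. It is not meant to be run on the truncated example strings above, which no longer have equal lengths.

```python
def msa_differences(sequences):
    """List (msa_column, ref_char, consensus_char) where the two differ."""
    ref = sequences["msa_ref"]
    cons = sequences["masked_consensus_msa"]
    if len(ref) != len(cons):
        raise ValueError("MSA sequences differ in length")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(ref, cons))
        if r != c and c != "N"  # ignore masked consensus positions
    ]
```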