JSON output file

This page describes the contents of the file log.json, made when running viridian_workflow run_one_sample.

The main entries in the file are:

run_summary - has high-level details on the run
read_and_primer_stats - high-level read counts (and reads mapped etc), and amplicon scheme identification details
amplicon_scheme_name - the name of the amplicon scheme that was used (more details are at the end of the section on read_and_primer_stats)
read_sampling - read depths and related information for each amplicon
viridian - details of the results of consensus calling using Viridian
self_qc - a summary of the consensus positions masked after remapping the reads to the consensus (see the self_qc section below)

Please read on for more details about the contents of each of those entries.

`run_summary`

An example run_summary entry is:

"run_summary": {
    "last_stage_completed": "Finished",
    "command": "viridian_workflow run_one_sample --tech illumina --ref_fasta ref.fa --reads1 reads_1.fastq.gz --reads2 reads_2.fastq.gz --outdir OUT",
    "options": {
      "debug": false,
      ... etc listing all the command line options ...
    },
    "cwd": "/working/dir/when/pipeline/was/started",
    "version": "0.1.1",
    "finished_running": true,
    "start_time": "2021-12-13T10:51:03",
    "end_time": "2021-12-13T10:54:49",
    "hostname": "myhost",
    "result": "Success",
    "run_time": "0:03:46.060333"
}

This should be mostly self-explanatory.

The file is written at several stages during the pipeline. Initially, result will be Unknown. The above example shows how it looks at the end of a successful run: the key thing is that result is Success. If the pipeline detects a problem during the run, then result will be a list of error messages. For example, if too many amplicons have too few reads to reliably call a consensus, the pipeline will stop and the output will contain:

"result": ["Too many amplicons are too low depth. STOPPING"]

The value of finished_running is set to false at the start, and true when the pipeline stops. A value of true does not mean that a final consensus sequence was made (see result to find that out), it just means that the pipeline stopped gracefully. A value of false means something unknown went wrong, in which case you should have errors in your terminal.
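As a minimal sketch of how the result and finished_running fields could be interpreted together, the function below classifies a run from its run_summary entry. The function name and the returned labels are illustrative assumptions, not part of Viridian Workflow.

```python
def run_status(run_summary):
    """Classify a run from its run_summary entry.

    The returned labels are hypothetical names chosen for this sketch.
    """
    if not run_summary.get("finished_running", False):
        # The pipeline is still running, or stopped in an unknown way.
        return "not_finished"
    result = run_summary.get("result", "Unknown")
    if result == "Success":
        return "success"
    if isinstance(result, list):
        # The pipeline stopped gracefully and reported error messages.
        return "stopped_with_errors"
    return "unknown"
```

For the example entry above, run_status would return "success". For a run that stopped because of low depth it would return "stopped_with_errors", and the messages themselves are in result.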

`read_and_primer_stats`

This section contains information on mapping all the original input reads to the reference genome, and attempting to allocate them to amplicon(s). Here is an example for paired Illumina reads:

"read_and_primer_stats": {
    "unpaired_reads": 0,
    "reads1": 489949,
    "reads2": 489949,
    "total_reads": 979898,
    "mapped": 971562,
    "match_any_amplicon": 486271,
    "read_lengths": {
      "149": 1023,
      "150": 692551,
      ... etc. key=length, value=number of reads ...
    },
    "amplicon_scheme_set_matches": {
      "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
      "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
      "COVID-ARTIC-V3": 84823,
      "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
      "COVID-MIDNIGHT-1200": 252,
      "COVID-ARTIC-V4": 1
    },
    "amplicon_scheme_simple_counts": {
      "COVID-ARTIC-V3": 486018,
      "COVID-ARTIC-V4": 102299,
      "COVID-MIDNIGHT-1200": 382793
    },
    "chosen_amplicon_scheme": "COVID-ARTIC-V3"
}

The first few entries contain the number of reads. These are paired reads, so we have counts for forward reads in reads1 and reverse reads in reads2, and total_reads is the sum of forward and reverse reads. For unpaired nanopore reads, the read count would be in unpaired_reads, the reads1 and reads2 values would be zero, and unpaired_reads would equal total_reads. The mapped entry is simply the total number of mapped reads. The read_lengths entry is a histogram of read length to number of reads (it includes all reads, whether mapped or not).

The match_any_amplicon count is the number of reads (if the reads are unpaired) or read pairs (if the reads are paired) that match any amplicon from any of the amplicon schemes under consideration. For read pairs, the entire fragment (ie from the start of the left read to the end of the right read) is considered, which is why the count is of read pairs, not individual reads.
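Because mapped counts reads while match_any_amplicon counts read pairs for paired data, the two need different denominators when turned into fractions. The sketch below illustrates this. It assumes that for paired data reads1 equals the number of pairs, and the function and key names in the returned dictionary are hypothetical.

```python
def read_fractions(stats):
    """Fractions of reads mapped and of fragments matching an amplicon.

    `stats` is the read_and_primer_stats entry. Assumes paired data has
    reads1 > 0 and that reads1 equals the number of read pairs.
    """
    total = stats["total_reads"]
    if total == 0:
        raise ValueError("no reads in read_and_primer_stats")
    paired = stats["reads1"] > 0
    # match_any_amplicon counts pairs for paired reads, reads otherwise.
    fragments = stats["reads1"] if paired else total
    return {
        "mapped_fraction": stats["mapped"] / total,
        "amplicon_match_fraction": stats["match_any_amplicon"] / fragments,
    }
```

With the example values above, both fractions come out at a little over 0.99.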

Since amplicon positions can overlap between amplicon schemes, a read (pair) can be allocated to zero, one, or more than one amplicon. The entry amplicon_scheme_set_matches shows the number of reads matching different combinations of schemes. For example, a read could match amplicon 1 from ARTIC V3 and amplicon 1 from Midnight-1200, and in this case the counter for "COVID-ARTIC-V3;COVID-MIDNIGHT-1200" would be incremented.

The entry amplicon_scheme_simple_counts shows the number of reads allocated to each amplicon scheme, ignoring combinations. For example, a read matching both COVID-ARTIC-V3 and COVID-MIDNIGHT-1200 would result in the counters for both those schemes being incremented.

Finally, the entry chosen_amplicon_scheme shows the amplicon scheme that was chosen. Currently the naive method of taking the scheme with the most counts from amplicon_scheme_simple_counts is used. This may change in the future.
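The relationship between amplicon_scheme_set_matches, amplicon_scheme_simple_counts and chosen_amplicon_scheme can be sketched as follows. This is an illustration of the logic described above, not the workflow's own code. It assumes scheme names in a combination are separated by ";" as in the example, and the function names are hypothetical.

```python
from collections import Counter


def simple_counts(set_matches):
    """Add each combination's count to every scheme in that combination."""
    counts = Counter()
    for combination, n in set_matches.items():
        for scheme in combination.split(";"):
            counts[scheme] += n
    return dict(counts)


def choose_scheme(counts):
    """Pick the scheme with the highest simple count."""
    # How the workflow breaks ties is not stated; max() keeps the first.
    return max(counts, key=counts.get)


set_matches = {
    "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
    "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
    "COVID-ARTIC-V3": 84823,
    "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
    "COVID-MIDNIGHT-1200": 252,
    "COVID-ARTIC-V4": 1,
}
counts = simple_counts(set_matches)
chosen = choose_scheme(counts)
```

Applied to the example above, this reproduces the amplicon_scheme_simple_counts values shown and selects COVID-ARTIC-V3.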

Note that there is a top-level entry in the JSON file called amplicon_scheme_name. This is the scheme that was actually used. It will usually be the same as chosen_amplicon_scheme. However, if the option to force the scheme was used (--force_amp_scheme) then amplicon_scheme_name will be that forced choice, regardless of the result in chosen_amplicon_scheme.

`read_sampling`

This section contains details of mapped reads, depths, and sampled reads from each amplicon in the chosen amplicon scheme.

It looks like this:

"read_sampling": {
    "nCoV-2019_1_pool1": {
      "start": 31,
      "end": 410,
      "total_mapped_bases": 2627260,
      "total_depth": 6913.84,
      "sampled_bases": 380387,
      "sampled_depth": 1001.02,
      "pass": true
    },
    "nCoV-2019_2_pool2": {
    ... details for this amplicon ...
    },
    ... remaining amplicons ...
}

Each key is an amplicon name, and each value has the results for that amplicon. The start and end coordinates of the amplicon in the reference genome are in start and end. These are 1-based inclusive coordinates. total_mapped_bases is the total number of bases from the reads mapped to that amplicon (ie we exclude trimmed bases). total_depth is the mean depth of the amplicon from all of the reads. sampled_bases is the total length of the sampled reads, and sampled_depth is the mean depth of the sampled reads.

After sampling, the pipeline checks if each amplicon has at least 10X mean read depth (ie sampled_depth at least 10). The entry pass is set to true for amplicons that do have enough depth, and false otherwise.
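The depth values in the example are consistent with dividing the number of bases by the amplicon length, where the length follows from the 1-based inclusive coordinates. The sketch below works under that assumption. The function names and the min_depth parameter are hypothetical, and the workflow's exact calculation may differ.

```python
def mean_depth(bases, start, end):
    """Mean depth over an amplicon with 1-based inclusive coordinates."""
    length = end - start + 1
    return bases / length


def passes_depth(sampled_depth, min_depth=10.0):
    """True if the sampled depth reaches the 10X threshold described above."""
    return sampled_depth >= min_depth


amplicon = {
    "start": 31,
    "end": 410,
    "total_mapped_bases": 2627260,
    "sampled_bases": 380387,
}
total = mean_depth(amplicon["total_mapped_bases"], amplicon["start"], amplicon["end"])
sampled = mean_depth(amplicon["sampled_bases"], amplicon["start"], amplicon["end"])
```

For the nCoV-2019_1_pool1 example the amplicon length is 380, and rounding the two results to two decimal places gives the total_depth and sampled_depth values shown.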

`viridian`

This is a direct copy of the JSON file made when running Viridian, as described in the Viridian wiki JSON output file page.

The entries likely to be of most interest can both be found in the viridian -> run_summary section:

consensus - this is the consensus sequence made by Viridian. It is the sequence that is subsequently masked to make the final output of Viridian Workflow. If you want the final masked sequence from Viridian Workflow, then do not get it here!
successful_amplicons and total_amplicons - these give the number of successfully assembled amplicons, and the total amplicons in the scheme that the workflow used.
amplicon_success - this is a dictionary where the amplicon names are the keys, and values are true or false, showing if a consensus sequence was generated or not.
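A short sketch of how the amplicon_success dictionary might be used to list the amplicons that failed. The function name is hypothetical, and the input in the usage lines is a made-up illustration rather than real output.

```python
def failed_amplicons(viridian_run_summary):
    """Names of amplicons for which no consensus sequence was generated."""
    success = viridian_run_summary["amplicon_success"]
    return sorted(name for name, ok in success.items() if not ok)


# Made-up input for illustration only.
example = {
    "successful_amplicons": 1,
    "total_amplicons": 2,
    "amplicon_success": {
        "nCoV-2019_1_pool1": True,
        "nCoV-2019_2_pool2": False,
    },
}
failed = failed_amplicons(example)
```

Here failed contains only nCoV-2019_2_pool2.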

`self_qc`

Viridian Workflow remaps all reads to the consensus sequence to mask low-quality consensus base calls with Ns. A summary of this step is in the self_qc block:

"self_qc": {
     "masking_summary": {
       "consensus_length": 29499,
       "low_frs": 66,
       "total_masked": 652,
       "already_masked": 586
     },

Many sites will have already been masked during the amplicon detection step. These positions are counted in the already_masked field. The total_masked field is the number of Ns that have been inserted into the consensus by all stages of Viridian Workflow. consensus_length is the length of the Viridian consensus including masked positions.

There are also per-position records that explain why a consensus base has been masked:

     "5617": [
       "Insufficient support of consensus base; 88 / 144 < 0.7. 144 including primer regions."
     ],
     "13040": [
       "Insufficient depth to evaluate consensus; 58 < 100. 58 including primer regions."
     ],

For each of these positions there is a list of criteria which have failed. These positions are consensus sequence coordinates.
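The per-position records could be collected and summarised along the following lines. This sketch assumes the records sit at the same level as masking_summary inside self_qc and are keyed by the position as a string of digits, as the fragments above suggest. It also assumes each message starts with a short reason followed by ";". The function names are hypothetical.

```python
from collections import Counter


def masked_positions(self_qc):
    """Return {position: [messages]} for the per-position records."""
    # Assumption: position keys are digit strings; other keys are skipped.
    return {int(k): v for k, v in self_qc.items() if k.isdigit()}


def reason_counts(self_qc):
    """Count failed criteria by the text before the first ';'."""
    counts = Counter()
    for messages in masked_positions(self_qc).values():
        for message in messages:
            counts[message.split(";")[0]] += 1
    return dict(counts)


self_qc = {
    "masking_summary": {
        "consensus_length": 29499,
        "low_frs": 66,
        "total_masked": 652,
        "already_masked": 586,
    },
    "5617": [
        "Insufficient support of consensus base; 88 / 144 < 0.7. 144 including primer regions."
    ],
    "13040": [
        "Insufficient depth to evaluate consensus; 58 < 100. 58 including primer regions."
    ],
}
positions = masked_positions(self_qc)
reasons = reason_counts(self_qc)
```

In the example summary, low_frs plus already_masked equals total_masked (66 + 586 = 652), although the page does not state that this always holds.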
