
JSON output file

martinghunt edited this page Dec 13, 2021 · 28 revisions

This page describes the contents of the file log.json, made when running viridian_workflow run_one_sample.

The main entries in the file are:

  • run_summary - has high-level details on the run
  • read_and_primer_stats - high-level read counts (and reads mapped etc), and amplicon scheme identification details
  • read_sampling - read depths and related information for each amplicon
  • viridian - details of the results of consensus calling using Viridian
  • self_qc - this is to be implemented.

Please read on below for more details about the contents of each of those entries.

run_summary

An example run_summary entry is:

"run_summary": {
    "last_stage_completed": "Finished",
    "command": "viridian_workflow run_one_sample --tech illumina --ref_fasta ref.fa --reads1 reads_1.fastq.gz --reads2 reads_2.fastq.gz --outdir OUT",
    "options": {
      "debug": false,
      ... etc listing all the command line options ...
    },
    "cwd": "/hps/nobackup/iqbal/mhunt/Covid_test_data_20210813.VWF.20211213.d1932ec1ea/Thielen",
    "version": "0.1.1",
    "finished_running": true,
    "start_time": "2021-12-13T10:51:03",
    "end_time": "2021-12-13T10:54:49",
    "hostname": "myhost",
    "result": "Success",
    "run_time": "0:03:46.060333"
  }

This should be mostly self-explanatory.

The file is written at several stages during the pipeline. Initially, result will be Unknown. The above example shows how it looks at the end of a successful run; the key point is that result is Success. If the pipeline detects a problem during the run, then result will instead be a list of error messages. For example, if too many amplicons have too few reads to reliably call a consensus, the pipeline will stop and the output will contain:

"result": ["Too many amplicons are too low depth. STOPPING"]

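A minimal sketch of how a script might interpret result, relying only on the three states described above (Unknown, Success, or a list of error messages). The function name summarise_result is hypothetical and not part of viridian_workflow. In practice the dictionary would come from json.load on the log.json file in the output directory.

```python
import json


def summarise_result(log):
    """Return (ok, messages) from a parsed log.json dictionary.

    Assumes only what is described above: result is "Unknown" while the
    run is in progress, "Success" at the end of a successful run, or a
    list of error messages if the pipeline stopped.
    """
    result = log["run_summary"]["result"]
    if result == "Success":
        return True, []
    if isinstance(result, list):
        return False, result
    # "Unknown" or anything else: the run did not finish cleanly
    return False, [f"Unexpected result: {result}"]


# Inline stand-in for json.load() on the log.json of a stopped run
example = json.loads(
    '{"run_summary": {"result": ["Too many amplicons are too low depth. STOPPING"]}}'
)
ok, messages = summarise_result(example)
print(ok, messages)
```

Treating Unknown as a failure is a design choice of this sketch: a log.json that still says Unknown after the process has exited suggests the run was interrupted.
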
read_and_primer_stats

This section contains information on mapping all the original input reads to the reference genome, and attempting to allocate them to amplicon(s). Here is an example for paired Illumina reads:

"read_and_primer_stats": {
    "unpaired_reads": 0,
    "reads1": 489949,
    "reads2": 489949,
    "total_reads": 979898,
    "mapped": 971562,
    "match_any_amplicon": 486271,
    "read_lengths": {
      "149": 1023,
      "150": 692551,
      ... etc. key=length, value=number of reads ...
    },
    "amplicon_scheme_set_matches": {
      "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
      "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
      "COVID-ARTIC-V3": 84823,
      "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
      "COVID-MIDNIGHT-1200": 252,
      "COVID-ARTIC-V4": 1
    },
    "amplicon_scheme_simple_counts": {
      "COVID-ARTIC-V3": 486018,
      "COVID-ARTIC-V4": 102299,
      "COVID-MIDNIGHT-1200": 382793
    },
    "chosen_amplicon_scheme": "COVID-ARTIC-V3"
  }

The first few entries contain read counts. These are paired reads, so there are counts for forward reads (reads1) and reverse reads (reads2), and total_reads is the sum of the two. For unpaired nanopore reads, the read count would be in reads, the reads1 and reads2 values would be zero, and reads would equal total_reads. The mapped entry is the total number of mapped reads. The read_lengths entry is a histogram mapping read length to number of reads (it includes all reads, whether mapped or not).
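As a rough sketch, the relationships described above can be checked for paired reads as follows. The function name check_read_counts is hypothetical, and the numbers are copied from the example entry above.

```python
def check_read_counts(stats):
    """Basic checks on a read_and_primer_stats dictionary for paired reads."""
    paired_total = stats["reads1"] + stats["reads2"]
    return {
        # total_reads should be forward plus reverse reads
        "total_matches_pairs": paired_total == stats["total_reads"],
        # proportion of all input reads that mapped to the reference
        "fraction_mapped": stats["mapped"] / stats["total_reads"],
    }


# Values taken from the paired Illumina example above
stats = {
    "reads1": 489949,
    "reads2": 489949,
    "total_reads": 979898,
    "mapped": 971562,
}
report = check_read_counts(stats)
print(report["total_matches_pairs"], round(report["fraction_mapped"], 4))
```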

The match_any_amplicon entry counts reads if the reads are unpaired, and fragments (read pairs) if the reads are paired. It is the number that match any amplicon from any of the amplicon schemes under consideration. For read pairs, the entire fragment (i.e. from the start of the left read to the end of the right read) is considered, so the count is of read pairs, not individual reads.

Since amplicon positions can overlap between amplicon schemes, a read (pair) can be allocated to zero, one, or more than one amplicon. The entry amplicon_scheme_set_matches shows the number of reads matching different combinations of schemes. For example, a read could match amplicon 1 from ARTIC V3 and amplicon 1 from Midnight-1200, and in this case the counter for "COVID-ARTIC-V3;COVID-MIDNIGHT-1200" would be incremented.

The entry amplicon_scheme_simple_counts shows the number of reads allocated to each amplicon scheme, ignoring combinations. For example, a read matching both COVID-ARTIC-V3 and COVID-MIDNIGHT-1200 would increment the counters for both of those schemes.
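The relationship between the two entries can be illustrated by rebuilding the simple counts from the set matches. This is a sketch, not the pipeline's own code: the function name is hypothetical, and it assumes that scheme names within a key are separated by ";", as they appear in the example above.

```python
from collections import Counter


def simple_counts_from_set_matches(set_matches):
    """Add each combination's count to every scheme named in that combination."""
    counts = Counter()
    for scheme_set, n_reads in set_matches.items():
        for scheme in scheme_set.split(";"):
            counts[scheme] += n_reads
    return dict(counts)


# amplicon_scheme_set_matches copied from the example above
set_matches = {
    "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
    "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
    "COVID-ARTIC-V3": 84823,
    "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
    "COVID-MIDNIGHT-1200": 252,
    "COVID-ARTIC-V4": 1,
}
derived = simple_counts_from_set_matches(set_matches)
print(derived)
```

Applied to the example data, this reproduces the amplicon_scheme_simple_counts values shown above (486018, 102299 and 382793).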

Finally, the entry chosen_amplicon_scheme shows the amplicon scheme that was chosen. Currently the naive method of taking the scheme with the most counts from amplicon_scheme_simple_counts is used. This may change in the future.
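The selection rule described above amounts to taking the key with the largest value. The sketch below shows that rule only; the function name choose_scheme is hypothetical, and how the pipeline breaks ties is not described here, so this version simply returns whichever tied scheme comes first in the dictionary.

```python
def choose_scheme(simple_counts):
    """Return the scheme name with the highest simple count."""
    return max(simple_counts, key=simple_counts.get)


# amplicon_scheme_simple_counts copied from the example above
simple_counts = {
    "COVID-ARTIC-V3": 486018,
    "COVID-ARTIC-V4": 102299,
    "COVID-MIDNIGHT-1200": 382793,
}
chosen = choose_scheme(simple_counts)
print(chosen)
```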

Note that there is a top-level entry in the JSON called amplicon_scheme_name. This is the scheme that was actually used. It will usually be the same as chosen_amplicon_scheme. However, if the option to force the scheme was used (--force_amp_scheme) then amplicon_scheme_name will be that forced choice, regardless of the result in chosen_amplicon_scheme.
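A small sketch of how the two entries might be compared when checking a run. The function name used_scheme_differs is hypothetical, and the dictionary below is made-up illustrative input, not output from a real run.

```python
def used_scheme_differs(log):
    """True if the scheme actually used is not the automatically chosen one."""
    used = log["amplicon_scheme_name"]
    chosen = log["read_and_primer_stats"]["chosen_amplicon_scheme"]
    return used != chosen


# Illustrative values only: a run where the scheme was forced to ARTIC V4
log = {
    "amplicon_scheme_name": "COVID-ARTIC-V4",
    "read_and_primer_stats": {"chosen_amplicon_scheme": "COVID-ARTIC-V3"},
}
print(used_scheme_differs(log))
```

A difference suggests that --force_amp_scheme was used. The reverse does not hold: forcing the scheme that would have been chosen anyway leaves the two entries equal.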

read_sampling

viridian

self_qc

To be completed
