
JSON output file

martinghunt edited this page Jan 26, 2024 · 28 revisions

This page describes the contents of the gzipped JSON file log.json.gz, made when running viridian run_one_sample.

The file contains a dictionary of run details. The main entries (keys) are:

  • run_summary - has high-level details on the run
  • stages_completed - the progress of each main stage in the pipeline
  • reads - high-level summary of read counts
  • read_depth - genome coverage and read depth information
  • amplicon_scheme_name - name of the identified amplicon scheme
  • scheme_choice - details of the amplicon scheme scoring
  • amplicons - details of the amplicon scheme that was used
  • self_qc - details of read pileup information at each masked position
  • sequences - consensus sequence and variations (for MSAs and tree building)

Please read on for more details about the contents of each of those entries.
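
As a minimal sketch, the file can be read with Python's standard gzip and json modules. The names load_viridian_log and missing_main_keys are illustrative helpers, not part of Viridian, and the assumption that some keys may be absent after an interrupted run is ours.

```python
import gzip
import json

# The main top-level keys listed above.
MAIN_KEYS = [
    "run_summary", "stages_completed", "reads", "read_depth",
    "amplicon_scheme_name", "scheme_choice", "amplicons",
    "self_qc", "sequences",
]


def load_viridian_log(path):
    """Read a gzipped JSON log file and return its contents as a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def missing_main_keys(log):
    """Return the main keys that are not present in the loaded log."""
    return [key for key in MAIN_KEYS if key not in log]
```

For example, load_viridian_log("OUT/log.json.gz") would read the log from an output directory called OUT, and missing_main_keys shows which sections are not there to inspect.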

run_summary

This section is a dictionary with a basic summary of the run. Here is an example (most of the key/value pairs in the options dictionary are omitted for brevity):

"last_stage_completed": "Finished",
"command": "viridian run_one_sample ... full command line used",
"options": {
  "debug": false,
  "outdir": "OUT",
  "force": false,
},
"cwd": "/foo/bar/",
"version": "1.1.0",
"finished_running": true,
"start_time": "2023-09-08T13:37:59+00:00",
"end_time": "2023-09-08T13:39:28+00:00",
"hostname": "thehoff",
"result": "Success",
"errors": [],
"temp_processing_dir": "/tmp/viridian.rxs2ttki",
"total_amplicons": 98,
"successful_amplicons": 98,
"consensus_length": 29836,
"consensus_N_count": 96,
"consensus_N_percent": 0.32,
"consensus_ACGT_count": 29740,
"consensus_ACGT_percent": 99.68,
"consensus_het_count": 0,
"consensus_het_percent": 0.0,
"run_time": "0:01:29.384867"

The most important thing to check is:

"result": "Success"

meaning that the run finished successfully. If instead it says "Fail", then something went wrong and the details will be in the stages_completed section. The other entries should be self-explanatory.
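
A short sketch of this check in Python, assuming the run_summary dictionary has already been loaded from the JSON file. The function name summarise_result is hypothetical, and the example dictionary only includes the keys it uses.

```python
def summarise_result(run_summary):
    """Return (ok, message) for a run_summary dictionary."""
    result = run_summary.get("result")
    if result == "Success":
        return True, "Run finished successfully"
    # On failure, report the fields most likely to help with debugging.
    errors = run_summary.get("errors", [])
    last_stage = run_summary.get("last_stage_completed", "unknown")
    return False, f"result={result!r}, last stage={last_stage!r}, errors={errors}"


example = {"last_stage_completed": "Finished", "result": "Success", "errors": []}
ok, message = summarise_result(example)
print(ok, message)
```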

stages_completed

This is a list of the stages that were completed. Each time a stage finishes, the JSON file is written, so that if Viridian crashes or is killed, you can see the last stage that was run.

A successful run looks like this:

"stages_completed": [
  "1/10 Start pipeline (0.0s)",
  "2/10 Process amplicon scheme files (0.1s)",
  "3/10 Map reads to reference (36.8s)",
  "4/10 Detect amplicon scheme (2.7s)",
  "5/10 Sample reads (23.3s)",
  "6/10 Initial consensus sequence (6.1s)",
  "7/10 Initial VCF and MSA of consensus/reference (0.4s)",
  "8/10 QC using reads vs consensus sequence (17.9s)",
  "9/10 Final QC checks (0.1s)",
  "10/10 Tidy up final files and log (1.0s)",
  "Finished"
]

The entries can vary depending on the command line options. For example, if a BAM file of mapped reads was provided, then the "Map reads to reference" stage would not be present. However, the final entry for a successful run is always "Finished".
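
The stage strings can be split into a name and a run time with a regular expression. This is a sketch that assumes every stage entry follows the "N/M name (X.Xs)" layout of the example above; the pattern and the function name parse_stages are ours, not part of Viridian.

```python
import re

# Matches entries such as "3/10 Map reads to reference (36.8s)".
STAGE_PATTERN = re.compile(r"^(\d+)/(\d+) (.+) \(([\d.]+)s\)$")


def parse_stages(stages_completed):
    """Return ([(stage name, seconds), ...], finished flag)."""
    timings = []
    for entry in stages_completed:
        match = STAGE_PATTERN.match(entry)
        if match:  # the final "Finished" entry has no timing and is skipped
            timings.append((match.group(3), float(match.group(4))))
    finished = bool(stages_completed) and stages_completed[-1] == "Finished"
    return timings, finished


stages = [
    "1/10 Start pipeline (0.0s)",
    "2/10 Process amplicon scheme files (0.1s)",
    "3/10 Map reads to reference (36.8s)",
    "4/10 Detect amplicon scheme (2.7s)",
    "5/10 Sample reads (23.3s)",
    "6/10 Initial consensus sequence (6.1s)",
    "7/10 Initial VCF and MSA of consensus/reference (0.4s)",
    "8/10 QC using reads vs consensus sequence (17.9s)",
    "9/10 Final QC checks (0.1s)",
    "10/10 Tidy up final files and log (1.0s)",
    "Finished",
]
timings, finished = parse_stages(stages)
slowest = max(timings, key=lambda item: item[1])
print(finished, slowest)
```

If finished is false, the last element of timings is the last stage that completed before the run stopped.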

reads

The "reads" section is a dictionary of summary statistics of the reads. Here is an example for paired Illumina reads:

"reads": {
  "unpaired_reads": 0,
  "reads1": 337637,
  "reads2": 337637,
  "total_reads": 675274,
  "mapped": 667394,
  "match_any_amplicon": 328507,
  "read_lengths": {
    "250": 20,
    "251": 675254
  }
}

The meaning of these should be clear, except for match_any_amplicon. For unpaired reads, this is simply the number of reads that matched to any amplicon in the chosen amplicon scheme (not all schemes under consideration). For paired reads it is the number of read pairs that matched, since both reads within a pair must be considered together when matching to an amplicon: their order and orientation are important.

The "read_lengths" dictionary is a count of the number of reads of each given read length. In that example, there were 20 reads of length 250, and the remaining 675254 reads all had length 251.
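
As an illustration, the following sketch derives a few summary numbers from the reads dictionary. Note that JSON keys are always strings, so the read lengths need converting to integers. The function name read_length_stats and the derived fields are our own and do not appear in the log.

```python
def read_length_stats(reads):
    """Compute read count, total bases, mean length and percent mapped."""
    lengths = {int(length): count for length, count in reads["read_lengths"].items()}
    n_reads = sum(lengths.values())
    total_bases = sum(length * count for length, count in lengths.items())
    total_reads = reads["total_reads"]
    return {
        "n_reads": n_reads,
        "total_bases": total_bases,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "percent_mapped": 100 * reads["mapped"] / total_reads if total_reads else 0.0,
    }


reads = {
    "unpaired_reads": 0,
    "reads1": 337637,
    "reads2": 337637,
    "total_reads": 675274,
    "mapped": 667394,
    "match_any_amplicon": 328507,
    "read_lengths": {"250": 20, "251": 675254},
}
stats = read_length_stats(reads)
print(stats)
```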

read_depth

This has a summary of the read depth and genome coverage. Here is an example:

"read_depth": {
  "depth_at_least": {
    "1": 29865,
    "2": 29862,
    "5": 29862,
    "10": 29836,
    "15": 29836,
    "20": 29794,
    "50": 29600,
    "100": 29600
  },
  "percent_at_least_x_depth": {
    "1": 99.87,
    "2": 99.86,
    "5": 99.86,
    "10": 99.78,
    "15": 99.78,
    "20": 99.64,
    "50": 98.99,
    "100": 98.99
  },
  "mean_depth": 5470.33,
  "mode_depth": 7393,
  "median_depth": 5051
}

These are all based on read mapping to the genome, without using any information on amplicon schemes. The mean, mode and median depths are calculated with respect to the entire genome (amplicon schemes do not cover the whole genome). In that example, 99.64% of the genome (29794bp) had at least 20X read depth. This is the value used during QC (the options --coverage_min_x and --coverage_min_pc), where by default Viridian requires at least 50 percent of the genome to have at least 20X read depth.
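
The coverage check described above can be reproduced from the log along these lines. This is a sketch of the described rule, not Viridian's own code: the function name passes_coverage_qc is hypothetical, its defaults mirror the 20X and 50 percent defaults, and it can only test depth thresholds that are present as keys in the dictionary.

```python
def passes_coverage_qc(read_depth, min_x=20, min_pc=50.0):
    """Return True if at least min_pc percent of the genome has depth >= min_x."""
    percents = read_depth["percent_at_least_x_depth"]
    key = str(min_x)  # JSON keys are strings
    if key not in percents:
        available = sorted(percents, key=int)
        raise KeyError(f"No entry for depth {min_x}; available depths: {available}")
    return percents[key] >= min_pc


read_depth = {
    "percent_at_least_x_depth": {
        "1": 99.87, "2": 99.86, "5": 99.86, "10": 99.78,
        "15": 99.78, "20": 99.64, "50": 98.99, "100": 98.99,
    },
}
print(passes_coverage_qc(read_depth))
```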

amplicon_scheme_name

This is simply a key/value pair with the chosen amplicon scheme, for example:

"amplicon_scheme_name": "COVID-ARTIC-V3"

scheme_choice

This section has details of the amplicon scheme scores, and which scheme was chosen as best matching the reads. Example:

"scheme_choice": {
  "scores": {
    "COVID-ARTIC-V3": 4902,
    "COVID-ARTIC-V4.1": 808,
    "COVID-ARTIC-V5.0-5.3.2_400": 293,
    "COVID-ARTIC-V5.0-5.2.0_1200": 184,
    "COVID-MIDNIGHT-1200": 320,
    "COVID-AMPLISEQ-V1": -193,
    "COVID-VARSKIP-V1a-2b": 59
  },
  "best_schemes": [
    "COVID-ARTIC-V3"
  ],
  "best_score": 4902,
  "best_scheme": "COVID-ARTIC-V3",
  "score_ratio": 0.16
}

In that example, the best scheme was COVID-ARTIC-V3, with a score of 4902. The second-best scheme was COVID-ARTIC-V4.1 with a score of 808. The ratio of these (score_ratio) was 808/4902 = 0.16.

By default, the best score needs to be at least 250, and the ratio no more than 0.5 (options --min_scheme_score and --max_scheme_ratio).

The best_schemes entry is a list to allow for the extremely unlikely (and never seen!) case that two schemes score equally well. If this happened, then the score ratio would be 1 and the run would be halted.
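
The ratio and the two default thresholds can be recomputed from the scores dictionary as follows. This is a sketch under our own assumptions, not Viridian's implementation: the function name evaluate_scheme_scores is hypothetical, and treating a non-positive best score as an infinite ratio is our choice for illustration.

```python
def evaluate_scheme_scores(scores, min_score=250, max_ratio=0.5):
    """Rank schemes by score and apply the score and ratio thresholds."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_scheme, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0
    # Ratio of second-best to best; a tie gives a ratio of exactly 1.
    ratio = second_score / best_score if best_score > 0 else float("inf")
    return {
        "best_scheme": best_scheme,
        "best_score": best_score,
        "score_ratio": ratio,
        "passed": best_score >= min_score and ratio <= max_ratio,
    }


scores = {
    "COVID-ARTIC-V3": 4902,
    "COVID-ARTIC-V4.1": 808,
    "COVID-ARTIC-V5.0-5.3.2_400": 293,
    "COVID-ARTIC-V5.0-5.2.0_1200": 184,
    "COVID-MIDNIGHT-1200": 320,
    "COVID-AMPLISEQ-V1": -193,
    "COVID-VARSKIP-V1a-2b": 59,
}
choice = evaluate_scheme_scores(scores)
print(choice["best_scheme"], round(choice["score_ratio"], 2), choice["passed"])
```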

amplicons

This section contains details of the amplicon scheme that was chosen. It is a list of dictionaries, where each dictionary contains the information for one amplicon.

This is an example amplicon dictionary:

{
  "name": "amplicon_42",
  "primers": {
    "left": [
      {"start": 1242, "end": 1263, "read_count": 940}
    ],
    "right": [
      {"start": 1623, "end": 1650, "read_count": 955}
    ]
  },
  "excluded_primers": {
    "left": [
      {"start": 1200, "end": 1224, "read_count": 0}
    ],
    "right": []
  },
  "start": 1242,
  "end": 1650,
  "dropped": false,
  "reads": {
    "total_reads": 1040,
    "total_reads_fwd_strand": 1020,
    "total_reads_rev_strand": 1020,
    "qc_reads": 2044,
    "qc_bases": 409041,
    "qc_depth": 1000.1,
    "assemble_bases": 40920
  }
}

The primers dictionary shows the primers that were in the amplicon scheme, and supported by reads. The excluded_primers dictionary has the primers that were excluded because of lack of evidence from the reads. The "start" position is the minimum of the left primer start positions, and the "end" position is the maximum of the right primer end positions.

The dropped entry is false if a consensus sequence was successfully produced, otherwise it is true.

In the reads dictionary, the total_* values are for all reads matching to the amplicon. The qc_* are the numbers used after randomly sampling to (by default) 1000X depth. The assemble_bases is the total length of the reads used to make the consensus sequence for the amplicon.
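
To illustrate how these fields relate, the sketch below recomputes the start and end of an amplicon from its primers and lists any dropped amplicons. The helper names are ours, and the example dictionary is a trimmed copy of the one above. The span calculation assumes each amplicon has at least one left and one right primer.

```python
def amplicon_span(amplicon):
    """Return (start, end) from the minimum left start and maximum right end."""
    left_starts = [primer["start"] for primer in amplicon["primers"]["left"]]
    right_ends = [primer["end"] for primer in amplicon["primers"]["right"]]
    return min(left_starts), max(right_ends)


def dropped_amplicon_names(amplicons):
    """Return the names of amplicons for which no consensus was produced."""
    return [amplicon["name"] for amplicon in amplicons if amplicon["dropped"]]


amplicon = {
    "name": "amplicon_42",
    "primers": {
        "left": [{"start": 1242, "end": 1263, "read_count": 940}],
        "right": [{"start": 1623, "end": 1650, "read_count": 955}],
    },
    "start": 1242,
    "end": 1650,
    "dropped": False,
}
span = amplicon_span(amplicon)
print(span, dropped_amplicon_names([amplicon]))
```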

self_qc

This is really intended for debugging. It contains the full pileup information at each masked position of the consensus sequence. Most of this information is also in the QC TSV file qc.tsv.gz.

sequences

This section contains two versions of the consensus sequence:

  • masked_consensus - this is probably the sequence you want. It is the consensus after masking using reads mapped back to the initial consensus sequence. It is the same sequence as that written to the file consensus.fa.gz.
  • unmasked_consensus - the initial consensus sequence before masking.

It also contains various multiple sequence alignment (MSA) sequences:

  • msa_ref/msa_unmasked_consensus/masked_consensus_msa - these all have the same length, and are an MSA of the reference genome, unmasked consensus, and masked consensus sequences. Any of these sequences can contain gaps (- characters).
  • masked_consensus_msa_indel_as_ref/masked_consensus_msa_indel_as_N - these have the same length as the original reference genome, and may be useful for building trees. They are an MSA of the masked consensus and the reference, but gaps in the reference are not allowed (those columns of the MSA are deleted). Gaps in the consensus are replaced either with Ns (masked_consensus_msa_indel_as_N) or with the reference sequence (masked_consensus_msa_indel_as_ref).

The actual JSON entry looks like this, but in this example the sequences have been truncated from their full ~30kbp for readability:

"sequences": {
  "unmasked_consensus": "ACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGC",
  "msa_unmasked_consensus": "------------------------------ACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAA",
  "msa_ref": "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC",
  "masked_consensus": "NNNNNNNNNNNNNNNNNNNNNNNNAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTG",
  "masked_consensus_msa": "------------------------------NNNNNNNNNNNNNNNNNNNNNNNNAGATCTGTTCTCTAAAC",
  "masked_consensus_msa_indel_as_N": "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGATCT",
  "masked_consensus_msa_indel_as_ref": "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGAT"
}
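
The relationship between the MSA sequences and the two reference-length sequences can be sketched as follows. This follows the description above (drop columns where the reference has a gap, then fill gaps in the consensus with N or with the reference base); it is not Viridian's code, the function name project_onto_reference is hypothetical, and the short sequences are made up for illustration.

```python
def project_onto_reference(msa_ref, msa_consensus, fill="N"):
    """Project an aligned consensus onto reference coordinates.

    fill="N" replaces consensus gaps with N; fill="ref" uses the reference base.
    """
    if len(msa_ref) != len(msa_consensus):
        raise ValueError("MSA sequences must have the same length")
    if fill not in ("N", "ref"):
        raise ValueError("fill must be 'N' or 'ref'")
    out = []
    for ref_base, cons_base in zip(msa_ref, msa_consensus):
        if ref_base == "-":
            continue  # insertion relative to the reference: delete the column
        if cons_base == "-":
            cons_base = "N" if fill == "N" else ref_base
        out.append(cons_base)
    return "".join(out)


# Toy alignment: one insertion and one deletion relative to the reference.
toy_msa_ref = "AC-GTAC"
toy_masked_consensus_msa = "ACTG-NC"
as_n = project_onto_reference(toy_msa_ref, toy_masked_consensus_msa, fill="N")
as_ref = project_onto_reference(toy_msa_ref, toy_masked_consensus_msa, fill="ref")
print(as_n, as_ref)
```

Both results have the same length as the ungapped reference, which is what makes them convenient for building alignments across many samples.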