
Minos in single sample mode

martinghunt edited this page Aug 9, 2022 · 4 revisions

This page describes running Minos on a single sample. The use case is that you have one or more VCF files (probably generated using different methods), all with calls against the same reference genome. The calls can overlap, be at the same positions, or be completely different. It does not matter: Minos merges all overlapping variants. Minos takes all variants from all input VCF files and genotypes your sample at each site. The output is a single VCF file.

Input files

Required input files:

  1. The reference genome in FASTA format (must be the same reference as used in all VCF files).
  2. At least one VCF file of calls (with respect to the reference FASTA from 1).
  3. At least one reads file, in FASTA/Q or BAM format.

Input VCF files

Only VCF records that have the GT field present are used from the input VCF file(s). All other records are ignored. Further, only the called alleles from each VCF record are used, and only alleles that consist of one or more A, C, G, T characters. Here is an example VCF input file (header omitted for brevity):

ref_name 100 . C A          . . . GT  0/0
ref_name 101 . C <FOO>      . . . GT  1/1
ref_name 102 . T A,G        . . . GT  0/1
ref_name 103 . G A,C,<BAR>  . . . GT  1/2
ref_name 104 . C T          . . . foo bar

The resulting variants that would be considered for genotyping are: T102A, G103A, G103C. All others are excluded:

  • C100A has reference genotype
  • C101<FOO> only has a non-ACGT allele genotyped
  • T102G was not genotyped (this position was genotyped as the T or A allele)
  • G103<BAR> was not genotyped and is non-ACGT
  • C104T has no GT field.
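
The selection rule above can be sketched in Python. This is only an illustration of the rule as described on this page, not Minos's own code: the function name called_variants and the regular expression are made up for the example, and it assumes one sample column per record.

```python
import re

# Alleles must consist of one or more A, C, G, T characters.
ACGT_ONLY = re.compile(r"^[ACGT]+$")

# The example records from this page (whitespace-separated here;
# real VCF files are tab-delimited, and split() handles both).
EXAMPLE_RECORDS = [
    "ref_name 100 . C A          . . . GT  0/0",
    "ref_name 101 . C <FOO>      . . . GT  1/1",
    "ref_name 102 . T A,G        . . . GT  0/1",
    "ref_name 103 . G A,C,<BAR>  . . . GT  1/2",
    "ref_name 104 . C T          . . . foo bar",
]


def called_variants(vcf_line):
    """Return the variants (e.g. 'T102A') that one VCF record contributes.

    Hypothetical helper: a record contributes only its called,
    non-reference alleles, and only those made of A/C/G/T.
    """
    fields = vcf_line.split()
    if len(fields) < 10:
        return []
    pos, ref, alt = fields[1], fields[3], fields[4]
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return []  # no GT field: the record is ignored
    genotype = fields[9].split(":")[format_keys.index("GT")]
    alleles = [ref] + alt.split(",")
    # Collect the distinct allele indexes that were actually called.
    called = {int(i) for i in re.split(r"[/|]", genotype) if i.isdigit()}
    variants = []
    for index in sorted(called):
        if index == 0 or index >= len(alleles):
            continue  # reference allele, or a malformed index
        if ACGT_ONLY.match(alleles[index]):
            variants.append(f"{ref}{pos}{alleles[index]}")
    return variants


selected = [v for record in EXAMPLE_RECORDS for v in called_variants(record)]
print(selected)  # ['T102A', 'G103A', 'G103C']
```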

Command line examples

Example command, where we have two VCF files of variants:

minos adjudicate --reads reads.fastq.gz outdir ref.fasta 1.vcf 2.vcf

Notes:

  • The --reads option can be used as many times as you like, once for each reads file. Read pairing information is not used, so the order of these files does not matter, e.g. if you have paired reads in two files.
  • outdir is the output directory that will be created to store all output files. It should not already exist (add the --force option to overwrite an existing outdir if you're feeling confident).
  • You can list as many VCF files as you like at the end of the command. The example above has two files.

Example command, where we have two reads files, three VCF files of variants, and overwrite the output directory (if it exists already):

minos adjudicate --force --reads reads1.fastq.gz --reads reads2.fastq.gz outdir ref.fasta 1.vcf 2.vcf 3.vcf

Use the --sample_name option to put the name of your sample into the final VCF file:

minos adjudicate --sample_name sample_42 --reads reads.fastq.gz outdir ref.fasta 1.vcf 2.vcf
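
If you launch Minos from a script, the argument order shown in the commands above can be assembled programmatically. The following is a minimal Python sketch: build_adjudicate_command is a hypothetical helper, not part of Minos, and it uses only the options shown on this page (--reads, --force, --sample_name).

```python
import shlex


def build_adjudicate_command(outdir, ref_fasta, vcf_files, reads_files,
                             sample_name=None, force=False):
    """Build the argument list for 'minos adjudicate' (hypothetical helper)."""
    if not vcf_files:
        raise ValueError("at least one VCF file is required")
    if not reads_files:
        raise ValueError("at least one reads file is required")
    command = ["minos", "adjudicate"]
    if force:
        command.append("--force")
    if sample_name is not None:
        command += ["--sample_name", sample_name]
    for reads in reads_files:
        # --reads is repeated, once per reads file.
        command += ["--reads", reads]
    # Positional arguments: output directory, reference, then the VCF files.
    command += [outdir, ref_fasta, *vcf_files]
    return command


command = build_adjudicate_command(
    "outdir",
    "ref.fasta",
    ["1.vcf", "2.vcf", "3.vcf"],
    ["reads1.fastq.gz", "reads2.fastq.gz"],
    force=True,
)
print(shlex.join(command))
# To actually run it (requires Minos to be installed):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with unusual file names.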

Output files

The important output files are:

  1. final.vcf - this is a VCF file containing the final call set, and is most likely the only file you need.
  2. log.txt - logging information.
  3. debug.calls_with_zero_cov_alleles.vcf - this is the initial call set. It includes all sites (and all their alleles) considered by Minos, after combining all the original input VCF files. Alleles with no coverage are removed from this to make the final call set in final.vcf. If you want a VCF file that is consistent across multiple runs, e.g. runs that use the same input VCF files but where each sample has its own reads file, then you may want to use debug.calls_with_zero_cov_alleles.vcf to compare the outputs.
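
As a quick check on a finished run, you can tally the FILTER column of final.vcf. The sketch below is a hypothetical helper, not part of Minos. It relies only on the general VCF layout (header lines start with "#", columns are tab-separated, FILTER is the seventh column, and multiple filter names are separated by ";").

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count how often each FILTER value occurs in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts


# Usage, assuming the output directory is called outdir:
#   with open("outdir/final.vcf") as handle:
#       print(count_filters(handle))
```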

Call filters

The VCF file final.vcf has these filters implemented in the FILTER column:

  1. MIN_DP - requires the total read depth to be at least 2.
  2. MAX_DP - requires the total read depth to be less than the mean plus 3 standard deviations.
  3. MIN_FRS - "minimum fraction of read support" - requires at least 90% of the reads to support the called allele.
  4. MIN_GCP - "minimum genotype confidence percentile" - this is a "normalised" genotype confidence score that can be used across samples. It is described in full in the Minos publication. Briefly, the GCP is the percentile of the genotype confidence score inside the expected distribution from the genotyping model. Since read depth is a parameter of the model, this is different for each run of Minos (unless two samples happen to have identical read depth/standard deviation etc). A call must have GCP of at least 0.5% to pass.

All of the default filter cutoffs can be changed, using the options --filter_min_dp, --filter_min_frs, --filter_min_gcp, and --filter_max_dp. However, we recommend keeping the defaults unless you have a compelling reason to change them.
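
To make the four default cutoffs concrete, here is a small Python sketch of the filter logic as described above. It is an illustration only, not Minos's implementation: the function and parameter names are made up, FRS is assumed to be expressed as a fraction (0.9 for 90%), GCP as a percentile (0.5 for 0.5%), and the handling of values exactly on a cutoff follows the wording on this page.

```python
def failed_filters(total_depth, frs, gcp, mean_depth, depth_std,
                   min_dp=2, min_frs=0.9, min_gcp=0.5, max_dp_std_devs=3.0):
    """Return the names of the filters that a call fails (empty list = pass).

    Hypothetical helper; units of frs and gcp are assumptions (see above).
    """
    failed = []
    if total_depth < min_dp:
        failed.append("MIN_DP")  # depth must be at least min_dp
    if total_depth >= mean_depth + max_dp_std_devs * depth_std:
        failed.append("MAX_DP")  # depth must be less than mean + 3 std devs
    if frs < min_frs:
        failed.append("MIN_FRS")  # called allele needs >= 90% of reads
    if gcp < min_gcp:
        failed.append("MIN_GCP")  # GCP must be at least 0.5%
    return failed


# A call with depth 30 in a sample with mean depth 30 and standard deviation 5:
print(failed_filters(total_depth=30, frs=0.97, gcp=50.0,
                     mean_depth=30.0, depth_std=5.0))  # []
```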

Advanced options

We do not recommend changing any options other than those listed above.