A mapping-based pipeline for creating a phylogeny from bacterial whole genome sequences
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker / Singularity containers, making installation trivial and results highly reproducible.
The nf-core/bactmap pipeline comes with documentation, found in the `docs/` directory:
- Installation
- Pipeline configuration
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
This pipeline maps paired-end short reads to a bacterial FASTA reference sequence, calls and filters variants, produces a whole genome alignment from pseudogenomes derived from the variants, and finally produces a robust maximum likelihood phylogenetic tree.
The steps are:
- Index a reference sequence using bwa (the reference sequence must contain only the chromosome and no additional sequences such as plasmids).
- (Optional) Fetch reads from the ENA
- Trim reads using trimmomatic (dynamic MIN_LEN based on 33% of the read length)
- Count number of reads and estimate genome size using Mash
- Downsample reads if the `--depth_cutoff` argument was specified
- Map reads to the specified reference genome with bwa mem
- Call variants with samtools
- Filter variants to flag low quality SNPs
- Produce a pseudogenome based on the variants called. Missing positions are encoded as `-` characters and low quality positions as `N`
- All pseudogenomes are concatenated to make a whole genome alignment
- (Optional) Recombination is removed from the alignment using gubbins
- Invariant sites are removed using snp-sites
- (Optional) Maximum likelihood tree generated using IQ-TREE
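
As an illustration of the read-preparation steps above, the following is a minimal Python sketch of the arithmetic behind the dynamic minimum length and the optional downsampling. It is not the pipeline's own code. The function names, the rounding direction, and the depth estimate (read count × read length ÷ genome size) are assumptions made for this example; only the 33% figure and the `--depth_cutoff` argument come from the description above.

```python
import math


def dynamic_min_len(read_length: int, fraction: float = 0.33) -> int:
    """Minimum read length to keep after trimming, as a fraction of the raw read length."""
    # Rounding down is an assumption; the pipeline may round differently.
    return math.floor(read_length * fraction)


def estimated_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate depth of coverage: total sequenced bases divided by genome size."""
    # Assumed formula for illustration; genome_size would come from the Mash estimate.
    return num_reads * read_length / genome_size


def downsample_fraction(depth: float, depth_cutoff: float) -> float:
    """Fraction of reads to keep so that the depth does not exceed depth_cutoff."""
    if depth <= depth_cutoff:
        return 1.0  # already at or below the cutoff, so keep every read
    return depth_cutoff / depth


if __name__ == "__main__":
    min_len = dynamic_min_len(150)
    depth = estimated_depth(num_reads=4_000_000, read_length=150, genome_size=5_000_000)
    keep = downsample_fraction(depth, depth_cutoff=100)
    print(min_len, depth, round(keep, 3))
```

A fraction computed this way could then be passed to a read subsampling tool such as seqtk; how the pipeline actually does this is not shown here.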
A summary of this process is shown below in the diagram that was generated when running Nextflow using the `-with-dag` option
The pipeline outputs will be found in the directory specified by the `--output_dir` argument:
- (Optional) If accession numbers were used as the input source, a directory called `fastqs` will contain the fastq file pairs for each accession number
- A directory called `trimmed_fastqs` containing the reads after trimming with Trimmomatic
- A directory called `sorted_bams` containing the aligned sam files after mapping with bwa mem, conversion to bam and sorting
- A directory called `filtered_bcfs` containing binary vcf files after filtering to flag low quality positions with LowQual in the FILTER column
- A directory called `pseudogenomes` containing
  - the pseudogenome from each sample
  - a whole genome alignment named `aligned_pseudogenome.fas` containing the concatenated sample pseudogenomes and the reference genome
  - a variant only alignment named `aligned_pseudogenome.variants_only.fas` with the invariant sites removed from `aligned_pseudogenome.fas` using snp-sites. If recombination removal was specified, the file will be named `aligned_pseudogenome.gubbins.variants_only.fas` with gubbins having been applied prior to invariant site removal.
- Two newick tree files
  - `aligned_pseudogenome.gubbins.variants_only.contree`: If tree generation was specified, this file containing the consensus tree from IQ-TREE will be produced. The tree has branch supports assigned, with branch lengths optimized on the original alignment. If recombination removal was not specified, the file will be named `aligned_pseudogenome.variants_only.contree`
  - `aligned_pseudogenome.gubbins.variants_only.treefile`: The original IQ-TREE maximum likelihood tree without branch supports. If recombination removal was not specified, the file will be named `aligned_pseudogenome.variants_only.treefile`
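
To make the relationship between the pseudogenome files and the variant only alignment concrete, here is a small Python sketch. It is a simplified illustration, not the code of `filtered_bcf_to_fasta.py` or snp-sites. All function and argument names are hypothetical, positions are 0-based, and the rule used for a variant column (at least two different bases among A, C, G and T, ignoring `-` and `N`) is an assumption that may differ from how snp-sites treats gaps and ambiguous bases.

```python
from typing import Dict, Set

UNAMBIGUOUS = set("ACGT")


def build_pseudogenome(reference: str, snps: Dict[int, str],
                       low_quality: Set[int], missing: Set[int]) -> str:
    """Return a reference-length sequence for one sample (hypothetical helper)."""
    bases = []
    for pos, ref_base in enumerate(reference):
        if pos in missing:
            bases.append("-")   # no data at this position
        elif pos in low_quality:
            bases.append("N")   # position flagged as low quality
        else:
            bases.append(snps.get(pos, ref_base))  # SNP allele, else reference base
    return "".join(bases)


def variant_only_columns(alignment: Dict[str, str]) -> Dict[str, str]:
    """Keep only columns where at least two different unambiguous bases occur."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    keep = [
        i for i in range(length)
        if len({alignment[n][i].upper() for n in names} & UNAMBIGUOUS) > 1
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


if __name__ == "__main__":
    reference = "ACGTACGT"
    alignment = {
        "reference": reference,
        "sample1": build_pseudogenome(reference, {1: "T"}, low_quality={4}, missing={7}),
        "sample2": build_pseudogenome(reference, {1: "T", 5: "G"}, set(), set()),
    }
    for name, seq in variant_only_columns(alignment).items():
        print(name, seq)
```

Because every pseudogenome has the same length as the reference, the sequences can be stacked into an alignment without a separate alignment step, which is why simple concatenation into one file is sufficient.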
nf-core/bactmap was originally written by Anthony Underwood.
- Trimmomatic A flexible read trimming tool for Illumina NGS data.
- mash Fast genome and metagenome distance estimation using MinHash.
- seqtk A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
- bwa mem Burrows-Wheeler Aligner for short-read alignment
- samtools Utilities for the Sequence Alignment/Map (SAM) format
- bcftools Utilities for variant calling and manipulating VCFs and BCFs
- filtered_bcf_to_fasta.py Python utility to create a pseudogenome from a bcf file where each position in the reference genome is included
- gubbins Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences
- snp-sites Finds SNP sites from a multi-FASTA alignment file
- IQ-TREE Efficient software for phylogenomic inference