This workflow is an implementation of the popular DADA2 tool. I followed the steps in the DADA2 tutorial, but used Kraken2 instead of IDTaxa for ASV sequence classification.
The pipeline was inspired by Silas Kieser's (@silask) dada2 snakemake pipeline.
- Rene Welch (@ReneWelch)
There are two steps to install the pipeline:
- Install snakemake and dependencies: Assuming conda is already installed and a copy of this repository has been downloaded, the pipeline can be installed with:

  ```sh
  conda env create -n {env_name} --file dependencies.yml
  # for example:
  conda env create -n microbiome --file dependencies.yml
  ```

  Alternatively, mamba can be used:

  ```sh
  mamba env create -n {env_name} --file dependencies.yml
  # for example:
  mamba env create -n microbiome --file dependencies.yml
  ```
- Install R packages: To install the R packages used by this pipeline, run:

  ```sh
  R CMD BATCH --vanilla ./install_r_packages.R
  ```
If you have other versions of R and R user libraries elsewhere, you might encounter some problems with environment variables and conda. In that case, you may need to provide a local `.bashrc` file that places the conda path at the beginning of your `$PATH` environment variable (you may not need to do this step at all). For example:
```bash
# example ~/.bashrc
source /etc/bash_completion.d/git

# conda initialization (adjust /path/to/miniconda3 to your installation)
__conda_setup="$('/path/to/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/path/to/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/path/to/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/path/to/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup

# run R without saving/restoring workspaces, and put the
# pipeline's conda environment first on $PATH
alias R='R --vanilla'
export PATH="/path/to/miniconda3/envs/microbiome/bin:$PATH"
```
- Set up a `data/` directory with one subdirectory per batch. The pipeline will look in the `data/` directory to find subdirectories that contain `fastq.gz` files. You can specify different filtering and trimming parameters per batch, and `dada2` will learn error rates per batch, so you may wish to separate your files by sequencing batch and/or by sample type.

  ```sh
  mkdir data/
  mkdir data/batch01 [data/batch02 ... ]
  ```

  Next, populate the subdirectories with your `fastq.gz` (or `fastq`) files, making sure to include both R1 and R2. If your data are already located elsewhere in your file system, you can use symbolic links (symlinks) to avoid duplicating data:

  ```sh
  fqdir=/path/to/existing_fastq_files/
  batchdir=data/batch01
  ln -s ${fqdir}/*.fastq.gz ${batchdir}
  ```
- Generate the sample table. The input for the pipeline is the set of sequencing files separated by batch in the `data/` directory. Running:

  ```sh
  Rscript ./prepare_sample_table.R
  ```

  will generate the `samples.tsv` file, which contains 4 columns separated by tabs (the file will not actually contain the `|` characters):

  | batch | key | end1 | end2 |
- Set up the sample metadata file. This is where you link sample IDs to the variables of interest for your study (condition, body site, time point, etc.). As currently implemented, this is only used to augment the final `TreeSummarizedExperiment` file, which you will use for your own downstream analysis. You need to have a column `key` that matches the sample table. For example:

  | key | subject_id | body_site | time_point | ...

  The default path for this file is `data/meta.tsv`. If you have it named otherwise, edit the variable `metadata` in `config/config.yaml` to point to your file.
- Set up the negative control mapping file. If you have negative control samples, you should make a table linking each study sample to its negative control kit(s). As currently implemented, this needs to be a `qs` file containing a tibble of this format:

  ```
  # A tibble: 3 x 3
    batch   key      kits
    <chr>   <chr>    <list>
  1 batch01 sample_1 <chr [3]>
  2 batch01 sample_2 <chr [1]>
  3 batch02 sample_3 <chr [1]>
  ```
  Each element of `key` is a study sample ID matching the sample table. Each element of `kits` is a character vector of sample keys for negative control samples, e.g. `c("sample_141", "sample_142", "sample_143")`.

  The default path for this file is `data/negcontrols.qs`. If you have it named otherwise, edit the variable `negcontroltable` in `config/config.yaml` to point to your file.
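  This file is created by hand. As a minimal sketch (the sample keys below are hypothetical; use the keys from your own `samples.tsv`), a tibble like the one above can be built and saved with the `tibble` and `qs` packages:

  ```r
  # Sketch: build the negative-control mapping and save it as a qs file.
  # Sample keys are hypothetical; replace them with keys from samples.tsv.
  library(tibble)
  library(qs)

  negcontrols <- tibble(
    batch = c("batch01", "batch01", "batch02"),
    key   = c("sample_1", "sample_2", "sample_3"),
    kits  = list(
      c("sample_141", "sample_142", "sample_143"),  # controls run with sample_1
      "sample_144",                                  # control run with sample_2
      "sample_145"                                   # control run with sample_3
    )
  )

  qsave(negcontrols, "data/negcontrols.qs")
  ```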
Now edit `config/config.yaml` to set up the running parameters. At a minimum, you will need to make these edits:
- Confirm general parameters and file paths. Check that these variables are set to your satisfaction:

  ```yaml
  # general parameters for run
  asv_prefix: "asv"                       # this will be the prefix for the ASV IDs
  sample_table: "samples.tsv"             # this file is generated by prepare_sample_table.R
  negcontroltable: "data/negcontrols.qs"  # manually generated
  metadata: "data/meta.tsv"               # manually generated
  threads: 8                              # max number of threads for a multithreaded snakemake rule, eg fastqc in quality_control
  ```
- Run sequence QC. If you have not already looked at your data, the first thing you will want to do is run FastQC to generate quality profiles.

  First, run

  ```sh
  snakemake -j{cores} sequence_qc
  ```

  to run FastQC, plot quality profiles, and build a `multiqc` report. You can find the individual plots in `workflow/report/quality_profiles` and the full report in `output/quality_control/multiqc`.

  Next, inspect those profiles. You may want to use them to decide where to trim R1 and R2 and how to set the minimum overlap.
- Edit parameters for each batch. Check the `config/config.yaml` file for the default `filter_and_trim`, `merge_pairs`, and `qc_parameters` specifications for each batch. Keep in mind the length of your 16S region (e.g., the V4 region is 252 bp long) and whether you will have enough overlap after trimming. Look up other `dada2` resources for more advice on how to choose these parameters.

  Here is an example for a 2x150 NextSeq run targeting the V4 region. This set of parameters keeps only full-length reads and requires an overlap of 25 base pairs. Because the V4 region is only ~252 base pairs long, we suspect that ASVs outside of that range may be bimeras.

  ```yaml
  filter_and_trim:
    batch01:
      truncQ: 0
      trimLeft: 0
      trimRight: 0
      maxLen: Inf
      minLen: 150
      maxN: 0
      minQ: 0
      maxEE: [2, 2]
  ...
  merge_pairs:
    batch01:
      minOverlap: 25
      maxMismatch: 0
  ...
  qc_parameters:
    negctrl_prop: 0.5
    max_length_variation: 5
    low_abundance_perc: 0.0001
  ```

  The defaults provided in the pipeline are from a 2x300 run targeting V3-V4. Here we are truncating the ends of both reads because we noticed a significant loss of quality. Because we have longer reads, we can require more overlap.

  ```yaml
  filter_and_trim:
    dust_dec2018:
      truncQ: 12
      truncLen: [280, 250]
      trimLeft: 0
      trimRight: 0
      maxLen: Inf
      minLen: 100
      maxN: 0
      minQ: 0
      maxEE: [2, 2]
  ...
  merge_pairs:
    dust_dec2018:
      minOverlap: 50
      maxMismatch: 0
  ...
  qc_parameters:
    negctrl_prop: 0.5
    max_length_variation: 50
    low_abundance_perc: 0.0001
  ```
Kraken2 databases need to be downloaded from https://benlangmead.github.io/aws-indexes/k2. The four below are the ones used by default in the config file. Here is an example of downloading them with `wget`:
```sh
# download references
wdir="https://genome-idx.s3.amazonaws.com/kraken"
refdir="data/kraken_dbs"
mkdir -p $refdir

greengenes="16S_Greengenes13.5_20200326"
rdp="16S_RDP11.5_20200326"
silva="16S_Silva138_20200326"
minikraken="minikraken2_v2_8GB_201904"

for db in $greengenes $rdp $silva $minikraken; do
    remote=${wdir}/${db}.tgz
    local=${refdir}/${db}.tgz
    wget $remote -O $local
    tar -xzf ${local} -C ${refdir}
done
```
Replace `{cores}` with the maximum number of cores that you want to be used at one time. The rules `taxonomy` and `phylotree` will take a little longer the first time they are run because they install conda environments.
- `snakemake -j{cores}` runs everything
- `snakemake -j{cores} sequence_qc` plots quality profiles and builds a `multiqc` report
- `snakemake -j{cores} [--nt] dada2` computes the ASV matrix from the different batches (use the `--nt` option to keep intermediate files for troubleshooting)
- `snakemake -j{cores} taxonomy --use-conda` labels the ASV sequences with kraken2; databases need to be downloaded in advance from https://benlangmead.github.io/aws-indexes/k2
- `snakemake -j{cores} phylotree --use-conda` computes the phylogenetic tree using `qiime2`'s FastTree
- `snakemake -j{cores} mia` prepares the `TreeSummarizedExperiment` containing all the data generated
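Once the `mia` rule has finished, the resulting `TreeSummarizedExperiment` can be explored in R. A minimal sketch, assuming the object is saved as a `qs` file (the path below is hypothetical; point it at wherever the `mia` rule writes its output in your run):

```r
# Sketch: load and inspect the final TreeSummarizedExperiment.
# The file path is an assumption; adjust it to the actual mia output.
library(qs)
library(TreeSummarizedExperiment)

tse <- qread("output/mia/tse.qs")  # hypothetical path

tse                 # overview of assays, rowData, colData
head(rowData(tse))  # ASV-level taxonomy assigned with kraken2
head(colData(tse))  # sample metadata merged from data/meta.tsv
rowTree(tse)        # phylogenetic tree from the phylotree rule
```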
Callahan, B., McMurdie, P., Rosen, M. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016). https://doi.org/10.1038/nmeth.3869
Wood, D., Lu, J., Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0
McMurdie, P., Holmes, S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLOS One 8, 4 (2013). https://doi.org/10.1371/journal.pone.0061217
Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2018. “QIIME 2: Reproducible, Interactive, Scalable, and Extensible Microbiome Data Science.” e27295v2. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295v2.
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2010. “FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments.” PloS One 5 (3): e9490.
Huang, Ruizhu, Charlotte Soneson, Felix G. M. Ernst, Kevin C. Rue-Albrecht, Guangchuang Yu, Stephanie C. Hicks, and Mark D. Robinson. 2020. “TreeSummarizedExperiment: A S4 Class for Data with Hierarchical Structure.” F1000Research 9 (October): 1246.