diff --git a/README.md b/README.md
index 4acdfe9..57b75fd 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
[![release](https://img.shields.io/github/v/release/Aveglia/vAMPirus?label=release&logo=github)](https://github.com/Aveglia/vAMPirus/releases/latest)
# Table of contents
-
+* [New in vAMPirus version 2.0.0](#New-in-vAMPirus-version-2.0.0)
* [Quick intro](#Quick-intro)
* [Contact/support](#Contact/support)
* [Getting started](#Getting-started)
@@ -21,21 +21,39 @@
* [Running vAMPirus](#Running-vAMPirus)
* [Who to cite](#Who-to-cite)
+# New in vAMPirus version 2.0.0
+
+1. (EXPERIMENTAL) Added Minimum Entropy Decomposition (MED) analysis using the oligotyping program produced by the Meren Lab. This allows sequence clustering based on sequence positions of interest (biologically meaningful) or the top positions with the highest Shannon entropy (read more here: https://merenlab.org/software/oligotyping/ and below; a launch-command sketch follows this list).
+
+2. Added more useful taxonomic classification of sequences leveraging the RVDB annotation database and/or NCBI taxonomy files (see manual for more info).
+
+3. Replaced the use of MAFFT with muscle v5 (Edgar 2021) for more accurate virus gene alignments (see https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full).
+
+4. Added multiple primer pair removal to deal with multiplexed amplicon libraries.
+
+5. ASV filtering - you can now provide a "filter" and a "keep" database to remove certain sequences from the analysis.
+
+6. Reduced redundancy of processes and the volume of generated result files per full run (for example, read processing is only done once when running DataCheck then Analyze).
+
+7. Nodes on phylogenetic trees can now be colored based on taxonomy or Minimum Entropy Decomposition results.
+
+8. PCoA plots added to Analyze report if NMDS does not converge.
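+
+As a quick sketch, the new MED flags can be added to a DataCheck run to cluster both ASVs and AminoTypes by entropy (this is the same command shown as Example 1 in the Running vAMPirus section below):
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --asvMED --aminoMED
+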
# Quick intro
-Viruses are the most abundant biological entities on the planet and with advances in next-generation sequencing technologies, there has been significant effort in deciphering the global virome and its impact in nature (Suttle 2007; Breitbart 2019). A common method for studying viruses in the lab or environment is amplicon sequencing, an economic and effective approach for investigating virus diversity and community dynamics. The highly targeted nature of amplicon sequencing allows in-depth characterization of genetic variants within a specific taxonomic grouping facilitating both virus discovery and screening within samples. Although, the high volume of amplicon data produced combined with the highly variable nature of virus evolution across different genes and virus-types can make it difficult to scale and standardize analytical approaches. Here we present vAMPirus (https://github.com/Aveglia/vAMPirus.git), an automated and easy-to-use virus amplicon sequencing analysis program that is integrated with the Nextflow workflow manager facilitation easy scalability and standardization of analyses.
+![vAMPirus general workflow](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vAMPirus_generalflow.png)
+
The vAMPirus program contains two different pipelines:
1. DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform future analyses
-![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_datacheckUPDATED.png)
+![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_datacheckV2.png)
2. Analyze pipeline: a comprehensive analysis of the provided data producing a wide range of results and outputs which includes an interactive report with figures and statistics. NOTE- stats option has changed on 2/19/21; you only need to add "--stats" to the launch command without "run"
-![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_analysisUPDATED.png)
+![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_analyzeV2.png)
NOTE => This is a brief overview of how to install and set up vAMPirus; for more detail, see the [manual](https://github.com/Aveglia/vAMPirus/blob/master/docs/HelpDocumentation.md).
@@ -50,28 +68,24 @@ If you have a feature request or any feedback/questions, feel free to email vAMP
## Quick order of operations
-1. Clone vAMPirus from github
-2. Execute the vampirus_startup.sh script to install dependencies and any databases specified
-3. Test installation with supplied test dataset
-5. Launch the DataCheck pipeline with your dataset and adjust parameters if necessary
-6. Launch the Analyze pipeline with your dataset
-
-### Dependencies (see Who to cite section)
-
-1. Python versions 3.6/2.7
-2. Diamond version 0.9.30
-3. FastQC version 0.11.9
-4. fastp version 0.20.1
-5. Clustal Omega version 1.2.4
-6. IQ-TREE version 2.0.3
-7. ModelTest-NG version 0.1.6
-8. MAFFT version 7.464
-9. vsearch version 2.14.2
-10. BBMap version 38.79
-11. trimAl version 1.4.1
-12. CD-HIT version 4.8.1
-13. EMBOSS version 6.5.7.0
-14. seqtk version 1.3
+1. Clone vAMPirus from github
+
+2. Before launching vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases (NOTE: you will need to have the xz program installed before running the startup script when downloading the RVDB database)
+
+3. Test the vAMPirus installation with the provided test dataset (if you have run the startup script, you can see EXAMPLE_COMMANDS.txt in the vAMPirus directory for test commands and other examples)
+
+4. Edit parameters in vampirus.config file
+
+5. Launch the DataCheck pipeline to get summary information about your dataset (e.g. sequencing success, read quality information, clustering behavior of ASVs or AminoTypes)
+
+6. Change any parameters in the vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length, or MED settings informed by the Shannon entropy analysis results)
+
+7. Launch the Analyze pipeline to perform a comprehensive analysis with your dataset
+
+8. Explore results directories and produced final reports (a command sketch of this whole workflow follows below)
+
+
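+Put together, a first session might look like the sketch below (the database option, conda profile, and flags are one possible combination taken from the examples in this README; adapt them to your setup):
+
+    git clone https://github.com/Aveglia/vAMPirus.git
+    cd vAMPirus
+    bash vampirus_startup.sh -d 2 -t                                            # install dependencies + database + NCBI taxonomy files
+    nextflow run vAMPirus.nf -c vampirus.config -profile conda,test --Analyze   # confirm the installation with the test dataset
+    nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck      # summarize your own dataset
+    # adjust vampirus.config based on the DataCheck report, then:
+    nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --stats
+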
+### Installing dependencies (see Who to cite section)
If you plan on using Conda to run vAMPirus, all dependencies will be installed as a Conda environment automatically with the vampirus_startup.sh script.
@@ -136,11 +150,11 @@ The startup script provided in the vAMPirus program directory will install Conda
### vAMPirus startup script
-
To set up and install vAMPirus dependencies, simply move to the vAMPirus directory and run the vampirus_startup.sh script.
cd ./vAMPirus; bash vampirus_startup.sh -h
+
>You can make the vampirus_startup.sh script an executable with -> chmod +x vampirus_startup.sh ; ./vampirus_startup.sh
@@ -152,37 +166,41 @@ You can also use the startup script to install different databases to use for vA
2. The proteic version of the Reference Viral DataBase (RVDB) (See https://f1000research.com/articles/8-530)
3. The complete NCBI NR protein database
-To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option.
+To use the vampirus_startup.sh script to download any or all of the databases listed above, you just need to use the "-d" option; you can also download the NCBI taxonomy files with the "-t" option (see below).
-If we look at the script usage:
+If we take a look at the vampirus_startup.sh script usage:
- General execution:
+General execution:
- vampirus_startup.sh -h [-d 1|2|3|4] [-s]
+vampirus_startup.sh -h [-d 1|2|3|4] [-s] [-t]
- Command line options:
+ Command line options:
- [ -h ] Print help information
+ [ -h ] Print help information
- [ -d 1|2|3|4 ] Set this option to create a database directiory within the current working directory and download the following databases for taxonomy assignment:
+  [ -d 1|2|3|4 ]    Set this option to create a database directory within the current working directory and download the following databases for taxonomy assignment:
- 1 - Download the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
- 2 - Download only NCBIs Viral protein RefSeq database
- 3 - Download only the complete NCBI NR protein database
- 4 - Download all three databases
+ 1 - Download only the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
+ 2 - Download only NCBIs Viral protein RefSeq database
+ 3 - Download only the complete NCBI NR protein database
+ 4 - Download all three databases
- [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
+ [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
+ [ -t ] Set this option to download NCBI taxonomy files needed for DIAMOND to assign taxonomic classification to sequences (works with NCBI type databases only, see manual for more information)
-For example, if you would like to install Nextflow, download NCBIs Viral protein RefSeq database, and check/install conda, run:
- bash vampirus_startup.sh -d 1
+For example, if you would like to install Nextflow, download NCBI's Viral protein RefSeq database and the NCBI taxonomy files for the DIAMOND taxonomy assignment feature, and check/install Conda, run:
+
+ bash vampirus_startup.sh -d 2 -t
and if we wanted to do the same thing as above but skip the Conda check/installation, run:
- bash vampirus_startup.sh -d 1 -s
+ bash vampirus_startup.sh -d 2 -s
+
+NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed.
-NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed. Then move to the vAMPirus directory and run the test commands.
+**NEW in version 2.0.0** -> the startup script will automatically download annotation information from RVDB to infer Lowest Common Ancestor (LCA) information for hits during taxonomy assignment. You can also use "-t" to download NCBI taxonomy files to infer taxonomy using the DIAMOND taxonomy classification feature.
# Testing vAMPirus installation
@@ -203,11 +221,11 @@ OR
### Analyze test =>
- /path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --stats
+ `/path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze`
OR
- nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats
+ `nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze`
# Running vAMPirus
@@ -215,11 +233,12 @@ OR
If you have done the setup and confirmed installation success with the test commands, you are good to get going with your own data. Before getting started, edit the configuration file with the parameters and other options you plan to use.
Here are some example vAMPirus launch commands:
+
### DataCheck pipeline =>
-Example 1. Launching the vAMPirus DataCheck pipeline using conda
+Example 1. Launching the vAMPirus DataCheck pipeline using conda with Shannon entropy analysis on ASVs and AminoTypes
- nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck
+ `nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --asvMED --aminoMED`
Example 2. Launching the vAMPirus DataCheck pipeline using Singularity and multiple primer removal with the path to the fasta file with the primer sequences set in the launch command
@@ -229,21 +248,19 @@ Example 3. Launching the vAMPirus DataCheck pipeline with primer removal by glob
nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --GlobTrim 20,26
-
### Analyze pipeline =>
Example 4. Launching the vAMPirus Analyze pipeline with Singularity with ASV and AminoType generation with all accessory analyses (taxonomy assignment, EMBOSS, IQTREE, statistics)
- nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats
+ `nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats`
Example 5. Launching the vAMPirus Analyze pipeline with conda to perform multiple primer removal and protein-based clustering of ASVs, but skip most of the extra analyses
nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --pcASV --skipPhylogeny --skipEMBOSS --skipTaxonomy --skipReport
-Example 6. Launching vAMPirus Analyze pipeline with conda to produce only ASV-related results
-
- nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats
+Example 6. Launching the vAMPirus Analyze pipeline with conda to produce only ASV- and AminoType-based results with Shannon entropy analyses, with the nodes on produced phylogenies colored based on taxonomy hit
+ `nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --asvMED --aminoMED --nodeCol TAX --stats`
## Resuming analyses =>
@@ -251,8 +268,7 @@ If an analysis is interrupted, you can use Nextflows "-resume" option that will
For example if the analysis launched with the command from Example 6 above was interrupted, all you would need to do is add the "-resume" to the end of the command like so:
- nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats -resume
-
+ `nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --asvMED --aminoMED --nodeCol TAX --stats -resume`
# Who to cite:
@@ -260,7 +276,7 @@ If you do use vAMPirus for your analyses, please cite the following ->
1. vAMPirus - Veglia A.J., Rivera Vicéns R.E., Grupstra C.G.B., Howe-Kerr L.I., Correa A.M.S. (2021) vAMPirus: An automated, comprehensive virus amplicon sequencing analysis program (Version v1.0.1). Zenodo. http://doi.org/10.5281/zenodo.4549851
-2. Diamond - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
+2. DIAMOND - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
3. FastQC - Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
@@ -272,7 +288,7 @@ If you do use vAMPirus for your analyses, please cite the following ->
7. ModelTest-NG - Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B., & Flouri, T. (2020). ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Molecular biology and evolution, 37(1), 291-294.
-8. MAFFT - Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
+8. muscle v5 - R.C. Edgar (2021) "MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping" https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full.pdf
9. vsearch - Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.
@@ -287,3 +303,5 @@ If you do use vAMPirus for your analyses, please cite the following ->
14. seqtk - Li, H. (2012). seqtk Toolkit for processing sequences in FASTA/Q formats. GitHub, 767, 69.
15. UNOISE algorithm - R.C. Edgar (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257
+
+16. Oligotyping - A. Murat Eren, Gary G. Borisy, Susan M. Huse, Jessica L. Mark Welch (2014). Oligotyping analysis of the human oral microbiome. Proceedings of the National Academy of Sciences Jul 2014, 111 (28) E2875-E2884; DOI: 10.1073/pnas.1409644111
diff --git a/bin/muscle5.0.1278_linux64 b/bin/muscle5.0.1278_linux64
new file mode 100644
index 0000000..53f3e8f
Binary files /dev/null and b/bin/muscle5.0.1278_linux64 differ
diff --git a/bin/vAMPirus_DC_Report.Rmd b/bin/vAMPirus_DC_Report.Rmd
index 3320070..c986bb5 100644
--- a/bin/vAMPirus_DC_Report.Rmd
+++ b/bin/vAMPirus_DC_Report.Rmd
@@ -4,19 +4,12 @@ date: "Generated on: `r Sys.time()`"
output: html_document
params:
interactive: TRUE
- fastpcsv: !r commandArgs(trailingOnly=T)[2]
- reads_per_sample_preFilt: !r commandArgs(trailingOnly=T)[3]
- read_per_sample_postFilt: !r commandArgs(trailingOnly=T)[4]
- preFilt_baseFrequency: !r commandArgs(trailingOnly=T)[5]
- postFilt_baseFrequency: !r commandArgs(trailingOnly=T)[6]
- preFilt_qualityScore: !r commandArgs(trailingOnly=T)[7]
- postFilt_qualityScore: !r commandArgs(trailingOnly=T)[8]
- preFilt_averageQuality: !r commandArgs(trailingOnly=T)[9]
- postFilt_averageQuaulity: !r commandArgs(trailingOnly=T)[10]
- preFilt_length: !r commandArgs(trailingOnly=T)[11]
- postFilt_length: !r commandArgs(trailingOnly=T)[12]
- number_per_percentage_nucl: !r commandArgs(trailingOnly=T)[13]
- number_per_percentage_prot: !r commandArgs(trailingOnly=T)[14]
+ projtag: !r commandArgs(trailingOnly=T)[1]
+ skipReadProcessing: !r commandArgs(trailingOnly=T)[2]
+ skipMerging: !r commandArgs(trailingOnly=T)[3]
+ skipAdapterRemoval: !r commandArgs(trailingOnly=T)[4]
+ asvMED: !r commandArgs(trailingOnly=T)[5]
+ aminoMED: !r commandArgs(trailingOnly=T)[6]
---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE,
+ message = FALSE,
+ warning = FALSE,
+ out.width="100%")
+```
+
+```{r pathways, echo=FALSE}
+knitr::include_graphics("vamplogo.png")
+```
+
+```{r load_libraries, include=FALSE}
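+# plotting, data-wrangling, and tree-visualization packages used throughout this report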
+library(vegan)
+library(tidyverse)
+library(scales)
+library(cowplot)
+library(dplyr)
+library(ggtree)
+library(plotly)
+library(knitr)
+library(kableExtra)
+library(rmarkdown)
+library(processx)
+library(ape)
+```
+
+```{r colors, include=FALSE}
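+# color palette applied to samples/treatments in the report figures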
+mycol=c('#088da5','#73cdc8','#ff6f61','#7cb8df','#88b04b','#00a199','#6B5B95','#92A8D1','#b0e0e6','#ff7f50','#088d9b','#E15D44','#e19336')
+```
+
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+NOTE: Most plots are interactive and you can use the legend to specify samples/treatment of interest. You can also download an .svg version of each figure within this report.
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+
-
-
-
-
-```{r tree, echo=FALSE}
-#tree=read.newick("vAMPrun_otu.55.raxml.support")
-tree=read.tree(params$tree)
-p1 <- ggtree(tree)
-id <- tree$tip.label
-dat <- tibble::tibble(id = id)
-metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id'))
-p2 <- p1 + geom_point(data = metat, aes(x = x, y = y, label = id))
-ggplotly(p2, tooltip = "label")
-```
-This tree is a maximum likelihood tree made with IQTREE2 and the parameters you specified in the vampirus.config file. Also, this is an interactive tree, you can zoom in and hover on nodes to know the sequence ID. For a better visualization of this tree, you can find the *.treefile with bootstrap support values within the results directory and visualize using programs like FigTree or ITOL.
-
-
-
-
-
-
-
-
-
diff --git a/bin/virtualribosomev2/dna2pep.py b/bin/virtualribosomev2/dna2pep.py
old mode 100644
new mode 100755
index 6f69015..73196d6
--- a/bin/virtualribosomev2/dna2pep.py
+++ b/bin/virtualribosomev2/dna2pep.py
@@ -1,4 +1,6 @@
-#!/usr/bin/env python2
+#!/usr/bin/env python3
+
+# Script was modified from the original to use Python 3
# Copyright 2006 Rasmus Wernersson, Technical University of Denmark
#
@@ -29,60 +31,60 @@
SYNOPSIS
dna2pep [options] [input files] [-f outfile]
-
+
DESCRIPTION
TRANSLATION: The translation engine of dna2pep has full support for handling
degenerate nucleotides (IUPAC definition, e.g. W = A or T, S = G or C).
- All translation table defined by the NCBI taxonomy group is included,
+    All translation tables defined by the NCBI taxonomy group are included,
and a number of options determining the behaviour of STOP and START
codons are available.
-
- INTRON and EXONS: dna2pep natively understands TAB files containing
+
+ INTRON and EXONS: dna2pep natively understands TAB files containing
Intron/Exon annotation (gb2tab / FeatureExtract). When translating
files containing Intron/Exon structure, dna2pep will annotate the
underlying gene-structure in the annotation of the translated
sequence.
-
+
Input files can be in FASTA (no Intron/Exon annotation) RAW (single
sequence with no header - all non-letters are discarded) or TAB
(including annotation) FORMAT. The output format will by default be FASTA
- for files without annotation and TAB for files including annotation.
+ for files without annotation and TAB for files including annotation.
The file format is autodetected by investigating the first line
of the input.
If no input files are specified, dna2pep will read from STDIN.
-
+
OPTIONS
-F, --outfile
- Optional - specify an output file. If no output file is
+ Optional - specify an output file. If no output file is
specified the output will go to STDOUT.
-
+
-O, --outformat
- Specify output format (see also the --fasta, --tab,
+ Specify output format (see also the --fasta, --tab,
--report options below):
-
+
FASTA: Fasta format (plain DNA, no sequence annotation)
-
+
TAB: Tab format. Each line contains the following four
fields, separated by tabs:
name, seq, ann, comment
-
+
See gb2tab (FeatureExtract) for details.
-
- REPORT: A nice visualization of the results.
-
- AUTO: [Default] Generate a both a report and sequence output
- (use the same format as the one detected from the for
+
+ REPORT: A nice visualization of the results.
+
+        AUTO:   [Default] Generate both a report and sequence output
+                (use the same format as the one detected from
the input files).
--fasta filename
Write output sequences in FASTA format to the specified file.
Use '-' to indicate STDOUT.
-
+
--tab filename
Write output sequences in TAB format to the specified file.
Use '-' to indicate STDOUT.
-
+
--report filename
Write report to the specified file.
Use '-' to indicate STDOUT.
@@ -90,121 +92,121 @@
-m, --matrix tablename/file
Use alternative translation matrix instead of the build-in
Standard Genetic Code for translation.
-
- If "tablename" is 1-6,9-16 or 21-23 one of the alternative
- translation tables defined by the NCBI taxonomy group will be
+
+ If "tablename" is 1-6,9-16 or 21-23 one of the alternative
+ translation tables defined by the NCBI taxonomy group will be
used.
-
+
Briefly, the following tables are defined:
-----------------------------------------
- 1: The Standard Code
- 2: The Vertebrate Mitochondrial Code
- 3: The Yeast Mitochondrial Code
- 4: The Mold, Protozoan, and Coelenterate Mitochondrial Code
- and the Mycoplasma/Spiroplasma Code
- 5: The Invertebrate Mitochondrial Code
- 6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
- 9: The Echinoderm and Flatworm Mitochondrial Code
- 10: The Euplotid Nuclear Code
- 11: The Bacterial and Plant Plastid Code
- 12: The Alternative Yeast Nuclear Code
- 13: The Ascidian Mitochondrial Code
- 14: The Alternative Flatworm Mitochondrial Code
- 15: Blepharisma Nuclear Code
- 16: Chlorophycean Mitochondrial Code
- 21: Trematode Mitochondrial Code
- 22: Scenedesmus obliquus mitochondrial Code
- 23: Thraustochytrium Mitochondrial Code
-
+ 1: The Standard Code
+ 2: The Vertebrate Mitochondrial Code
+ 3: The Yeast Mitochondrial Code
+ 4: The Mold, Protozoan, and Coelenterate Mitochondrial Code
+ and the Mycoplasma/Spiroplasma Code
+ 5: The Invertebrate Mitochondrial Code
+ 6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
+ 9: The Echinoderm and Flatworm Mitochondrial Code
+ 10: The Euplotid Nuclear Code
+ 11: The Bacterial and Plant Plastid Code
+ 12: The Alternative Yeast Nuclear Code
+ 13: The Ascidian Mitochondrial Code
+ 14: The Alternative Flatworm Mitochondrial Code
+ 15: Blepharisma Nuclear Code
+ 16: Chlorophycean Mitochondrial Code
+ 21: Trematode Mitochondrial Code
+ 22: Scenedesmus obliquus mitochondrial Code
+ 23: Thraustochytrium Mitochondrial Code
+
See http://www.ncbi.nlm.nih.gov/Taxonomy [Genetic Codes]
for a detailed description. Please notice that the table
of start codons is also used (see the --allinternal option
below for details).
-
+
If a filename is supplied the translation table is read from
- file instead.
-
+ file instead.
+
The file should contain one line per codon in the format:
-
+
codon aa    (aa = single letter code)
-
- All 64 codons must be included. Stop codons is specified
+
+    All 64 codons must be included. Stop codons are specified
by "*". T and U are interchangeable. Blank lines and lines
starting with "#" are ignored.
-
+
See the "gcMitVertebrate.mtx" file in the dna2pep source
distribution for a well documented example.
-r x, --readingframe=x
Specify the reading frame. For input files in TAB format this
options is ignored, and the reading frame is build from the
- annotated Intron/Exon structure.
-
+ annotated Intron/Exon structure.
+
1: Reading frame 1 (e.g. ATGxxxxxx). DEFAULT.
2: Reading frame 2 (e.g. xATGxxxxx).
3: Reading frame 3 (e.g. xxATGxxxx).
-
+
-1: Reading frame 1 on the minus strand.
-2: Reading frame 2 on the minus strand.
-3: Reading frame 3 on the minus strand.
-
- all: Try all reading frames.
+
+ all: Try all reading frames.
This option also implies the -x option.
-
+
plus: All positive reading frames.
This option also implies the -x option.
-
+
minus: All negative reading frames.
This option also implies the -x option.
-
+
-o mode, --orf mode
- Report longest ORF in the reading frame(s) specified with the
+ Report longest ORF in the reading frame(s) specified with the
-r option.
-
- Mode governs which criterias are used to allow the opening of
- an ORF. "Strict start codons" => codons _always_ coding for
- methione (e.g. ATG in the standard code), "Minor start codons"
- => codon only coding for methionine at the start positon
- (e.g. TTG in the standard genetic code).
-
+
+        Mode governs which criteria are used to allow the opening of
+        an ORF. "Strict start codons" => codons _always_ coding for
+        methionine (e.g. ATG in the standard code), "Minor start codons"
+        => codons only coding for methionine at the start position
+        (e.g. TTG in the standard genetic code).
+
Mode can be:
------------
strict: Open an ORF at "strict start codons" only.
any: Open an ORF at any start codon.
- none: Do not use start codons - look for the longest
+ none: Do not use start codons - look for the longest
fragment before a STOP codon.
-
- The DNA fragment usedfor encoding the ORF will be added to the
+
+        The DNA fragment used for encoding the ORF will be added to the
comment field (TAB format only).
-
+
-a, --allinternal
By default the very first codon in each sequences is assumed
to be the initial codon on the transcript. This means certain
- non-methionine codons actually codes for metionine at this
+        non-methionine codons actually code for methionine at this
position. For example "TTG" in the standard genetic code (see
above).
-
- Selecting this option treats all codons as internal codons.
-
+
+ Selecting this option treats all codons as internal codons.
+
-x, --readthroughstop
Allow the translation to continue after a stop codon is reached.
The stop codon will be marked as "*".
-
+
-p, --plain, --ignoreannotation
Ignore annotation for TAB files. If this options is selected
TAB files will be treated in same way as FASTA files.
- -c, --comment
+ -c, --comment
Preserve the comment field in TAB files. Normally the comment
field is silently dropped, since it makes no sense for FASTA
files.
-
+
-C, --processcomment
Works as the -c option described above, except a bit of intelligent
parsing is done on the comment field: If a "/spliced_product"
sub-field is found (from TAB files create by FeatureExtract / gb2tab)
only the part of the comment field before the DNA specific information
- is kept in the comment field.
+ is kept in the comment field.
-e, --exonstructure
Default for TAB files. Annotate the underlying exons structure
@@ -212,19 +214,19 @@
are fully or partially encoded within the first exon get the
annotation character "1", positions in the secon exon get the
character "2" etc.
-
+
The hex-decimal system is used, which means up to 15 exons can
be uniquely annotated, before the numbering wraps around to "0".
-
+
-i, --intronphase
Annotate where an intron interrupted the DNA sequences, and how
the intron did cut the readingframe.
-
+
0 : phase-0 intron (inbetween the previous and current position).
1 : phase-1 intron.
2 : phase-2 intron.
-
-
+
+
AUTHOR
Rasmus Wernersson, raz@cbs.dtu.dk
Feb-Mar 2006
@@ -234,11 +236,11 @@
WEB PAGE
http://www.cbs.dtu.dk/services/VirtualRibosome/
-
+
REFERENCE
Rasmus Wernersson
Virtual Ribosome - Comprehensive DNA translation tool.
- Submitted to Nucleic Acids Research, 2006
+ Submitted to Nucleic Acids Research, 2006
"""
import sys, re, mod_translate,string
@@ -248,7 +250,7 @@
complDNA = "TAACGRYWSMKVHDBN"
allValid = validDNA+validDNA.lower()
-transTable = string.maketrans(validDNA+validDNA.lower(),complDNA+complDNA.lower())
+transTable = str.maketrans(validDNA+validDNA.lower(),complDNA+complDNA.lower())
pwidth = 90
@@ -271,41 +273,41 @@ def makePretty(title,vals,labels,max_len):
val = vals[j]
lab = labels[j]
pos = min(max_len,i+pwidth)
- if lab:
+ if lab:
spos = "%d" % pos
else:
spos = ""
-
+
s = "%-2s %s %s\n" % (lab,val[i:pos],spos)
l.append(s)
l.append("\n")
- return l
+ return l
def explodePep(s):
l = list(s)
return " "+" ".join(l)+" "
-
+
def revCom(dna):
l = list(dna)
l.reverse()
rev_dna = "".join(l)
return rev_dna.translate(transTable)
-
+
def dnaComplement(dna):
return dna.translate(transTable)
-
+
def revStr(s):
l = list(s)
l.reverse()
return "".join(l)
-
+
def isDNAValid(dna):
for c in dna:
if not c in allValid: return False
-
+
return True
-
+
def combineToTab(name, seql):
seq = "".join(seql)
ann= "."*len(seq)
@@ -320,19 +322,19 @@ def readRaw(lines):
sl = []
for c in line.strip().upper():
if c.isalpha(): sl.append(c)
-
+
seql.append("".join(sl))
-
+
l.append(combineToTab("Seq1",seql))
- return l
-
+ return l
+
# Read FASTA format
def readFasta(lines):
l = []
-
+
name = ""
seql = []
-
+
for line in lines:
line = line.strip()
if line.startswith(">"):
@@ -342,104 +344,105 @@ def readFasta(lines):
seql = []
else:
seql.append(line)
-
+
if name: l.append(combineToTab(name,seql))
-
+
return l
-# read TAB format
+# read TAB format
def readTab(lines):
l = []
-
+
for line in lines:
line = line.strip()
tokens = line.split("\t")
if len(tokens) < 2: continue
-
+
name = tokens[0]
seq = tokens[1]
-
+
if len(tokens) > 2:
ann = tokens[2]
else:
ann = "."*len(seq)
-
+
if len(tokens) > 3:
com = tokens[3]
else:
com = ""
-
+
l.append( (name,seq,ann,com) )
-
+
return l
-
+
def readInput(lines):
if not lines: return ([], True)
-
- line = lines[0]
+
+ line = lines[0]
tokens = line.split("\t")
-
+
if line.startswith(">") and len(tokens) < 3:
l = readFasta(lines)
isFasta = True
-
+
elif len(tokens) == 1:
l = readRaw(lines)
isFasta = True
-
+
else:
l = readTab(lines)
isFasta = False
-
+
return (l, isFasta)
-
+
def writeFasta(seqs, outstream):
for (name, seq, ann, com) in seqs:
- print >> outstream, ">"+name
+ print(">"+name, file=outstream)
for i in range(0,len(seq),60):
- print >> outstream, seq[i:i+60]
-
+ print(seq[i:i+60], file=outstream)
+
def writeTab(seqs, outstream):
for tokens in seqs:
- print >> outstream, "\t".join(tokens)
+ print("\t".join(tokens), file=outstream)
def openForWriteOrDie(outfile):
try:
- outstream = file(outfile,"w")
+ outstream = open(outfile,"w")
- except IOError, (strerror):
- print >> sys.stderr, "ERROR - cannot write to the specified file %s [%s]" % (outfile,strerror)
+    except IOError as strerror:
+        print("ERROR - cannot write to the specified file %s [%s]" % (outfile,strerror), file=sys.stderr)
sys.exit(-1)
-
+
return outstream
def parseOpts():
- # Quick hack to overrule the -h and --help feature
- # build into the optpase module
- if "-h" in sys.argv or "--help" in sys.argv:
- print __doc__
- sys.exit(0)
+ # Quick hack to overrule the -h and --help feature
+ # build into the optpase module
+ if "-h" in sys.argv or "--help" in sys.argv:
+ print(__doc__)
+ sys.exit(0)
parser = OptionParser()
-
+
# File handling
parser.add_option("-F","--outfile", type="string", dest="outfile", default="")
parser.add_option("-O","--outformat", type="string", dest="outformat", default="AUTO")
parser.add_option("--tab", type="string", dest="tabfile", default="")
parser.add_option("--fasta", type="string", dest="fastafile", default="")
parser.add_option("--report", type="string", dest="reportfile", default="")
-
+
# Matrix
parser.add_option("-m","--matrix", type="string", dest="matrix", default="1")
-
+
# Reading frame
parser.add_option("-r","--readingframe", type="string", dest="readingframe", default="1")
-
- # ORF finding
+
+ # ORF finding
parser.add_option("-o","--orf", type="string", dest="orf", default="")
-
+
# Stop and Start codons
parser.add_option("-a","--allinternal", action="store_true", dest="allinternal", default = False)
parser.add_option("-x","--readthroughstop", action="store_true", dest="readthroughstop", default = False)
@@ -453,40 +456,40 @@ def parseOpts():
parser.add_option("-c","--comment", action = "store_true", dest = "keepcomment", default=False)
parser.add_option("-C","--processcomment", action = "store_true", dest = "processcomment", default=False)
- # Debug
- parser.add_option("-d","--debug", action="store_true", dest="debug", default=False)
-
- (opt, args) = parser.parse_args()
-
+ # Debug
+ parser.add_option("-d","--debug", action="store_true", dest="debug", default=False)
+
+ (opt, args) = parser.parse_args()
+
# Check reading frame
if not opt.readingframe in ["1", "2", "3", "-1", "-2", "-3", "all", "plus","minus"]:
sys.stderr.write("Invalid reading frame [%s]\n" % opt.readingframe)
sys.exit(-1)
-
+
if opt.readingframe in ["all","plus","minus"]:
opt.readthroughstop = True
-
+
# Chech ORF mode
if not opt.orf in ["","strict","any","none"]:
- print >> sys.stderr, "Invalid ORF mode [%s]\n" % opt.orf
+ print("Invalid ORF mode [%s]\n" % opt.orf, file=sys.stderr)
sys.exit(-1)
-
+
# Chech output format
opt.outformat = opt.outformat.upper()
if not opt.outformat in ["AUTO","TAB","FASTA","REPORT"]:
- print >> sys.stderr("Invalid output format [%s]\n" % opt.outformat)
+        print("Invalid output format [%s]\n" % opt.outformat, file=sys.stderr)
sys.exit(-1)
-
+
# Check mutually exclusive options
if opt.intronrf: opt.exonann=False
-
+
return (opt, args)
if __name__ == "__main__":
reports = []
pepseqs = []
opt, args = parseOpts()
-
+
# Initialize translation matrix
mtx = mod_translate.parseMatrixFile(opt.matrix)
if not mtx:
@@ -504,16 +507,16 @@ def parseOpts():
lines = []
for fn in args:
try:
- lines += (file(fn,"r").readlines())
+ lines += (open(fn,"r").readlines())
except:
sys.stderr.write("ERROR: Cannot read from file '%s'\n" % fn)
sys.exit(-1)
-
+
try:
(seq_list, isFasta) = readInput(lines)
- except Exception, msg:
- print >> sys.stderr, "ERROR parsing input files. Please verify the format (FASTA, RAW or TAB)"
- print >> sys.stderr, "[%s]" % (str(msg))
+ except Exception as msg:
+ print("ERROR parsing input files. Please verify the format (FASTA, RAW or TAB)", file=sys.stderr)
+ print("[%s]" % (str(msg)), file=sys.stderr)
sys.exit(-1)
# Ignore annotation?
@@ -525,36 +528,36 @@ def parseOpts():
for (name, dna, ann, com) in seq_list:
seq_proc = ann_proc = exnum_proc = ""
-
+
# Test if the DNA sequence is valid
if not isDNAValid(dna):
sys.stderr.write("Non IUPAC characters is detected in sequence '%s' - skipping this entry\n" %name)
#sys.stderr.write("seq: %s\n" % seq)
continue
-
+
# Files without intron/exon annotation -------------------------------------------
if isFasta:
# reports.append([ORF_ANNOTATION])
-
+
d_collect = {}
if opt.readingframe in ["all"]:
rf_list = ["1","2","3","-1","-2","-3"]
echo_rf = True
-
+
elif opt.readingframe == "plus":
rf_list = ["1","2","3"]
echo_rf = True
-
+
elif opt.readingframe == "minus":
rf_list = ["-1","-2","-3"]
echo_rf = True
-
+
else:
rf_list = [opt.readingframe]
echo_rf = False
-
+
for rf in rf_list:
-
+
# Find current reading frame
if rf == "1":
qseq = dna
@@ -574,19 +577,19 @@ def parseOpts():
# Do the actual translation
pep = mod_translate.translate(qseq,mtx,not opt.allinternal,opt.readthroughstop)
pa = mod_translate.annotate(qseq,mtx)
-
+
# The annotation string may be longer that the peptide, if the -x more is not used
- pa = pa[:len(pep)]
+ pa = pa[:len(pep)]
# Store translated sequence
if echo_rf:
cname = name+"_rframe"+rf
else:
cname = name
-
+
data = ( cname,pep,pa,qseq )
d_collect[rf] = data
-
+
# Do ORF finding?
if opt.orf:
#Find longest ORF
@@ -594,7 +597,7 @@ def parseOpts():
bestspan = (0,0)
bestdata = None
bestrf = ""
- for key in d_collect.keys():
+ for key in list(d_collect.keys()):
data = d_collect[key]
seq = data[1]
ann = data[2]
@@ -617,52 +620,52 @@ def parseOpts():
j = j_strict
else:
j = min(j_strict,j_any)
-
+
if j == -1: break
-
+
#Step 2 - find stop
m = ann.find("*",j)
if m == -1: m = len(ann)
-
- #print j,m
-
+
+ #print j,m
+
if (m - j) > bestlen:
bestlen = m - j
bestspan = (j, m)
bestdata = data
bestrf = key
-
+
#print rf, bestlen
#print seq[j:m+1]
#print ann[j:m+1]
-
+
j = m
# Format the best hit
if not bestdata:
- bestdata = d_collect.values()[0]
+ bestdata = list(d_collect.values())[0]
bestspan = (0,0)
- bestrf = d_collect.keys()[0]
-
+ bestrf = list(d_collect.keys())[0]
+
msg = "NO ORF FOUND (given the criteria '%s') for sequence '%s'\n\n" % (opt.orf,name)
- reports.append([msg])
-
+ reports.append([msg])
+
# clist = list(bestdata[1].lower())
name = bestdata[0]
pep = bestdata[1]
ann = bestdata[2]
dna_work = bestdata[3]
-
+
bpos, epos = bestspan
orf_dna = dna_work[bpos*3:epos*3]
orf = mod_translate.translate(orf_dna,mtx,True,False)
new_pep = " "*bpos + orf + " "*(len(pep)-epos)
-
+
name += "_ORF"
-
+
d_collect = {}
d_collect[bestrf] = (name,new_pep,ann,orf_dna)
#print d_collect
-
+
# Processing and Pretty printing...
if opt.readingframe in ["all","plus","minus"] and (not opt.orf):
if opt.readingframe == "all":
@@ -677,17 +680,17 @@ def parseOpts():
dna_plus = dna
dna_minus = revCom(dna)
-
+
#dnapl = list(dna_plus)
#dnaml = list(dna_minus)
# dnapann = [" "]*len(dna_plus)
# dnamann = [" "]*len(dna_minus)
dnapann = ["."]*len(dna_plus)
dnamann = ["."]*len(dna_minus)
-
+
vals = []
labels = []
-
+
if doPlus:
pep1 = d_collect["1"][1]
ann1 = d_collect["1"][2]
@@ -720,8 +723,8 @@ def parseOpts():
#dna_plus = "".join(dnapl)
vals += [pep3e,pep2e,pep1e,dna_plus,"".join(dnapann)]
labels += ["","","","5'",""]
-
- if doMinus:
+
+ if doMinus:
pepm1 = d_collect["-1"][1]
annm1 = d_collect["-1"][2]
pepm1e = explodePep(pepm1)+" "+" "
@@ -741,7 +744,7 @@ def parseOpts():
peps = [pepm1,pepm2,pepm3]
maxlen = max(len(pepm1),len(pepm2))
maxlen = max(maxlen,len(pepm3))
-
+
for i in range(0,maxlen):
for j in range(0,3):
ann = anns[j]
@@ -766,37 +769,37 @@ def parseOpts():
# vals = [pep3,pep2,pep1,dna_plus,revStr(dna_minus),revStr(pepm1),revStr(pepm2),revStr(pepm3)]
# labels = ["","","","5'","3'","","",""]
title = "%s - reading frame(s): %s" % (name,opt.readingframe)
-
+
l = makePretty(title,vals,labels,len(dna_plus))
-
+
###print "".join(l)
reports.append(l)
-
+
else:
- rf_list = d_collect.keys()
+ rf_list = list(d_collect.keys())
rf = rf_list[0]
-
+
vals = []
labels = []
title = "%s\nReading frame: %s" % (name,rf)
-
+
#rf_int = abs(int(rf)) - 1
pep = d_collect[rf][1]
pepx = explodePep(pep)
-
+
vals = []
labels = []
-
+
vals.append(pepx)
labels.append("")
-
+
pa = d_collect[rf][2]
pax = ["."]*len(pa)*3
if rf.startswith("-"):
dna_minus = revCom(dna)
#print len(dna_minus)
dnamann = ["."] * len(dna_minus)
- if rf == "-1":
+ if rf == "-1":
pepm = d_collect["-1"][1]
annm = d_collect["-1"][2]
pepme = explodePep(pepm)+" "+" "
@@ -811,9 +814,9 @@ def parseOpts():
annm = d_collect["-3"][2]
pepme = " "+explodePep(pepm)+" "+" "
j = 2
-
+
pepme = pepme[:len(dna_minus)]
-
+
for i in range(0,len(pepm)):
ac = annm[i]
if ac in ["M","m","*"]:
@@ -852,42 +855,42 @@ def parseOpts():
if ac in ["M","m","*"]:
dnapos = (i*3) + j
if ac == "M": c = ">"
- elif ac == "m": c = ")"
+ elif ac == "m": c = ")"
else: c = "*"
for k in range(dnapos,dnapos+3):
dnapann[k] = c
-
+
ann = "".join(dnapann)
vals = [pepx,dna_plus,ann]
- labels = ["","5'",""]
-
-
+ labels = ["","5'",""]
+
+
l = makePretty(title,vals,labels,len(dna))
reports.append(l)
###print "".join(l)
-
-
- for rf in rf_list:
+
+
+ for rf in rf_list:
name,seq,ann,com = d_collect[rf]
new_com = ""
-
+
# Seq may contain leding and trailing spaces if we are in ORF finde mode
if opt.orf:
new_seq = []
new_ann = []
for i in range(0,len(seq)):
- if seq[i] <> " ":
+ if seq[i] != " ":
new_seq.append(seq[i])
new_ann.append(ann[i])
-
+
seq = "".join(new_seq)
ann = "".join(new_ann)
-
+
new_com = '/orf_mode="%s"; /dna="%s";' % (opt.orf,com)
-
+
pepseqs.append( (name,seq,ann,new_com) )
# pepseqs.append( (cname,pep,pa,com.strip()) )
-
+
# Files with intron/exon annotation ---------------------------------------------
else:
exon_count = 1
@@ -895,17 +898,17 @@ def parseOpts():
start, end = mo.span()
seq_proc += dna[start:end]
ann_proc += ann[start:end]
-
+
exnum_chr = "%x" % exon_count
exnum_proc += exnum_chr * (end-start)
exon_count = (exon_count + 1) % 0x10
-
+
# DEBUG
if opt.debug:
- print seq_proc
- print ann_proc
- print rf_proc
- print exnum_proc
+ print(seq_proc)
+ print(ann_proc)
+ print(rf_proc)
+ print(exnum_proc)
# Process comments
if opt.keepcomment or opt.processcomment:
@@ -928,37 +931,37 @@ def parseOpts():
-        rf_proc = "012" * (len(ann_proc) / 3)
+        rf_proc = "012" * (len(ann_proc) // 3)  # integer division required in Python 3
rf_proc += "012"[:len(ann_proc) % 3]
- for i in range(1,len(pa)*3): # Skip first position
+ for i in range(1,len(pa)*3): # Skip first position
if ann_proc[i] == "(":
-            pa[i/3] = rf_proc[i]
+            pa[i//3] = rf_proc[i]  # integer division required in Python 3
# Store translated sequence
prot_ann = "".join(pa)
pepseqs.append( (name,pep,prot_ann,com.strip()) )
-
+
# Pretty printing for the report
title = "%s - " % (name)
if opt.exonann:
title += "translation and annotation of the exonic structure"
-
+
elif opt.intronrf:
title += "translation and annotation of the position and phase of the introns"
-
+
vals = [pep,prot_ann]
labels = ["pep:","ann:"]
l = makePretty(title,vals,labels,len(pep))
reports.append(l)
-
+
# Output the results -------------------------------------------------------
# Step 1) Combined results.
-    if opt.outfile: outstrem = openForWriteOrDie(opt.outfile)
+    if opt.outfile: outstream = openForWriteOrDie(opt.outfile)  # fix typo so the -F option actually works
else: outstream = sys.stdout
-
+
if opt.outformat in ["REPORT","AUTO"]:
for l in reports: outstream.writelines(l)
- print >> outstream, "//"
-
+ print("//", file=outstream)
+
if "TAB" == opt.outformat:
outFasta = False
elif "FASTA" == opt.outformat:
@@ -970,24 +973,24 @@ def parseOpts():
writeFasta(pepseqs,outstream)
else:
writeTab(pepseqs,outstream)
-
- #outstream.close()
+
+ #outstream.close()
# Step 2) Write specifik sub-result if requested
if opt.reportfile:
if opt.reportfile == "-": outstream = sys.stdout
else: outstream = openForWriteOrDie(opt.reportfile)
-
+
for l in reports: outstream.writelines(l)
-
+
if opt.fastafile:
if opt.fastafile == "-": outstream = sys.stdout
else: outstream = openForWriteOrDie(opt.fastafile)
-
+
writeFasta(pepseqs,outstream)
-
+
if opt.tabfile:
if opt.tabfile == "-": outstream = sys.stdout
else: outstream = openForWriteOrDie(opt.tabfile)
-
+
writeTab(pepseqs,outstream)
diff --git a/bin/virtualribosomev2/mod_translate.py b/bin/virtualribosomev2/mod_translate.py
old mode 100644
new mode 100755
index c9017f9..af9af91
--- a/bin/virtualribosomev2/mod_translate.py
+++ b/bin/virtualribosomev2/mod_translate.py
@@ -1,4 +1,6 @@
-#!/usr/local/python/bin/python
+#!/usr/bin/env python3
+
+# Script was modified from the original to use Python 3
# Copyright 2002,2003,2004,2005 Rasmus Wernersson, Technical University of Denmark
#
@@ -35,7 +37,7 @@ def __init__(self):
self.description = ""
self.d_all = {}
self.d_first = {}
-
+
def toString(self):
return "Description: %s\nd_all: %s\nd_first: %s" % (self.description,str(self.d_all),str(self.d_first))
@@ -96,7 +98,7 @@ def toString(self):
iupac["N"] = "ACGT" #aNy
-alphaDNA = "ACGTRYMKWSBDHVN"
+alphaDNA = "ACGTRYMKWSBDHVN"
alphaDNAStrict = "ACGT"
alphaPep ="*ACDEFGHIKLMNPQRSTVWY"
@@ -112,19 +114,19 @@ def parseNcbiTable(lines):
for line in lines.split("\n"):
line = line.strip()
#print "!!!"+line
-
+
if line.startswith("name ") and (desc == ""):
dRec.description += line.split('"')[1]+" "
-
+
elif line.startswith("id "):
tab_id = line.split()[1]
-
+
elif line.startswith("ncbieaa"):
aa_all = line.split('"')[1]
-
+
elif line.startswith("sncbieaa"):
aa_first = line.split('"')[1]
-
+
c = 0
for b1 in "TCAG":
for b2 in "TCAG":
@@ -134,7 +136,7 @@ def parseNcbiTable(lines):
aaf = aa_first[c]
if aaf == "-": aaf = aa_all[c]
dRec.d_first[codon] = aaf
-
+
c += 1
result[tab_id] = dRec
dRec = TransTableRec()
@@ -145,51 +147,51 @@ def parseMatrixLines(iterator):
result = {}
for line in iterator:
line = line.strip()
-
- if not line:
+
+ if not line:
continue # Ignore blank lines
- if line.startswith("#"):
+ if line.startswith("#"):
continue # Ignore comment lines
-
+
tokens = line.split()
try:
codon, aa = tokens
-
+
# Skip invalid entries
- if len(codon) <> 3:
+ if len(codon) != 3:
badCodon = 1
else:
codon = codon.upper().replace("U","T")
- for c in codon:
+ for c in codon:
if not c in alphaDNAStrict: badCodon = 1
badCodon = 0
-
+
if badCodon:
raise "Bad codon: %s [%s]" % (codon,line)
-
-
- if len(aa) <> 1: #or (not aa in alphaPep):
+
+
+ if len(aa) != 1: #or (not aa in alphaPep):
raise "Bad aa: %s [%s]" % (aa,line)
-
+
result[codon] = aa
-
- except Exception, e:
+
+ except Exception as e:
if DEBUG:
sys.stderr.write("Matrix Error - %s\n" % e)
-
+
-    if len(d) != 64 and DEBUG:
+    if len(result) != 64 and DEBUG:  # 'result' holds the parsed matrix entries
sys.stderr.write("Matrix Error - size of matrix differs from 64 [%i]\n" % len(d))
-
+
return result
-
+
def parseMatrixFile(filename):
- if d_ncbi_table.has_key(filename):
+ if filename in d_ncbi_table:
dRec = d_ncbi_table[filename]
return dRec
-
+
dRec = TransTableRec()
- dRec.d_all = parseMatrixLines(open(filename,"r").xreadlines())
+ dRec.d_all = parseMatrixLines(open(filename,"r"))
dRec.d_first = dRec.d_all
dRec.description = "Custom translation table '%s'" % filename
return dRec
@@ -208,29 +210,29 @@ def trim_old(seq):
# Assumption: Degenerate codons are rare - speed is not an issue
def decode(codon,dRec,isFirst):
if len(codon) != 3: return []
-
+
#Use the relevant translation table
if isFirst: d_gc = dRec.d_first
else: d_gc = dRec.d_all
-
+
#Check for the simple case: this is a standard non-degenerate codon
- if d_gc.has_key(codon):
+ if codon in d_gc:
return [ d_gc[codon]]
-
+
#The codon is to some degree degenerate - start the whole recursive scheme
result = []
-
+
for i in range(0,3):
p = iupac[codon[i]]
if (len(p) > 1):
for c in p:
result += (decode(codon[0:i]+c+codon[i+1:3],dRec,isFirst))
return result
-
+
if len(p) == 0:
return [] # Unknown/illegal char
#return [d[codon]]
-
+
def condense(lst):
result = []
for e in lst:
@@ -239,7 +241,7 @@ def condense(lst):
def translate(seq,transRec):
return translate(seq,transRec,True,True)
-
+
def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon):
debug = False
if not transRec:
@@ -248,14 +250,14 @@ def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon):
result = []
seq = trim(seq)
if firstIsStartCodon: isFirst = True
- else: isFirst = False
+ else: isFirst = False
for i in range(0,len(seq),3):
aa = condense(decode(seq[i:i+3],transRec,isFirst))
- if debug: print seq[i:i+3], aa
+ if debug: print(seq[i:i+3], aa)
if aa:
- if aa[0] == "*" and not readThroughStopCodon:
+ if aa[0] == "*" and not readThroughStopCodon:
break
-
+
if len(aa) == 1: result.append(aa[0])
else:
-            s = string.join(aa,"")
+            s = "".join(aa)  # string.join() was removed in Python 3
@@ -264,12 +266,12 @@ def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon):
else : result.append("X") #Any
#print seq[i:i+3]
isFirst = False
-
- pepseq = "".join(result)
+
+ pepseq = "".join(result)
if debug:
- print pepseq
+ print(pepseq)
return pepseq
-
+
# Annotate all possible start and stop codons
def annotate(seq,transRec):
debug = False
@@ -279,18 +281,18 @@ def annotate(seq,transRec):
for i in range(0,len(seq),3):
codon = seq[i:i+3]
aa = condense(decode(codon,transRec,True))
- if debug: print seq[i:i+3], aa
+ if debug: print(seq[i:i+3], aa)
if aa:
- if "*" in aa:
+ if "*" in aa:
result.append("*")
elif ["M"] == aa:
aa_int = condense(decode(codon,transRec,False))
if aa_int == ["M"]: result.append("M")
else: result.append("m")
- else:
- result.append(".")
-
- return "".join(result)
+ else:
+ result.append(".")
+
+ return "".join(result)
#def translate(seq):
# translate(seq,None)
@@ -298,10 +300,10 @@ def annotate(seq,transRec):
try:
import ncbi_genetic_codes
d_ncbi_table = parseNcbiTable(ncbi_genetic_codes.ncbi_gc_table)
-
+
except:
pass
-
+
if __name__ == "__main__":
for line in sys.stdin.readlines():
- print translate(line,None,True,False)
+ print(translate(line,None,True,False))
diff --git a/docs/HelpDocumentation.md b/docs/HelpDocumentation.md
index 5ec083b..20fb68b 100644
--- a/docs/HelpDocumentation.md
+++ b/docs/HelpDocumentation.md
@@ -1,56 +1,52 @@
![vAMPirus logo](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vamplogo.png)
-# Table of contents
-* [Introduction to vAMPirus](#Introduction-to-vAMPirus)
- * [Contact/support](#Contact/support)
- * [Who to cite](#Who-to-cite)
-* [Getting started with vAMPirus](#Getting-started-with-vAMPirus)
- * [Order of operations](#Order-of-operations)
- * [Windows OS users](#Windows-OS-users)
- * [MacOS users](#MacOS-users)
- * [Installing and running the VM](#Installing-and-running-the-VM-on-MacOS)
- * [Install Homebrew](#Install-Homebrew)
- * [Install Vagrant and Virtual Box](#Install-Vagrant-and-Virtual-Box)
- * [Building and starting](#Building-and-starting-your-virtual-environment)
- * [Transferring files](#Transferring-files-to-and-from-VM-with-Vagrant-scp)
-* [Installing vAMPirus](#Installing-vAMPirus)
- * [Cloning the repository](#Cloning-the-repository-(skip-if-you-generated-the-Vagrant-virtual-environment))
- * [Setting up vAMPirus](#Setting-up-vAMPirus-dependencies-and-checking-installation)
- * [Databases](#Databases)
-* [Running vAMPirus](#Running-vAMpirus)
- * [Testing vAMPirus installation](#Testing-vAMPirus-installation)
- * [Containers](#Using-Singularity)
-* [Extra notes](#Things-to-know-before-running-vAMPirus)
-
-
# Introduction to vAMPirus
Viruses are the most abundant biological entities on the planet and with advances in next-generation sequencing technologies, there has been significant effort in deciphering the global virome and its impact in nature (Suttle 2007; Breitbart 2019). A common method for studying viruses in the lab or environment is amplicon sequencing, an economic and effective approach for investigating virus diversity and community dynamics. The highly targeted nature of amplicon sequencing allows in-depth characterization of genetic variants within a specific taxonomic grouping, facilitating both virus discovery and screening within samples. However, the high volume of amplicon data produced, combined with the highly variable nature of virus evolution across different genes and virus-types, can make it difficult to scale and standardize analytical approaches. Here we present vAMPirus (https://github.com/Aveglia/vAMPirus.git), an automated and easy-to-use virus amplicon sequencing analysis program that is integrated with the Nextflow workflow manager, facilitating easy scalability and standardization of analyses.
+![vAMPirus general workflow](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vAMPirus_generalflow.png)
+
The vAMPirus program contains two different pipelines:
-1. DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform future analyses
+1. DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform analyses
-![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_datacheckUPDATED.png)
+![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_datacheckV2.png)
2. Analyze pipeline: a comprehensive analysis of the provided data producing a wide range of results and outputs which includes an interactive report with figures and statistics. NOTE- stats option has changed on 2/19/21; you only need to add "--stats" to the launch command without "run"
-
-![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_analysisUPDATED.png)
-
+![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_analyzeV2.png)
## Contact/support
If you have a feature request or any feedback/questions, feel free to email vAMPirusHelp@gmail.com or you can open an Issue on GitHub.
+## Changes in version 2.0.0
+
+1. (EXPERIMENTAL) Added Minimum Entropy Decomposition (MED) analysis using the oligotyping program produced by the Meren Lab. This allows sequence clustering based on sequence positions of interest (biologically meaningful) or the top positions with the highest Shannon entropy (read more here: https://merenlab.org/software/oligotyping/ and below).
+
+2. Added more useful taxonomic classification of sequences leveraging the RVDB annotation database and/or NCBI taxonomy files (read more below).
+
+3. Replaced the use of MAFFT with muscle v5 (Edgar 2021) for more accurate virus gene alignments (see https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full).
+
+4. Added multiple primer pair removal to deal with multiplexed amplicon libraries.
+
+5. ASV filtering - you can now provide a "filter" and a "keep" database to remove certain sequences from the analysis.
+
+6. Reduced redundancy of processes and the volume of generated result files per full run (for example, read processing is only done once when running DataCheck then Analyze).
+
+7. Nodes on phylogenetic trees can now be colored based on taxonomy or Minimum Entropy Decomposition results (see the example command after this list).
+
+8. PCoA plots added to report if NMDS does not converge.
+
+
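+For example, MED (change 1) and tree-node coloring (change 7) can be combined in a single Analyze run (a sketch using the new flags; full example launch commands appear later in this manual):
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --asvMED --aminoMED --nodeCol TAX --stats
+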
## Who to cite
If you do use vAMPirus for your analyses, please cite the following ->
1. vAMPirus - Veglia, A.J., Rivera Vicens, R., Grupstra, C., Howe-Kerr, L., and Correa A.M.S. (2020) vAMPirus: An automated virus amplicon sequencing analysis pipeline. Zenodo. *DOI:*
-2. Diamond - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
+2. DIAMOND - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
3. FastQC - Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
@@ -62,7 +58,7 @@ If you do use vAMPirus for your analyses, please cite the following ->
7. ModelTest-NG - Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B., & Flouri, T. (2020). ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Molecular biology and evolution, 37(1), 291-294.
-8. MAFFT - Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
+8. muscle v5 - R.C. Edgar (2021) "MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping" https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full.pdf
9. vsearch - Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.
@@ -78,22 +74,24 @@ If you do use vAMPirus for your analyses, please cite the following ->
15. UNOISE algorithm - R.C. Edgar (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257
+16. Oligotyping - A. Murat Eren, Gary G. Borisy, Susan M. Huse, Jessica L. Mark Welch (2014). Oligotyping analysis of the human oral microbiome. Proceedings of the National Academy of Sciences Jul 2014, 111 (28) E2875-E2884; DOI: 10.1073/pnas.1409644111
+
# Getting started with vAMPirus
-## Order of operations
+## General order of operations
1. Clone vAMPirus from github
-2. Before launching the vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases
+2. Before launching vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases (NOTE: you will need to have the xz program installed before running the startup script when downloading the RVDB database)
-3. Test the vAMPirus installation with the provided test dataset (if you have ran the start up script, you can see STARTUP_HELP.txt for test commands and other examples)
+3. Test the vAMPirus installation with the provided test dataset (if you have run the startup script, you can see EXAMPLE_COMMANDS.txt in the vAMPirus directory for test commands and other examples)
4. Edit parameters in vampirus.config file
-5. Launch the DataCheck pipeline to get summary information about your dataset
+5. Launch the DataCheck pipeline to get summary information about your dataset (e.g. sequencing success, read quality information, clustering behavior of ASV or AminoTypes)
-6. Change any parameters in vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length)
+6. Change any parameters in the vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length, or MED positions informed by the Shannon entropy analysis results)
7. Launch the Analyze pipeline to perform a comprehensive analysis with your dataset
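For example, steps 1 and 2 above as shell commands (the "-d 1 -t" database/taxonomy flags shown here are just one illustrative choice; see the startup script usage further below):

    git clone https://github.com/Aveglia/vAMPirus.git
    cd vAMPirus
    bash vampirus_startup.sh -d 1 -t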
@@ -108,7 +106,7 @@ All you will need to do is set up the subsystem with whatever flavor of Linux yo
Search for Linux in the Microsoft store -> https://www.microsoft.com/en-us/search?q=linux
-It should be noted that vAMPirus was developed on Centos7/8.
+It should be noted that vAMPirus has been mainly tested on Centos7/8.
Here are some brief instructions for setting up Ubuntu 20.04 LTS on Windows 10 sourced from https://ubuntu.com/tutorials/ubuntu-on-windows#1-overview
@@ -152,7 +150,7 @@ Next we will install Java to be able to run Nextflow -
sudo apt -y install openjdk-8-jre
-Now you have everything you need to get started as described in the * [Installing vAMPirus](#Installing-vAMPirus) section. However, here are some quick commands to install and set up Conda.
+Now you have everything you need to get started as described in the [Installing vAMPirus](#Installing-vAMPirus) section. However, here are some quick commands to install and set up Miniconda.
NOTE=> Windows WSL currently can not run Singularity so you will have to install and run vAMPirus with Conda.
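If you do not have Conda yet, here is a minimal sketch of the usual Miniconda3 install on Linux/WSL (verify the installer name against the Miniconda download page):

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

Then follow the installer prompts and close/re-open your terminal.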
@@ -175,14 +173,14 @@ You can check/confirm you have conda ready to go ->
conda init
-Once you have have your Conda ready, you can execute the vAMPirus startup script to install Nextflow and build the vAMPirus conda environment.
+Once you have your Conda ready, you can execute the vAMPirus startup script to install Nextflow and build the vAMPirus conda environment.
## MacOS users
If you plan to run vAMPirus on a Mac computer, it is recommended that you set up a virtual environment and use Singularity with the vAMPirus Docker image to run your analyses.
-You can try to run directly on your system, but there may be errors caused by differences between Apply and GNU versions of tools like "sort".
+You can try to run directly on your system, but there may be errors caused by differences between Apple and GNU versions of tools like "sort".
vAMPirus was developed on a Centos7/8 operating system so we will go through how to set up a Centos7 Vagrant virtual environment with Virtual Box.
@@ -213,7 +211,7 @@ Once done with the full installation, execute:
(Information from https://treehouse.github.io/installation-guides/mac/homebrew)
-Now we should be good to use Homebrew to install Vagrant and VirtualBox to set up the VPN.
+Now we should be good to use Homebrew to install Vagrant and VirtualBox to set up the VM.
#### Install Vagrant and Virtual Box
@@ -244,7 +242,7 @@ Virtual Box (WILL CAUSE ERROR IF ORACLE NOT GIVEN PERMISSION TO INSTALL DEPENDEN
NOTE=> In this part of the setup, you might get an error saying Oracle was denied permission to install programs. You will need to go to System Preferences->Security and Privacy->General and allow Oracle permission to download programs and then rerun the above command.
-Alright, if you notice no errors during installation, you should be good to go and create the Centos 7 environment
+Alright, if you notice no errors during installation, you should be good to go and create the Centos7 environment
#### Building and starting your virtual environment
@@ -281,6 +279,7 @@ We will make our own that looks like this:
yum -y install epel-release
yum -y install htop
yum -y install nano
+ yum -y install xz
git clone https://github.com/Aveglia/vAMPirus.git
yum -y install singularity
SHELL
@@ -317,7 +316,7 @@ You should now be in your fresh Centos7 virtual environment and if you ls you wi
You can now follow the normal directions for setting up vAMPirus with singularity.
-But here is the quick overview of recommended next steps:
+But here is the quick overview of recommended next steps (without database/taxonomy install):
cd ./vAMPirus; bash vampirus_startup.sh -s
@@ -327,7 +326,7 @@ After running the above you should now have Nextflow installed. Now, build the S
then test the Analyze pipeline with:
- ./nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats
+ ./nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --asvMED --aminoMED --stats
Please check out http://sourabhbajaj.com/mac-setup/Vagrant/README.html and https://www.vagrantup.com/docs/providers/virtualbox for understanding how to use Vagrant commands like "halt", "suspend" or "reload"
@@ -393,7 +392,7 @@ You will also need to decide if you plan to use a container engine like Docker (
The startup script provided in the vAMPirus program directory will install Conda for you if you tell it to (see below), however, you will need to install Docker or Singularity separately before running vAMPirus.
-## Setting up vAMPirus dependencies and checking installation
+### Setting up vAMPirus dependencies and checking installation
To set up and install vAMPirus dependencies, simply move to the vAMPirus directory and run the vampirus_startup.sh script.
@@ -412,30 +411,33 @@ You can also use the startup script to install different databases to use for vA
2. The proteic version of the Reference Viral DataBase (RVDB) (See https://f1000research.com/articles/8-530)
3. The complete NCBI NR protein database
-To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option.
-If we look at the script usage:
+To use the vampirus_startup.sh script to download any or all of the databases listed above, use the "-d" option; you can also download the NCBI taxonomy files with the "-t" option (see below).
- General execution:
+If we take a look at the vampirus_startup.sh script usage:
- vampirus_startup.sh -h [-d 1|2|3|4] [-s]
+General execution:
- Command line options:
+vampirus_startup.sh -h [-d 1|2|3|4] [-s] [-t]
- [ -h ] Print help information
+ Command line options:
- [ -d 1|2|3|4 ] Set this option to create a database directiory within the current working directory and download the following databases for taxonomy assignment:
+ [ -h ] Print help information
- 1 - Download the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
- 2 - Download only NCBIs Viral protein RefSeq database
- 3 - Download only the complete NCBI NR protein database
- 4 - Download all three databases
+ [ -d 1|2|3|4 ] Set this option to create a database directory within the current working directory and download the following databases for taxonomy assignment:
- [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
+ 1 - Download only the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
+ 2 - Download only NCBIs Viral protein RefSeq database
+ 3 - Download only the complete NCBI NR protein database
+ 4 - Download all three databases
+ [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
-For example, if you would like to install Nextflow, download NCBIs Viral protein RefSeq database, and check/install conda, run:
+ [ -t ] Set this option to download NCBI taxonomy files needed for DIAMOND to assign taxonomic classification to sequences (works with NCBI type databases only, see manual for more information)
- bash vampirus_startup.sh -d 2
+
+For example, if you would like to install Nextflow, download NCBI's Viral protein RefSeq database and the NCBI taxonomy files for the DIAMOND taxonomy assignment feature, and check/install Conda, run:
+
+ bash vampirus_startup.sh -d 2 -t
and if we wanted to do the same thing as above but skip the Conda check/installation, run:
@@ -443,29 +445,37 @@ and if we wanted to do the same thing as above but skip the Conda check/installa
NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed.
+**NEW in version 2.0.0** -> the startup script will automatically download annotation information from RVDB to infer Lowest Common Ancestor (LCA) information for hits during taxonomy assignment. You can also use "-t" to download NCBI taxonomy files to infer taxonomy using the DIAMOND taxonomy classification feature.
+
### Databases
-It should be noted, that any protein database can be used, but it needs to be in fasta format and the headers for reference sequences need to match
-one of two patterns:
+Any protein database can be used while running vAMPirus; however, it needs to be in fasta format, and the headers of the reference sequences need to match one of two patterns:
-RVDB format (default) -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"
+RVDB format -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"
NCBI NR/RefSeq format -> ">KJX92028.1 hypothetical protein TI39_contig5958g00003 [Zymoseptoria brevis]"
-During Taxonomy Inference, vAMPirus infers results by extracting the information stored in the reference sequence headers. If the database sequence headers do not match these
-patterns, you are bound to see errors in the naming of files created during the Taxonomy Inference phase of vAMPirus.
+To tell vAMPirus which header format the reference database uses, edit line 122 of the vampirus.config file: set "dbtype="NCBI"" for NCBI header format or "dbtype="RVDB"" for RVDB format.
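For example, if your reference database uses RVDB-style headers (shown below), the config line would read:

    dbtype="RVDB"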
+
+An example of custom headers in RVDB format if you plan to use a custom database:
-The default is that vAMPirus assumes that the database headers are in RVDB format, to change this assumption, you would need to edit the configuration file at line 78 where "refseq=F". You could also signal the use of RefSeq format headers within the launch command with adding "--refseq T".
+ `>acc|Custom|VP100000.1|Custom|VP100000|capsid protein [T4 Phage isolate 1]`
+ AMINOACIDSEQUENCE
+ `>acc|Custom|VP100000.2|Custom|VP100000|capsid protein [T4 Phage isolate 2]`
+ AMINOACIDSEQUENCE
+ `>acc|Custom|VP2000.1|Custom|VP2000|capsid protein [T7 phage isolate]`
+ AMINOACIDSEQUENCE
-An example of custom headers if you plan to use a custom database:
+Or in NCBI format the same sequences would be:
+
+ `>VP100000.1 capsid protein [T4 Phage isolate 1]`
+ AMINOACIDSEQUENCE
+ `>VP100000.2 capsid protein [T4 Phage isolate 2]`
+ AMINOACIDSEQUENCE
+ `>VP2000.1 capsid protein [T7 phage isolate]`
+ AMINOACIDSEQUENCE
- `>acc|Custom|VP100000.1|Custom|VP100000|capsid protein [T4 Phage isolate 1]`
- AMINOACIDSEQUENCE
- `>acc|Custom|VP100000.2|Custom|VP100000|capsid protein [T4 Phage isolate 2]`
- AMINOACIDSEQUENCE
- `>acc|Custom|VP2000.1|Custom|VP2000| capsid protein [T7 phage isolate]`
- AMINOACIDSEQUENCE
### Using Singularity
@@ -506,7 +516,7 @@ Using the yum package manager ->
After running the startup script, you can then test the vAMPirus installation with the supplied test dataset.
-The startup script will generate a text file (STARTUP_HELP.txt) that has instructions and example commands to test the installation.
+The startup script will generate a text file (EXAMPLE_COMMANDS.txt) that has instructions and example commands to test the installation.
NOTE => If using Singularity, when you run the test command calling for singularity (-profile singularity) Nextflow will set up the dependencies.
@@ -530,11 +540,11 @@ OR
Analyze test =>
- /path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --stats
+ /path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --asvMED --aminoMED --stats
OR
- nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats
+ nextflow run vAMPirus.nf -c /path/to/vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --asvMED --aminoMED --stats
### Resuming test analyses if you ran into an error
@@ -576,7 +586,7 @@ In the command above, there are five necessary pieces of information needed to s
2. Second, you must tell Nextflow to "run" the vAMPirus program which is described in the "vAMPirus.nf" file. Depending on where you plan to submit this command, you may have to specify the path to the vAMPirus.nf file or you can copy the file to your working directory.
-3. Next, we need to tell Nextflow what configuration file we would like to use for the vAMPirus run which is illustrated by the "-c vampirus.config" segment of the command.
+3. Next, we need to tell Nextflow what configuration file we would like to use for the vAMPirus run, which is illustrated by the "-c vampirus.config" segment of the command. NOTE: the configuration file can be named anything (e.g. "myproject.config") as long as you point to it with "-c".
4. The next piece of information Nextflow needs is whether you plan to use the vAMPirus conda environment or the vAMPirus Docker container with singularity.
@@ -586,7 +596,7 @@ Now that we have an understanding on how to deploy vAMPirus with Nextflow, let's
## The Nextflow monitoring screen
-When submitting a launch command to start your vAMPirus run, Nextflow will spit out something that looks like this:
+When submitting a launch command to start your vAMPirus run, Nextflow will spit out something that looks like this (the example below is from an older version):
executor > local (57)
[8a/75e048] process > Build_database [100%] 1 of 1 ✔
@@ -607,15 +617,33 @@ When submitting a launch command to start your vAMPirus run, Nextflow will spit
[26/1143ba] process > combine_csv_DC [100%] 1 of 1 ✔
[2e/e5fea3] process > Report_DataCheck [100%] 1 of 1 ✔
-Nextflow allows for interactive monitoring of submitted workflows, so in this example, we see the left column containing working directories for each process being executed, next to that we see the process name, and the final column on the right contains the status and success of each process. In this example each process has been executed successfully and has been cached. The amazing thing about Nextflow is that say you received an error or decide you would like to change a parameter/add a type of clustering you can use the Nextflow "-resume" option that means you don't re-run already completed processes.
+Nextflow allows for interactive monitoring of submitted workflows, so in this example, we see the left column containing working directories for each process being executed, next to that we see the process name, and the final column on the right contains the status and success of each process. In this example each process has been executed successfully and has been cached.
+
+You can also remotely monitor a run using Nextflow Tower (see tower.nf), which lets you track your run (and even launch new runs) from a portal in your browser.
+
+## The Nextflow "-resume" feature
+
+The amazing thing about Nextflow is that it caches previously completed processes run on the same samples. For example, if you received an error during a run or ran out of walltime on an HPC, you can just add "-resume" to your Nextflow launch command like so:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda,singularity] --Analyze|DataCheck -resume
+
+Nextflow will then pick up where the previous run left off.
+
+You can even use this feature to change a parameter, add a type of clustering, or add Minimum Entropy Decomposition analysis; you would just rerun the same command as above with minor changes:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda,singularity] --Analyze|DataCheck --ncASV --asvMED -resume
+
+With this command, you will then add nucleotide-level clustering of ASVs and Minimum Entropy Decomposition analyses to your results directory, all without rerunning any processes.
## Understanding the vAMPirus config file and setting parameters
### The configuration file (vampirus.config)
-Nextflow deployment of vAMPirus relies on the use of the configuration file (vampirus.config) that is found in the vAMPirus program directory. The configuration file is a great way to store parameters/options used in your analyses. It also makes it pretty easy to set and keep track of multiple parameters as well as storing custom default values that you feel work best for your data. You can also have multiple copies of vAMPirus configuration files with different parameters, you would just have to specify the correct file with the "-c" argument shown in the section before.
+Nextflow deployment of vAMPirus relies on the use of the configuration file (vampirus.config; it can be renamed to anything as long as it is specified in the launch command) that is found in the vAMPirus program directory. The configuration file is a great way to store parameters/options used in your analyses. It also makes it pretty easy to set and keep track of multiple parameters as well as storing custom default values that you feel work best for your data. You can also have multiple copies of vAMPirus configuration files with different parameters; you would just have to specify the correct file with the "-c" argument shown in the section before.
+
+Furthermore, the configuration file contains analysis-specific parameters AND resource-specific Nextflow launching parameters. A benefit of Nextflow integration is that you can run the vAMPirus workflow on a large HPC just as easily as you could on your local machine.
-Furthermore, the configuration file contains analysis-specific parameters AND resource-specific Nextflow launching parameters. A benefit of Nextflow integration, is that you can run the vAMPirus workflow on a large HPC just as easily as you could on your local machine. If you look at line 151 and greater in the vampirus.config file, you will see resource-specific parameters that you can alter before any run. Nexflow is capable of submitting jobs automatically using slurm and PBS, check out the Nextflow docs to learn more (https://www.nextflow.io/docs/latest/executor.html)!
+If you look at line 233 and greater in the vampirus.config file, you will see resource-specific parameters that you can alter before any run. Nextflow is capable of submitting jobs automatically using slurm and PBS; check out the Nextflow docs to learn more (https://www.nextflow.io/docs/latest/executor.html)!
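As a minimal sketch, an HPC run might add something like the following to the config (the queue name is a placeholder for whatever your cluster uses):

    process {
        executor = 'slurm'
        queue = 'your_queue_name'
    }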
### Setting parameter values
@@ -625,22 +653,20 @@ There are two ways to set parameters with Nextflow and vAMPirus:
Here we have a block from the vampirus.config file that stores information related to your run:
- // Project/analyses- specific information
+ // Project specific information
+
// Project name - Name that will be used as a prefix for naming files by vAMPirus
- projtag="vAMPrun"
+ projtag="vAMPirusAnalysis"
// Path to metadata spreadsheet file to be used for plot
- metadata="/PATH/TO/metadata.csv"
- // Minimum number of hit counts for a sample to have to be included in the downstream analyses and report generation
- minimumCounts="1000"
- // PATH to current working directory
- mypwd="/PATH/TO/working_directory"
- email="your_email@web.com"
- // reads directory
- reads="/PATH/TO/reads/R{1,2}_001.fastq.gz"
- // Directory to store output of vAMPirus analyses
+ metadata="/PATH/TO/vampirus_meta.csv"
+ // reads directory, must specify the path with "R{1,2}" for reads to be properly read by Nextflow
+ reads="/PATH/TO/reads/"
+ // PATH to working directory of your choosing, will automatically be set to vAMPirus installation
+ workingdir="VAMPDIR"
+ // Name of directory created to store output of vAMPirus analyses (Nextflow will create this directory in the working directory)
outdir="results"
-The first one in the block is the project tag or "projtag" which by default, if unchanged, will use the prefix "vAMPrun". To change this value, and any other parameter value, just edit right in the configuration file so if you wanted to call the run "VirusRun1" you would edit the line to:
+The first one in the block is the project tag or "projtag" which by default, if unchanged, will use the prefix "vAMPirusAnalysis". To change this value, and any other parameter value, just edit right in the configuration file so if you wanted to call the run "VirusRun1" you would edit the line to:
// Project/analyses- specific information
// Project name - Name that will be used as a prefix for naming files by vAMPirus
@@ -649,19 +675,19 @@ The first one in the block is the project tag or "projtag" which by default, if
2. Set the value within the launch command itself:
-Instead of editing the configuration file directly, you could set parmeters within the launching command itself. So, for example, if we wanted to run the analysis with nucletide-based clustering of ASVs at 95% similarity, you would do so like this:
+Instead of editing the configuration file directly, you could set parameters within the launching command itself. So, for example, if we wanted to run the analysis with nucleotide-based clustering of ASVs at 95% similarity, you would do so like this:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --clusterNuclID .95
-Here we use the "--Analyze" option that tells vAMPirus that we are ready to analyze soem data. Then the "--ncASV" argument with the "--clisterNuclID .95" tells vAMPirus we would like to cluster our ASVs based on 95% nucleotide similarity. The default ID value is stored at line 51 in the vampirus.config file (currently 85%), but as soon as you specify and provide a value in the command, the default value is overwritten.
+Here we use the "--Analyze" option that tells vAMPirus that we are ready to analyze some data. Then the "--ncASV" argument with "--clusterNuclID .95" tells vAMPirus we would like to cluster our ASVs based on 95% nucleotide similarity. The default ID value is stored at line 66 in the vampirus.config file (currently 85%), but as soon as you specify and provide a value in the command, the value within the config file is ignored.
-NOTE: Nextflow also has options in the launch command. To tell them apart, Nextflow options uses a single dash (e.g. -with-conda) while vAMPirus options are always with a double dash (e.g. --Analyze)
+NOTE: Nextflow also has options in the launch command. To tell them apart, Nextflow options use a single dash (e.g. -with-conda or -profile) while vAMPirus options always use a double dash (e.g. --Analyze)
-### Setting computing resource parameters - Edit in lines 151-171 in vampirus.config
+### Setting computing resource parameters - Edit in lines 241-261 in vampirus.config
Each process within the vAMPirus workflow is tagged with either "low_cpus", "norm_cpus", or "high_cpus" (see below), which lets Nextflow know the amount of CPUs and memory required for each process; this is then used when Nextflow submits a given job or task. Nextflow actively assesses the amount of available resources on your machine and will submit tasks only when the proper amount of resources can be requested.
-From line 203-217 in the vAMPirus.config file is where you can edit these values for whichever machine you plan to run the workflow on.
+Lines 241-261 in the vAMPirus.config file are where you can edit these values for whichever machine you plan to run the workflow on.
process {
withLabel: low_cpus {
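        // The block continues in vampirus.config with one entry per label;
        // a sketch of its shape is below - the cpu/memory values here are
        // placeholders, not the shipped defaults.
        cpus = 2
        memory = '2 GB'
    }
    withLabel: norm_cpus {
        cpus = 4
        memory = '8 GB'
    }
    withLabel: high_cpus {
        cpus = 8
        memory = '16 GB'
    }
}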
@@ -694,7 +720,7 @@ As stated before, you can launch vAMPirus on either your personal laptop OR a la
To specify certain parts of the vAMPirus workflow to perform in a given run, you can use skip options to have vAMPirus ignore certain processes. Here are the current skip options you can specify within the launch command:
// Skip options
- // Skip all Read Processing
+ // Skip all Read Processing steps
skipReadProcessing=false
// Skip quality control processes only
skipFastQC = false
@@ -712,6 +738,8 @@ To specify certain parts of the vAMPirus workflow to perform in a given run, you
skipEMBOSS = false
// Skip Reports
skipReport = false
+ // Skip Merging steps
+ skipMerging = false
To utilize these skip options, just add it to the launch command like so:
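For instance, a hypothetical run skipping the FastQC and EMBOSS steps (any of the options above can be combined this way):

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --skipFastQC --skipEMBOSS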
@@ -734,7 +762,7 @@ With this launch command, vAMPirus will perform ASV generation and nucleotide-ba
Once you have everything set up and you have edited the parameters of interest in your configuration file you can run the following launch command for a full analysis:
- nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats
This launch command will run all aspects of the vAMPirus workflow on your data and spit out final reports for each clustering %ID and technique.
@@ -742,7 +770,7 @@ This launch command will run all aspects of the vAMPirus workflow on your data a
### Sequencing reads
-Input can be raw or processed compressed or non-compressed fastq files with names containing "\_R1" or "\_R2". You can specify the directory containing your reads in line 25 of the vampirus.config file.
+Input can be raw or processed compressed or non-compressed fastq files with names containing "\_R1" or "\_R2". You can specify the directory containing your reads in line 20 of the vampirus.config file.
NOTE: Sample names are extracted from read library names by using the string to the left of the "\_R" in the filename automatically.
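For example, with hypothetical library names following the default naming pattern:

    Sample1_R1_001.fastq.gz + Sample1_R2_001.fastq.gz  ->  sample name "Sample1"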
@@ -781,7 +809,7 @@ Usage example:
The DataCheck feature of vAMPirus is meant to give the user some information about their data so they can tailor their final analysis appropriately. In DataCheck mode, vAMPirus performs all read processing operations, then generates ASVs and performs nucleotide- and protein-based clustering at 24 different clustering percentages ranging from 55-99% ID. vAMPirus then generates an html report that displays and visualizes read processing and clustering stats. It is recommended that before running any dataset through vAMPirus, you run the data through the DataCheck.
-Here is how Nextflow will display the list of processes vAMPirus will execute during DataCheck (executed with the launch command above):
+Here is how Nextflow will display the list of processes vAMPirus will execute during DataCheck (executed with the launch command above; below is an example from an older version of vAMPirus):
executor > local (57)
[8a/75e048] process > Build_database [100%] 1 of 1 ✔
@@ -805,6 +833,8 @@ Here is how Nextflow will display the list of processes vAMPirus will execute du
Every time you launch vAMPirus with Nextflow, you will see this kind of output that refreshes with the status of the different processes during the run.
+**Add "--asvMED" or "aminoMED" to the launch command above to get Shannon Entropy analysis resutls for ASVs and AminoTypes
+
2. "--Analyze"
@@ -901,6 +931,7 @@ Here is what the Nextflow output would look like for this launch command:
You can see that there are a few more processes now compared to the output of the previous launch command which is what we expect since we are asking vAMPirus to do a little bit more work for us :).
+
# Breaking it down: The vAMPirus workflow
## Read processing
@@ -921,21 +952,21 @@ This is the default action of vAMPirus if no primer sequences are provided, to s
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --GlobTrim 23,26
-The command above is telling vAMPirus to have bbduk.sh remove primers by trimming 23 bases from the forward reads and 26 bases from the reverse reads. The other way to initiate this method of primer removal is to add the same information at line 38 in the configuration file:
+The command above is telling vAMPirus to have bbduk.sh remove primers by trimming 23 bases from the forward reads and 26 bases from the reverse reads. The other way to initiate this method of primer removal is to add the same information at line 38 in the configuration file:
// Primer Removal parameters
// If not specifying primer sequences, forward and reverse reads will be trimmed by number of bases specified using --GlobTrim #basesfromforward,#basesfromreverse
GlobTrim="23,26"
-By adding the information to line 38, vAMPirus will automatically use this method and these parameters for primer removal until told otherwise.
+By adding the information to line 38, vAMPirus will automatically use this method and these parameters for primer removal until told otherwise.
If you want to change the number of bases without editing the configuration file, all you would need to do is then specify in the launch command with "--GlobTrim 20,27" and vAMPirus will ignore the "23,26" in the configuration file.
-NOTE: Specifying global trimming by editing line 38 in the config file or using "--GlobTrim" in the launch command will also override the use of primer sequences for removal if both are specified
+NOTE: Specifying global trimming by editing line 38 in the config file or using "--GlobTrim" in the launch command will also override the use of primer sequences for removal if both are specified
2. Primer removal by specifying primer sequences -
-You can tell vAMPirus to have bbduk.sh search for and remove either a single primer paire or multiple.
+You can tell vAMPirus to have bbduk.sh search for and remove either a single primer pair or multiple.
In the case where you are using a single primer pair, similar to the previous method, you could edit the configuration file or specify within the launch command.
@@ -952,11 +983,11 @@ The primer sequences could also be stored in the configuration file in lines 43-
// Reverse primer sequence
rev="REVPRIMER"
-If you have multiple primer sequences to be removed, all you need to do is provide a fasta file with your primer sequences and signla multiple primer removal using the "--multi" option like so:
+If you have multiple primer sequences to be removed, all you need to do is provide a fasta file with your primer sequences and signal multiple primer removal using the "--multi" option like so:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --multi --primers /path/to/primers.fa
-You can set the path to the primer sequence fasta file within the launch command above or you can have it in the configuration file at line 44:
+You can set the path to the primer sequence fasta file within the launch command above or you can have it in the configuration file at line 44:
// Path to fasta file with primer sequences to remove (need to specify if using --multi option )
primers="/PATH/TO/PRIMERS.fasta"
@@ -975,26 +1006,31 @@ There are also a few other options you can change to best match what your data w
### Read merging and length filtering
-Read merging in the vAMPirus workflow is performed by vsearch and afterwards, reads are trimmed to the expected amplicon length (--maxLen) and any reads with lengths below the user specified minimum read length (--minLen). There are three parameters that you can edit to influence this segment of vAMPirus. If we look at lines 26-33:
+Read merging in the vAMPirus workflow is performed by vsearch; afterwards, reads are trimmed to the expected amplicon length (--maxLen) and any reads with lengths below the user-specified minimum read length (--minLen) are discarded. There are five parameters that you can edit to influence this segment of vAMPirus. If we look at lines 26-33:
- // Merged read length filtering parameters
- // Minimum merged read length - reads below the specified maximum read length will be used for counts only
- minLen="400"
- // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs
- maxLen="422"
- // Maximum expected error for vsearch merge command
- maxEE="1"
+ // Merged read length filtering parameters
+
+ // Minimum merged read length - reads with lengths greater than minLen and below the specified maximum read length will be used for counts only
+ minLen="400"
+ // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs (safe to set at expected amplicon size to start)
+ maxLen="420"
+ // Maximum expected error for vsearch merge command - vsearch discards sequences with more than the specified number of expected errors
+ maxEE="3"
+ // Maximum number of non-matching nucleotides allowed in overlap region
+ diffs="20"
+ // Maximum number of "N"'s in a sequence - if above the specified value, sequence will be discarded
+ maxn="20"
The user can edit the minimum length (--minLen) for reads to be used for counts table generation, the maximum length (--maxLen) for reads used to generate uniques and subsequent ASVs, and the expected error rate (--maxEE) for the overlapping region of reads during read merging with vsearch. The values above are the defaults and should be edited before running your data with --Analyze.
This is where the DataCheck report is very useful, you can review the report and see the number of reads that merge per library and you can edit the expected error value to be less stringent if needed. The DataCheck report also contains a read length distribution that you can use to select an ideal maximum/minimum read length.
-## Amplicon Sequence Variants, AminoTypes and Operational Taxonomic Units
+## Amplicon Sequence Variants, AminoTypes and Clustering
The goal of vAMPirus was to make it easy for the user to analyze their data in many different ways to potentially reveal patterns that would have been missed if limited to one method/pipeline.
-A major and sometimes difficult step in analyzing virus amplicon sequence data is deciding the method to use for identifying or defining different viral "species" in the data. To aid this process, vAMPirus has the DataCheck mode discussed above and has several different options for sequence clustering/analysis for the user to decide between.
+A major, and sometimes difficult, step in analyzing virus amplicon sequence data is deciding the method to use for identifying or defining different viral "species" in the data. To aid this process, vAMPirus has the DataCheck mode discussed above and has several different options for sequence clustering/analysis for the user to decide between.
vAMPirus relies on vsearch using the UNOISE3 algorithm to generate Amplicon Sequencing Variants (ASVs) from dereplicated amplicon reads. ASVs are always generated by default and there are two parameters that the user can specify either in the launch command or by editing the configuration file at lines 45-49:
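A sketch of what that block covers (the parameter names here are assumptions based on the UNOISE3 algorithm's knobs; confirm them against lines 45-49 of your copy of vampirus.config):

    // ASV generation parameters
    // Alpha value for the UNOISE3 algorithm
    alpha="1"
    // Minimum abundance for a unique sequence to be used in ASV generation
    minSize="8"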
@@ -1010,14 +1046,47 @@ Launch command to produce only ASV-related analyses:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --skipAminoTyping
-Now, onto clustering ASVs into clustered ASVs (cASVs). vAMPirus is able to use two different techniques for generating cASVs for the user:
+### ASV filtering (experimental)
+
+New in version 2, you can now filter ASVs to remove sequences that belong to taxonomic groups that are not of interest for a given run.
+
+A great example of when this feature is useful is Prodinger et al. 2020 (https://www.mdpi.com/2076-2607/8/4/506). In this study, they looked to amplify and analyze Mimiviridae polB sequences; however, polB is also found in cellular genomes like those of bacteria. Prodinger et al. therefore looked to avoid including any bacterial polB in their final results and used a filtering step to remove microbial sequences. The ASV filtering feature can be used to do exactly this type of filtering: you provide paths to a "filter database" containing sequences belonging to non-target groups (e.g. microbial polB) and a "keep database" containing sequences belonging to the target group (e.g. Mimiviridae polB). Any ASVs that match non-target sequences will then be filtered from the ASV file prior to running the DataCheck or Analyze pipeline.
+
+Here are the options stored within the configuration file:
+
+ // ASV filtering parameters - You can set the filtering to run with the command --filter
+
+
+ // Path to database containing sequences that, if matched by an ASV, cause that ASV to be removed prior to any analyses
+ filtDB=""
+ // Path to database containing sequences that, if matched by an ASV, cause that ASV to be kept in the final ASV file used in subsequent analyses
+ keepDB=""
+ // Keep any sequences without hits - to do so, set keepnohit="true"
+ keepnohit="true"
+
+ //Parameters for diamond command for filtering
+
+ // Set minimum percent amino acid similarity for best hit to be counted in the filtering step
+ filtminID="80"
+ // Set minimum amino acid alignment length for best hit to be counted in the filtering step
+ filtminaln="30"
+ // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive)
+ filtsensitivity="ultra-sensitive"
+ // Set the max e-value for best hit to be recorded
+ filtevalue="0.001"
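Once the "filtDB"/"keepDB" paths are set in the config, a filtering run could look like this (sketch):

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck --filter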
+
+
+
+### Clustering options
+
+Depending on the virus type/marker gene, ASV-level results can be noisy; to combat this, vAMPirus has three different approaches to clustering ASV sequences:
1. AminoTyping -
vAMPirus by default, unless the --skipAminoTyping option is set, will generate unique amino acid sequences or "AminoTypes" from generated ASVs. These AminoTypes, barring any skip options set, will run through all the same analyses as ASVs.
-vAMPirus will translate the ASVs with Virtual Ribosome and relies on the user to specify the expected or minimum amino acid sequence length (--minAA) to be used for AminoTyping and pcASV generation (discussed below). For example, if you amplicon size is ~422 bp long, you would expect the amino acid translations to be ~140. Thus, you would either edit the --minAA value to be 140 in the configuration file (line 69) or in the launch command.
+vAMPirus will translate the ASVs with Virtual Ribosome and relies on the user to specify the expected or minimum amino acid sequence length (--minAA) to be used for AminoTyping and pcASV generation (discussed below). For example, if your amplicon is ~422 bp long, you would expect the amino acid translations to be ~140 residues. Thus, you would either edit the --minAA value to be 140 in the configuration file (line 69) or in the launch command.
You can make it shorter if you would like, but based on personal observation, a shorter translation is usually the result of stop codons being present, and such sequences would usually be removed from subsequent analyses. If there are any sequences below the minimum amino acid sequence length, the problematic sequence(s) and their translations will be stored in a directory for you to review.
@@ -1071,10 +1140,55 @@ Example launch command:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --pcASV --clusterAAIDlist .85,.90,.96 --stats
+## Minimum Entropy Decomposition (EXPERIMENTAL) - Oligotyping - https://merenlab.org/2012/05/11/oligotyping-pipeline-explained/
+
+In vAMPirus v2, we added the ability for the user to use the oligotyping program employing the Minimum Entropy Decomposition (MED) algorithm developed by Eren et al. 2015 (read more about MED here - https://www.nature.com/articles/ismej2014195#citeas) to cluster ASV or AminoType sequences.
+
+The MED algorithm provides an alternative way of clustering marker gene sequences using "information theory-guided decomposition" - "By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation." -Eren et al. 2015
+
+When you run the DataCheck pipeline with your dataset, the report will include a figure and table that break down the Shannon entropy analysis results for both ASVs and AminoTypes. The figure visualizes entropy values per sequence position, revealing positions or regions of high entropy. The table beneath the figure breaks down the number of positions with entropy values above "0.x". Alternatively, if you know the positions on your sequence that have the potential to contain biologically or ecologically meaningful mutations, you can specify decomposition based on these positions.
+
+If you decide to use MED, vAMPirus will run all the same analyses that would be done with the ASV or AminoType sequences (e.g. diversity analyses, statistics) and append the results to the ASV or AminoType report. The ASV or AminoType sequence nodes on the phylogenetic tree will also be colored based on which MED group they were assigned to.
+
+To add MED analysis to either the DataCheck or Analyze run you must add "--asvMED" and/or "--aminoMED" to the launch command (see examples below).
+
+There are two ways to utilize MED within the vAMPirus pipeline:
+
+ (1) Decomposition based on all sequence positions with an entropy value above "0.x" - Useful approach to preliminarily test influence of MED on your sequences
+
+ Example -> Entropy value table from the DataCheck report shows I have 23 ASV sequence positions that have Shannon entropy values above 0.1 and I would like to oligotype using all of these high entropy positions.
+
+ To use these 23 positions for MED clustering of ASVs, all I need to do is add the options "--asvMED" (signals use of MED on ASV sequences) and "--asvC 23" (specifies the number of high entropy positions to be used for MED - could also be done by editing "asvC="23"" in config file) to the launch command:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --asvMED --asvC 23
+
+ After this run completes successfully and I move the ASV report file to a safe place for review, if I want to see what happens when just the top 5 high-entropy positions are used instead, I can use the "-resume" option and run:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --asvMED --asvC 5 -resume
+
+
+ (2) Decomposition based on specific sequence positions that may contain biologically/ecologically meaningful differences.
+
+ Example -> I know that amino acid differences at certain positions on my AminoTypes are ecologically meaningful (e.g. correlate with host range) and I would like to perform MED with these positions only.
+
+ To do this, similar to the example above, I will add "--aminoMED" and "--aminoC" to the launch command:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --aminoMED --aminoC 2,3,4,25,34,64 -resume
+
+MED related options within the configuration file:
+
+ // Minimum Entropy Decomposition (MED) parameters for clustering (https://merenlab.org/2012/05/11/oligotyping-pipeline-explained/)
+
+ // If you plan to do MED on ASVs using the option "--asvMED" you can set here the number of entropy peak positions or a comma separated list of biologically meaningful positons (e.g. 35,122,21) for oligotyping to take into consideration. vAMPirus will automatically detect a comma separated list of positions, however, if you want to use a single specific position, make "asvSingle="true"".
+ asvC=""
+ asvSingle=""
+ // If you plan to do MED on ASVs using the option "--aminoMED" you can set here the number of entropy peak positions or a comma separated list of biologically meaningful positons (e.g. 35,122,21) for oligotyping to take into consideration. vAMPirus will automatically detect a comma separated list of positions, however, if you want to use a single specific position, make "aminoSingle="true"".
+ aminoC=""
+ aminoSingle=""
## Counts tables and percent ID matrices
-vAMPirus generate nucleotide-based counts tables using vsearch and protein-based counts tables using DIAMOND and a custom bash script. Counts tables and percent ID matrices are always produced for each ASV, AminoType and all cASV fasta files produced.
+vAMPirus generates nucleotide-based counts tables using vsearch and protein-based counts tables using DIAMOND and a custom bash script. Counts tables and percent ID matrices are always produced for each ASV, AminoType and all cASV fasta files produced.
Here are the parameters you can edit at lines 61-70:
@@ -1093,13 +1207,19 @@ The "--asvcountID" is the percent ID during global alignment that vsearch would
Protein-based counts file generation has a few more parameters the user can alter: "--ProtCountsBit" is the minimum bitscore for an alignment to be recorded, "--ProtCountID" is the minimum percent amino acid similarity an alignment needs to have to be recorded, and "--ProtCountsLength" is the minimum alignment length for a hit to be recorded.
-
## Phylogenetic analysis and model testing
-Phylogenetic trees are produced automatically for ASVs (unless --ncASV specified), ncASVs, pcASVs and AminoTypes using IQ-TREE. All produced sequence fastas are aligned using the MAFFT algorithm then alignments are trimmed automatically using TrimAl.
+Phylogenetic trees are produced automatically for ASVs (unless --ncASV is specified), ncASVs, pcASVs and AminoTypes using IQ-TREE. All produced sequence fastas are aligned with muscle v5 (which replaced MAFFT in version 2.0.0) and the alignments are then trimmed automatically using TrimAl.
Post alignment and trimming, there is some flexibility in this process where you can specify a few different aspects of the analysis:
+### Coloring nodes on produced trees in the Analyze report
+
+You can tell vAMPirus to color nodes on produced phylogenies based on taxonomy or Minimum Entropy Decomposition Group ID. Edit the option below in the config file.
+
+// Color nodes on phylogenetic tree in Analyze report with MED Group information (nodeCol="MED") or taxonomy hit (nodeCol="TAX"). If you would like nodes colored by sequence ID, leave nodeCol="" below.
+ nodeCol=""
+
### Substitution model testing
ModelTest-NG is always run to determine the best substitution model and all of its output is stored for the user's review.
@@ -1115,12 +1235,13 @@ By default IQTREE will determine the best model to use with ModelFinder Plus.
### Bootstrapping
-IQ-TREE is capable of performing parametric or non-parametric bootstrapping. You can specify which one using "--parametric" or "--nonparametric" and to set how many boostraps to perform, you would use "--boots #ofbootstraps" or edit line 114 in the vampirus.config file.
+IQ-TREE is capable of performing parametric or non-parametric bootstrapping. You can specify which one using "--parametric" or "--nonparametric" and to set how many bootstraps to perform, you would use "--boots #ofbootstraps" or edit line 114 in the vampirus.config file.
-Here is an example for creating a tree using the model determined by ModelTest-NG, non-parametric boostrapping and 500 bootstraps:
+Here is an example for creating a tree using the model determined by ModelTest-NG, non-parametric bootstrapping and 500 bootstraps:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --pcASV --clusterAAIDlist .85,.90,.96 --ModelTnt --ModelTaa --nonparametric --boots 500
+
### Custom IQ-TREE command
The default IQ-TREE command looks like this:
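A sketch of its shape (not the verbatim command; "-m MFP" invokes ModelFinder Plus as described above, and the prefix/thread flags mirror the custom example below):

    iqtree -s alignment.fasta --prefix TestRun -m MFP --redo -T auto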
@@ -1160,29 +1281,73 @@ AminoType IQTREE command ->
iqtree -s aminotype_alignment.fasta --prefix TestRun --redo -T auto -option1 A -option2 B -option3 C
-
## Taxonomy Inference
-vAMPirus uses Diamond blastx/blastp and the provided protein database to assign taxonomy to ASVs/cASVs/AminoTypes. There are summary files generated, one as a phyloseq object and the other as a .tsv with information in a different
- arrangement. Results are also visualized as a donut graph in the final reports. You can adjust the following parameters:
+vAMPirus uses DIAMOND blastx/blastp and the provided protein database to infer the taxonomy of ASVs/cASVs/AminoTypes. Two summary files are generated: one in a phyloseq-compatible format and the other a .tsv with the information in a different arrangement. Results are also visualized as a donut graph in the final reports.
+
+First, let's take a look at the taxonomy section of the configuration file:
+
+Block 1 -
+
+ // Taxonomy inference parameters
+
+ //Parameters for diamond command
+ // Set minimum bitscore for best hit in taxonomy assignment
+ bitscore="50"
+ // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment
+ minID="40"
+ // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment
+ minaln="30"
+ // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive)
+ sensitivity="ultra-sensitive"
+
+The first block of options is related to the DIAMOND command.
+
+Block 2 -
+
+ // Database information
+ // Specify name of database to use for analysis
+ dbname="DATABASENAME"
+ // Path to Directory where database is being stored - vAMPirus will look here to make sure the database with the name provided above is present and built
+ dbdir="DATABASEDIR"
+ // Set database type (NCBI or RVDB). Lets vAMPirus know which sequence header format is being used; must be set to NCBI when using RefSeq or Non-Redundant databases. -> dbtype="NCBI" to toggle use of RefSeq header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers (see manual)
+ dbtype="TYPE"
- // Taxonomy inference parameters
- // Specify name of database to use for analysis
- dbname="DATABASENAME.FASTA"
- // Path to Directory where database is being stored
- dbdir="/PATH/TO/DATABASE/DIRECTORY"
- // Toggle use of RefSeq header format; default is Reverence Viral DataBase (RVDB)
- refseq="F"
+The second block of options concerns the database that will be used for the analysis. "dbname" should be the name of the reference fasta file; for example, if using the RVDB, dbname would be "U-RVDBv21.0-prot.fasta". The "dbtype" option signals which sequence header format is used in the reference database. To use the DIAMOND taxonomy assignment feature (see below), you must be using NCBI-style sequence headers.
-This was mentioned in an earlier section of the docs, but before running vAMPirus without the "--skipTaxonomy" option set, you should add the name and path of the database you would like to use at lines 74 and 76 in the config file,
- respectively. The database, currently, needs to be protein sequences and be in fasta format. The database can be a custom database but for proper reporting of the results, the headers need to follow either RVDB or RefSeq
- header formats:
+The database currently needs to contain protein sequences in fasta format. It can be a custom database, but for proper reporting of the results, the headers need to follow either the RVDB or NCBI/RefSeq header format:
1. RVDB format (default) -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"
2. NCBI NR/RefSeq format -> ">KJX92028.1 hypothetical protein TI39_contig5958g00003 [Zymoseptoria brevis]"
-NOTE: By default, vAMPirus assumes the headers are in RVDB format, to trigger the use of NCBI RefSeq format, edit the "F" to "T" at line 78 in the config file or add "--refseq T" to the launch command.
+Block 3 -
+
+ // Classification settings - if planning on inferring LCA from RVDB annotation files OR using NCBI taxonomy files, confirm options below are accurate.
+ // Path to the directory containing the RVDB hmm annotation .txt file(s) - see manual for information on this. Leave as is if not planning on using RVDB LCA.
+ dbanno="DATABASEANNOT"
+ // Set lca="T" if you would like to add "Lowest Common Ancestor" classifications to taxonomy results using information provided by RVDB annotation files (works when using NCBI or RVDB databases) - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"
+ lca="LCAT"
+ // DIAMOND taxonomy inference using NCBI taxmap files (can be downloaded using the startup script using the option -t); set to "true" for this to run (ONLY WORKS WITH dbtype="NCBI")
+ ncbitax="false"
+
+The third block of options covers the two different methods that can be used to get putative taxonomic classifications for your sequences:
+
+1. Grabbing "Lowest Common Ancestor" (LCA) information from the annotation files associated with the Reference Virus DataBase (RVDB; https://rvdb-prot.pasteur.fr/).
+
+By default, these annotation files are downloaded when you use the startup script to download any of the three possible databases. The "dbanno" variable is the path to the annotation files; if you used the startup script, this will have been set automatically.
+
+The LCA feature works by searching the RVDB annotation files for the accession number of the aligned-to reference sequence, so this feature can be used when "dbtype" equals either "NCBI" or "RVDB". Please note, however, that the aligned-to reference sequence might not be found in the annotation files, and some sequences may not have LCA information available within their annotation entries. This is a quick way to assign some degree of classification that might help in future binning or figure making.
+
+You can turn on this feature of the pipeline by editing line 121 in the vampirus.config file, setting "lca="true"", or you can set this in the launch command like so:
+
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --lca true
+
+2. Using the NCBI taxonomy files (https://www.ncbi.nlm.nih.gov/taxonomy ; https://www.ncbi.nlm.nih.gov/books/NBK53758/) and the DIAMOND taxonomy assignment feature.
+
+You can tell vAMPirus to use the DIAMOND taxonomy assignment feature that leverages the taxonomy identifier (TaxId) of aligned-to reference sequences. It will then add putative taxonomic information to your sequences in the *quick_TaxBreakdown.csv file stored in the vAMPirus results directory.
+
+NOTE=> To use the DIAMOND taxonomy feature and the NCBI taxonomy files you must be using the NCBI header format. So, this feature will only be used when "dbtype="NCBI"" and "ncbitax="true"" within the configuration file.
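In other words, the relevant configuration combination for this feature is:

    dbtype="NCBI"
    ncbitax="true"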
## EMBOSS Analyses
@@ -1208,19 +1373,15 @@ NOTE=> Be sure that there is more than 1 sample in each treatment category or th
# vAMPirus output
-There are several files created throughout the vAMPirus pipeline that are stored and organized in directories within the specified results/output directory (ex. ${working_directory}/results; line 27 in the configuration file). We will go through the structure of the output directory and where to find which files here:
+There are several files created throughout the vAMPirus pipeline that are stored and organized in directories within the specified results/output directory (ex. ${working_directory}/results; line 24 in the configuration file). We will go through the structure of the output directory and where to find which files here:
-## Pipeline performance information - ${working_directory}/results/PipelinePerformance/
+## Pipeline performance information - ${working_directory}/${outdir}/PipelinePerformance/
Nextflow produces a couple of files that break down how, and which parts of, the vAMPirus pipeline were run. The first is a report on how the pipeline performed, including performance-related information like how much memory or how many CPUs were used during certain processes. You can use this information to alter how many resources you request for a given task in the pipeline. For example, if the counts table generation process takes a long time with the current resources requested, you can see this in the report and then edit the resources requested at lines 241-261 in the vAMPirus configuration file. The second file produced by Nextflow is a visualization of the workflow and analyses run, presented as a flowchart.
-## Output of "--DataCheck" - ${working_directory}/results/DataCheck
-
-The DataCheck performed by vAMPirus includes "ReadProcessing", "Clustering", and "Report" generation. Here again is the launch command to run the DataCheck mode:
-
- `nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+## Output of read processing - ${working_directory}/${outdir}/ReadProcessing
-### ReadProcessing - ${working_directory}/results/DataCheck/ReadProcessing
+### ReadProcessing - ${working_directory}/${outdir}/ReadProcessing
Within the ReadProcessing directory you will find all files related to each step of read processing:
@@ -1240,7 +1401,13 @@ Similar to the adapter removal directory, here you have the clean read libraries
There is a little bit more going on in this directory compared to the others. The first major file to pay attention to here is \*\_merged_clean_Lengthfiltered.fastq. This is the "final" merged read file that contains all merged reads from each sample and is used to identify unique sequences and then ASVs. "Pre-filtered" and "pre-cleaned" combined merged read files can be found in "./LengthFiltering". If you would like to review or use the separate merged read files per sample, these fastq files are found in the "./Individual" directory. Finally, a fasta file with unique sequences is found in the "./Uniques" directory, and the "./Histograms" directory holds histogram files for several different sequence properties (length, per-base quality, etc.), which can be visualized manually and are reviewed in the DataCheck report.
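
For a quick manual sanity check of these outputs (file names illustrative; fastq records are four lines each):

    # count merged, length-filtered reads
    echo $(( $(wc -l < all_merged_clean_Lengthfiltered.fastq) / 4 ))
    # count unique sequences in the Uniques fasta
    grep -c "^>" ./Uniques/*.fasta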
-### Clustering - ${working_directory}/results/DataCheck/Clustering
+## Output of "--DataCheck" - ${working_directory}/${outdir}/DataCheck
+
+The DataCheck performed by vAMPirus includes "ReadProcessing", "Clustering", and "Report" generation. Here again is the launch command to run the DataCheck mode:
+
+ `nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+
+### Clustering - ${working_directory}/${outdir}/DataCheck/Clustering
As the name would suggest, the files within this directory are related to the clustering process of "--DataCheck". There isn't too much in here, but here is the breakdown anyway:
@@ -1250,13 +1417,13 @@ In this directory, there is the fasta file with the generated ASV sequences and
2. Nucleotide - ${working_directory}/results/DataCheck/Clustering/Nucleotide
-This directory stores a .csv file that shows the number of clusters or ncASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report.
+This directory stores a .csv file that shows the number of clusters or ncASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report. In this directory you will also find Shannon entropy analysis result files.
3. Aminoacid - ${working_directory}/results/DataCheck/Clustering/Aminoacid
-Similar to Nucleotide, the Aminoacid directory contained the .csv that shows the number of clusters or pcASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report.
+Similar to Nucleotide, the Aminoacid directory contains the .csv that shows the number of clusters or pcASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report. In this directory you will also find Shannon entropy analysis result files.
-### Report - ${working_directory}/results/DataCheck/Report
+### Report - ${working_directory}/${outdir}/DataCheck/Report
In this directory, you will find a .html DataCheck report that can be opened in any browser. The report contains the following information and is meant to allow the user to tailor their vAMPirus pipeline run to their data (i.e. maximum read length, clustering percentage, etc.):
@@ -1275,59 +1442,39 @@ In this section of the report, vAMPirus is showing the number of nucleotide- and
NOTE: Most, if not all, plots in vAMPirus reports are interactive meaning you can select and zoom on certain parts of the plot or you can use the legend to remove certain samples.
-## Output of "--Analyze" - ${working_directory}/results/Analyze
+## Output of "--Analyze" - ${working_directory}/${outdir}/Analyze
Depending on which optional arguments you add to your analyses (e.g. --pcASV, --ncASV, skip options), different files will be produced; here we will go through the output of the full analysis stemming from this launch command:
nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats
-### ReadProcessing - ${working_directory}/results/Analyze/ReadProcessing
-
-Very similar to the "ReadProcessing" directory created in DataCheck, you will find the following:
-
-1. FastQC - ${working_directory}/results/Analyze/ReadProcessing/FastQC
-
-In this directory you will find FastQC html reports for pre-cleaned and post-cleaned individual read libraries.
-
-2. AdapterRemoval - ${working_directory}/results/Analyze/ReadProcessing/AdapterRemoval
-
-Here we have resulting fastq files with adapter sequences removed. Fastp also generates its own reports which can also be found in "./fastpOut".
-
-3. PrimerRemoval - ${working_directory}/results/Analyze/ReadProcessing/PrimerRemoval
-
-Similar to the adapter removal directory, here you have the clean read libraries that have had adapter and primer sequences removed.
-
-4. ReadMerging - ${working_directory}/results/Analyze/ReadProcessing/ReadMerging
-
-There is a little bit more going on in this directory compared to the others. The first major file to pay attention to here is the file \*\_merged_clean_Lengthfiltered.fastq. This is the "final" merged read file that contains all merged reads from each samples and is used to identify unique sequences and then ASVs. "Pre-filtered" and "pre-cleaned" combined merged read files can be found in "./LengthFiltering". If you would like to review or use the separate merged read files per sample, these fastq files are found in the "./Individual" directory. Finally, a fasta file with unique sequences are found in the "./Uniques" directory and the "./Histograms" directory is full of several different sequence property (length, per base quality, etc.) histogram files which can be visualized manually and reviewed in the DataCheck report if ran before Analyze.
-
-### Clustering - ${working_directory}/results/Analyze/Clustering
+### Clustering - ${working_directory}/${outdir}/Analyze/Clustering
The clustering directory will contain all files produced for whichever clustering technique you specified (with the launch command above, all are specified):
-1. ASVs - ${working_directory}/results/Analyze/Clustering/ASVsIn this directory, there is the fasta file with the generated ASV sequences and there is another directory "./ChimeraCheck" where the pre-chimera filered ASV fasta sits.
+1. ASVs -- ${working_directory}/results/Analyze/Clustering/ASVs -- In this directory, there is the fasta file with the generated ASV sequences and there is another directory "./ChimeraCheck" where the pre-chimera-filtered ASV fasta sits. If --asvMED was set to run, you will also find a MED/ directory containing all output files from the oligotyping analyses.
-2. AminoTypes - ${working_directory}/results/Analyze/Clustering/AminoTypesThe AminoTypes directory has a few different subdirectories, in the main directory, however, is the fasta file with the AminoTypes used in all subsequent analyses. The first subdirectory is called "Translation" which includes the raw ASV translation file along with a report spit out by VirtualRibosome. The next subdirectory is "Problematic", where any translations that were below the given "--minAA" length will be reported, if none were deemed "problematic" then the directory will be empty. All problematic amino acid sequence AND their corresponding ASVs are stored in fasta files for you to review. The final subdirectory is "SummaryFiles" where you can find a "map" of sorts to track which ASVs contributed to which AminoTypes and a .gc file containing information on length of translated sequences.
+2. AminoTypes -- ${working_directory}/results/Analyze/Clustering/AminoTypes -- The AminoTypes directory has a few different subdirectories; in the main directory, however, is the fasta file with the AminoTypes used in all subsequent analyses. The first subdirectory is called "Translation", which includes the raw ASV translation file along with a report produced by VirtualRibosome. The next subdirectory is "Problematic", where any translations that were below the given "--minAA" length are reported; if none were deemed "problematic" then the directory will be empty. All problematic amino acid sequences AND their corresponding ASVs are stored in fasta files for you to review. The final subdirectory is "SummaryFiles", where you can find a "map" of sorts to track which ASVs contributed to which AminoTypes and a .gc file containing information on the length of translated sequences. If --aminoMED was set to run, you will also find a MED/ directory containing all output files from the oligotyping analyses.
-3. ncASV - ${working_directory}/results/Analyze/Clustering/ncASVIn this directory, you will find the fasta files corresponding to the clustering percentage(s) you specified for the run.
+3. ncASV -- ${working_directory}/results/Analyze/Clustering/ncASV -- In this directory, you will find the fasta files corresponding to the clustering percentage(s) you specified for the run.
-4. pcASV - ${working_directory}/results/Analyze/Clustering/pcASVLooking in this directory, you probably notice some similar subdirectories. The pcASV directory also contains the Summary, Problematic, and Translation subsirectories we saw in the AminoType directory. The other important files in this directory is the nucleotide and amino acid versions of the pcASVs generated for whichever clustering percentage(s) specified.An important note for when creating pcASVs is that the subsequent analyses (phylogenies, taxonomy assignment, etc.) are run on both nucloetide and amino acid pcASV fastas. To create these files, vAMPirus translates the ASVs, checks for problematic sequences, then clusters the translated sequences by the given percentage(s). After clustering, vAMPirus will go pcASV by pcASV extracting the nucleotide sequences of the ASVs that clustered within a given pcASV. The extracted nucleotide sequences are then used to generate a consensus nucleotide sequence(s) per pcASV.
+4. pcASV -- ${working_directory}/results/Analyze/Clustering/pcASV -- Looking in this directory, you will probably notice some familiar subdirectories. The pcASV directory also contains the Summary, Problematic, and Translation subdirectories we saw in the AminoType directory. The other important files in this directory are the nucleotide and amino acid versions of the pcASVs generated for whichever clustering percentage(s) you specified. An important note when creating pcASVs is that the subsequent analyses (phylogenies, taxonomy assignment, etc.) are run on both the nucleotide and amino acid pcASV fastas. To create these files, vAMPirus translates the ASVs, checks for problematic sequences, then clusters the translated sequences by the given percentage(s). After clustering, vAMPirus goes pcASV by pcASV, extracting the nucleotide sequences of the ASVs that clustered within a given pcASV. The extracted nucleotide sequences are then used to generate a consensus nucleotide sequence per pcASV.
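+
+Putting the four items above together, the expected layout of the Clustering directory is roughly the following (names taken from the descriptions above; the MED/ directories appear only when --asvMED/--aminoMED were set):
+
+    Clustering/
+    ├── ASVs/          # ASV fasta, ./ChimeraCheck, MED/ (if --asvMED)
+    ├── AminoTypes/    # AminoType fasta, Translation/, Problematic/, SummaryFiles/, MED/ (if --aminoMED)
+    ├── ncASV/         # ncASV fastas per clustering percentage
+    └── pcASV/         # nucleotide and amino acid pcASV fastas, Summary/Problematic/Translation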
-### Analyses - ${working_directory}/results/Analyze/Analyses
+### Analyses - ${working_directory}/${outdir}/Analyze/Analyses
For each clustering technique (i.e. ASVs, AminoTypes, ncASVs and pcASVs) performed in a given run, resulting taxonomic unit fastas will go through the following analyses (unless skip options are used):
-1. Counts - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/CountsThe Counts directory is where you can find the counts tables as .csv files (and .biome as well for nucleotide counts tables).
+1. Counts -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Counts -- The Counts directory is where you can find the counts tables as .csv files (and .biome as well for nucleotide counts tables).
-2. Phylogeny - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/PhylogenyUnless told otherwise, vAMPirus will produce phylogenetic trees for all taxonomic unit fastas using IQ-TREE. The options for this analysis was discussed in a previous section of the docs. In the phylogeny output directory, you will find three subdirectories: (i) ./Alignment - contains trimmed MAFFT alignment used for tree, (ii) ./ModelTest - contains output files from substitution model prediction with ModelTest-NG, and (iii) ./IQ-TREE - where you can find all output files from IQ-TREE with the file of (usual) interest is the ".treefile".
+2. Phylogeny -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Phylogeny -- Unless told otherwise, vAMPirus will produce phylogenetic trees for all taxonomic unit fastas using IQ-TREE. The options for this analysis were discussed in a previous section of the docs. In the phylogeny output directory, you will find three subdirectories: (i) ./Alignment - contains the trimmed MAFFT alignment used for the tree, (ii) ./ModelTest - contains output files from substitution model prediction with ModelTest-NG, and (iii) ./IQ-TREE - where you can find all output files from IQ-TREE, with the file of (usual) interest being the ".treefile".
-3. Taxonomy - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/TaxonomyvAMPirus uses Diamond blastp/x and the supplied PROTEIN database for taxonomy assignment of sequences. In the Taxonomy directory, you will find (i) a subdirectory called "DiamondOutput" which contains the original output file produced by Diamond, (ii) a fasta file that has taxonomy assignments within the sequence headers, and (iii) three different summary files (one being a phyloseq object with taxonomic information, a tab-separated summary file for review by the user and a summary table looking at abundance of specific hits).
+3. Taxonomy -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Taxonomy -- vAMPirus uses DIAMOND blastp/x and the supplied PROTEIN database for taxonomy assignment of sequences. In the Taxonomy directory, you will find (i) a subdirectory called "DIAMONDOutput" which contains the original output file produced by DIAMOND, (ii) a fasta file that has taxonomy assignments within the sequence headers, and (iii) three different summary files (one being a phyloseq object with taxonomic information, a tab-separated summary file for review by the user and a summary table looking at abundance of specific hits).
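+
+For orientation, the search behind this step is conceptually similar to a standard DIAMOND call like the sketch below (illustrative only - vAMPirus sets its own flags, database, and output fields):
+
+    diamond blastp --query AminoTypes.fasta --db RVDB.dmnd \
+        --evalue 1e-5 --max-target-seqs 1 --outfmt 6 --out taxonomy_hits.tsv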
-4. Matrix - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/MatrixThe Matric directory is where you can find all Percent Identity matrices for produced ASV/cASV/AmintoType fastas.
+4. Matrix -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Matrix -- The Matrix directory is where you can find all percent identity matrices for the produced ASV/cASV/AminoType fastas.
-5. EMBOSS - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/EMBOSSSeveral different protein physiochemical properties for all amino acid sequences are assessed using EMBOSS scripts (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/groups.html). There are four different subdirectories within EMBOSS, these include (i) ./ProteinProperties - contains files and plots regarding multiple different physiochemical properties (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/pepstats.html), (ii) ./IsoelectricPoint - contains a text file and a .svg image with plots showing the isoelectric point of protein (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/iep.html), (iii) ./HydrophobicMoment - information related to hydrophobic moments of amino acid sequences (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/hmoment.html), and (iv) ./2dStructure - information about 2D structure of proteins (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/protein_2d_structure_group.html).
+5. EMBOSS -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/EMBOSS -- Several different protein physiochemical properties for all amino acid sequences are assessed using EMBOSS scripts (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/groups.html). There are four different subdirectories within EMBOSS: (i) ./ProteinProperties - contains files and plots covering multiple physiochemical properties (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/pepstats.html), (ii) ./IsoelectricPoint - contains a text file and a .svg image with plots showing the isoelectric point of each protein (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/iep.html), (iii) ./HydrophobicMoment - information related to hydrophobic moments of the amino acid sequences (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/hmoment.html), and (iv) ./2dStructure - information about the 2D structure of the proteins (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/protein_2d_structure_group.html).
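+
+If you want to reproduce one of these analyses by hand, the EMBOSS tools can be called directly; for example (input/output file names illustrative):
+
+    pepstats -sequence AminoTypes.fasta -outfile AminoTypes.pepstats
+    iep -sequence AminoTypes.fasta -outfile AminoTypes.iep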
-### FinalReport - ${working_directory}/results/Analyze/FinalReport
+### FinalReport - ${working_directory}/${outdir}/Analyze/FinalReports
vAMPirus produces final reports for all taxonomic unit fastas produced in the run. These reports contain the following information:
@@ -1343,7 +1490,7 @@ This is a plot that looks at number of reads per sample, similar to what is seen
4. Diversity analyses box Plots
-The plots in order are (i) Shannon Diversity, (ii) Simpson Diversity, (iii) Species Richness.
+The plots in order are (i) Shannon Diversity, (ii) Simpson Diversity, (iii) Richness.
Stats tests included with "--stats":
@@ -1358,11 +1505,11 @@ Stats tests included with "--stats":
5. Distance to centroid box plot
-6. NMDS plots (2D and 3D)
+6. NMDS plots (2D and 3D) -- PCoA (2D and 3D) if NMDS does not converge
-7. Relative ASV/cASV abundance per sample bar chart
+7. Relative sequence abundance per sample bar chart
-8. Absolute ASV/cASV abundance per treatment bar chart
+8. Absolute sequence abundance per treatment bar chart
9. Pairwise percent ID heatmap
@@ -1370,12 +1517,14 @@ Stats tests included with "--stats":
11. Visualized phylogenetic tree
+12. Post-Minimum Entropy Decomposition Analyses (combination of above)
+
# All of the options
-Usage:
+Usage:
- nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV]
+ nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV] [--asvMED] [--aminoMED] [--stats]
--Help options--
@@ -1399,6 +1548,12 @@ Usage:
--pcASV Set this option to have vAMPirus cluster nucleotide and translated ASVs into protein-based operational taxonomic units (pcASVs) - See options below to define a single percent similarity or a list
+--Minimum Entropy Decomposition arguments--
+
+ --asvMED Set this option to perform Minimum Entropy Decomposition on ASV sequences, see manual for more information. You will need to set a value for --asvC to perform this analysis
+
+ --aminoMED Set this option to perform Minimum Entropy Decomposition on AminoType sequences, see manual for more information. You will need to set a value for --aminoC to perform this analysis
+
--Skip options--
--skipReadProcessing Set this option to skip all read processing steps in the pipeline
@@ -1407,7 +1562,9 @@ Usage:
--skipAdapterRemoval Set this option to skip adapter removal in the pipeline
- --skipPrimerRemoval Set this option to skup Skip primer removal process
+ --skipPrimerRemoval Set this option to skip primer removal process
+
+ --skipMerging Set this option to skip read merging
--skipAminoTyping Set this option to skip AminoTyping processes
@@ -1415,6 +1572,10 @@ Usage:
--skipPhylogeny Set this option to skip phylogeny processes
+ --skipEMBOSS Set this option to skip EMBOSS processes
+
+ --skipReport Set this option to skip html report generation
+
**NOTE** Most options below can be set using the configuration file (vampirus.config) to avoid a lengthy launch command.
--Project/analysis information--
@@ -1438,6 +1599,12 @@ Usage:
--maxEE Use this option to set the maximum expected error rate for vsearch merging. Default is 1.
+ --diffs Maximum number of non-matching nucleotides allowed in overlap region.
+
+ --maxn Maximum number of "N"'s in a sequence - if above the specified value, sequence will be discarded.
+
+ --minoverlap Minimum length of overlap for sequence merging to occur for a pair.
+
--Primer removal--
@@ -1449,7 +1616,7 @@ Usage:
--minkmer Minimum kmer length for primer removal (default = 3)
- --minilen Minimum read length after adapter and primer removal (default = 200)
+ --minilen Minimum non-merged read length after adapter and primer removal (default = 100)
Single primer set removal-
@@ -1470,18 +1637,23 @@ Usage:
--alpha Alpha value for denoising - the higher the alpha the higher the chance of false positives in ASV generation (1 or 2)
- --minSize Minimum size or representation for sequence to be considered in ASV generation
+    --minSize                       Minimum size or representation in the dataset for a sequence to be considered in ASV generation
--clusterNuclID With --ncASV set, use this option to set a single percent similarity to cluster nucleotide ASV sequences into ncASVs by [ Example: --clusterNuclID .97 ]
- --clusterNuclIDlist With --ncASV set, use this option to perform nucleotide-based clustering of ASVs with a comma separated list of percent similarities [ Example: --clusterNuclIDlist .95,.96,.97,.98 ]
+ --clusterNuclIDlist With --ncASV set, use this option to perform nucleotide clustering with a comma separated list of percent similarities [ Example: --clusterNuclIDlist .95,.96,.97,.98 ]
- --clusterAAID With --pcASV set, use this option to set a single percent similarity for protein-based ASV clustering to generate pcASVs[ Example: --clusterAAID .97 ]
+    --clusterAAID                   With --pcASV set, use this option to set a single percent similarity for protein-based ASV clustering to generate pcASVs [ Example: --clusterAAID .97 ]
--clusterAAIDlist With --pcASV set, use this option to perform protein-based ASV clustering to generate pcASVs with a comma separated list of percent similarities [ Example: --clusterAAIDlist .95,.96,.97,.98 ]
--minAA With --pcASV set, use this option to set the expected or minimum amino acid sequence length of open reading frames within your amplicon sequences
+--Minimum Entropy Decomposition--
+
+ --asvC Number of high entropy positions to use for ASV MED analysis and generate "Groups"
+
+ --aminoC Number of high entropy positions to use for AminoType MED analysis and generate "Groups"
--Counts table generation--
@@ -1500,7 +1672,11 @@ Usage:
--dbdir Path to Directory where database is being stored
- --refseq Set "--refseq T" to toggle use of RefSeq header format; default is "F" to use Reverence Viral DataBase (RVDB) header
+    --headers                       Set taxonomy database header format -> headers="NCBI" to toggle use of NCBI header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers
+
+    --dbanno                        Path to the directory containing the hmm annotation .txt files - see manual for information on this. Leave as is if not planning on using.
+
+    --lca                           Set --lca true if you would like to add taxonomic classification to taxonomy results - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"
--bitscore Set minimum bitscore to allow for best hit in taxonomy assignment
@@ -1539,22 +1715,20 @@ Usage:
--trymax Maximum number of iterations performed by metaMDS
-
# Usage examples
Here are some example launch commands:
## Running the --DataCheck
-There is really only one launch command to run for --DataCheck:
-`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck [--asvMED] [--aminoMED]`
-Just submit this launch command with the correct paths and vAMPirus will run the DataCheck and produce a report for you to review. Parameters like --minAA or --maxLen apply to the --DataCheck.
+Just submit this launch command with the correct paths and vAMPirus will run the DataCheck and produce a report for you to review. Parameters like --minAA or --maxLen apply to the --DataCheck.
## Running the --Analyze
-### Run it all with a list of cluster IDs!
+### Run it all with a list of cluster IDs
`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --minLen 400 --maxLen 420 --clusterNuclIDlist .91,.92,.93 --clusterAAIDlist .91,.93,.95,.98`
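
### Add Minimum Entropy Decomposition

A sketch of an Analyze run that adds MED on both ASVs and AminoTypes (the number of high-entropy positions passed to --asvC/--aminoC is illustrative):

`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --asvMED --asvC 10 --aminoMED --aminoC 10`

### Tune read merging

The merging thresholds from the options list above can likewise be set on the command line (values illustrative):

`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --maxEE 1 --diffs 5 --maxn 0 --minoverlap 20`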
diff --git a/example_data/conf/ex_reports/example_Analyze_Report.html b/example_data/conf/ex_reports/example_Analyze_Report.html
deleted file mode 100644
index bc65412..0000000
--- a/example_data/conf/ex_reports/example_Analyze_Report.html
+++ /dev/null
@@ -1,882 +0,0 @@
-[882 lines of rendered example Analyze report removed: read statistics before/after adapter removal, reads per sample, rarefaction, diversity box plots (Shannon, Simpson, richness) with Shapiro/Bartlett/ANOVA/Tukey HSD console output, 2D and 3D NMDS run logs ending in "No Convergence", OTU abundance per sample and per treatment, pairwise percent-ID heatmap, taxonomy visualization, and an interactive IQ-TREE phylogenetic tree]