diff --git a/README.md b/README.md
index 4acdfe9..57b75fd 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 [![release](https://img.shields.io/github/v/release/Aveglia/vAMPirus?label=release&logo=github)](https://github.com/Aveglia/vAMPirus/releases/latest)

 # Table of contents
-
+* [New in vAMPirus version 2.0.0](#New-in-vAMPirus-version-2.0.0)
 * [Quick intro](#Quick-intro)
 * [Contact/support](#Contact/support)
 * [Getting started](#Getting-started)
@@ -21,21 +21,39 @@
 * [Running vAMPirus](#Running-vAMPirus)
 * [Who to cite](#Who-to-cite)

+# New in vAMPirus version 2.0.0
+
+1. (EXPERIMENTAL) Added Minimum Entropy Decomposition analysis using the oligotyping program produced by the Meren Lab. This allows for sequence clustering based on sequence positions of interest (biologically meaningful) or top positions with the highest Shannon's Entropy (read more here: https://merenlab.org/software/oligotyping/ ; and below).
+
+2. Added more useful taxonomic classification of sequences leveraging the RVDB annotation database and/or NCBI taxonomy files (see manual for more info).
+
+3. Replaced the use of MAFFT with MUSCLE v5 (Edgar 2021) for more accurate virus gene alignments (see https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full).
+
+4. Added multiple primer pair removal to deal with multiplexed amplicon libraries.
+
+5. ASV filtering - you can now provide a "filter" and "keep" database to remove certain sequences from the analysis.
+
+6. Reduced redundancy of processes and the volume of generated result files per full run (Example - read processing is only done once if running DataCheck then Analyze).
+
+7. Color nodes on phylogenetic trees based on Taxonomy or Minimum Entropy Decomposition results.
+
+8. PCoA plots added to Analyze report if NMDS does not converge.

 # Quick intro
-Viruses are the most abundant biological entities on the planet and with advances in next-generation sequencing technologies, there has been significant effort in deciphering the global virome and its impact in nature (Suttle 2007; Breitbart 2019). A common method for studying viruses in the lab or environment is amplicon sequencing, an economic and effective approach for investigating virus diversity and community dynamics. The highly targeted nature of amplicon sequencing allows in-depth characterization of genetic variants within a specific taxonomic grouping facilitating both virus discovery and screening within samples. Although, the high volume of amplicon data produced combined with the highly variable nature of virus evolution across different genes and virus-types can make it difficult to scale and standardize analytical approaches. Here we present vAMPirus (https://github.com/Aveglia/vAMPirus.git), an automated and easy-to-use virus amplicon sequencing analysis program that is integrated with the Nextflow workflow manager facilitation easy scalability and standardization of analyses.
+![vAMPirus general workflow](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vAMPirus_generalflow.png)
+
 The vAMPirus program contains two different pipelines:

 1. 
DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform future analyses

-![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_datacheckUPDATED.png)
+![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_datacheckV2.png)

 2. Analyze pipeline: a comprehensive analysis of the provided data producing a wide range of results and outputs which includes an interactive report with figures and statistics. NOTE- stats option has changed on 2/19/21; you only need to add "--stats" to the launch command without "run"

-![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_analysisUPDATED.png)
+![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_analyzeV2.png)

 NOTE => This is a brief overview of how to install and set up vAMPirus; for more detail see the [manual](https://github.com/Aveglia/vAMPirus/blob/master/docs/HelpDocumentation.md).

@@ -50,28 +68,24 @@ If you have a feature request or any feedback/questions, feel free to email vAMP

 ## Quick order of operations

-1. Clone vAMPirus from github
-2. Execute the vampirus_startup.sh script to install dependencies and any databases specified
-3. Test installation with supplied test dataset
-5. Launch the DataCheck pipeline with your dataset and adjust parameters if necessary
-6. Launch the Analyze pipeline with your dataset
-
-### Dependencies (see Who to cite section)
-
-1. Python versions 3.6/2.7
-2. Diamond version 0.9.30
-3. FastQC version 0.11.9
-4. fastp version 0.20.1
-5. Clustal Omega version 1.2.4
-6. IQ-TREE version 2.0.3
-7. ModelTest-NG version 0.1.6
-8. MAFFT version 7.464
-9. vsearch version 2.14.2
-10. BBMap version 38.79
-11. trimAl version 1.4.1
-12. CD-HIT version 4.8.1
-13. EMBOSS version 6.5.7.0
-14. seqtk version 1.3
+1. Clone vAMPirus from github
+
+2. Before launching vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases (NOTE: You will need to have the xz program installed before running the startup script when downloading the RVDB database)
+
+3. Test the vAMPirus installation with the provided test dataset (if you have run the startup script, you can see EXAMPLE_COMMANDS.txt in the vAMPirus directory for test commands and other examples; see also the example session after this list)
+
+4. Edit parameters in the vampirus.config file
+
+5. Launch the DataCheck pipeline to get summary information about your dataset (e.g. sequencing success, read quality information, clustering behavior of ASVs or AminoTypes)
+
+6. Change any parameters in the vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length, Shannon entropy analysis results)
+
+7. Launch the Analyze pipeline to perform a comprehensive analysis with your dataset
+
+8. Explore results directories and produced final reports
+
+
+### Installing dependencies (see Who to cite section)

 If you plan on using Conda to run vAMPirus, all dependencies will be installed as a Conda environment automatically with the vampirus_startup.sh script.
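As a quick illustration of steps 1-2 above, a hypothetical first session might look like the following (the clone URL is the one given in this README; run with `-h` first to see all startup options):

    git clone https://github.com/Aveglia/vAMPirus.git
    cd ./vAMPirus
    bash vampirus_startup.sh -h
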
@@ -136,11 +150,11 @@ The startup script provided in the vAMPirus program directory will install Conda

 ### vAMPirus startup script
-
 To set up and install vAMPirus dependencies, simply move to the vAMPirus directory and run the vampirus_startup.sh script.

     cd ./vAMPirus; bash vampirus_startup.sh -h
+
 >You can make the vampirus_startup.sh script an executable with -> chmod +x vampirus_startup.sh ; ./vampirus_startup.sh

@@ -152,37 +166,41 @@ You can also use the startup script to install different databases to use for vA

 2. The proteic version of the Reference Viral DataBase (RVDB) (See https://f1000research.com/articles/8-530)
 3. The complete NCBI NR protein database

-To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option.
+To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option and you can download the NCBI taxonomy files with the option "-t" (See below).

-If we look at the script usage:
+If we take a look at the vampirus_startup.sh script usage:

-    General execution:
+General execution:

-    vampirus_startup.sh -h [-d 1|2|3|4] [-s]
+vampirus_startup.sh -h [-d 1|2|3|4] [-s] [-t]

-    Command line options:
+    Command line options:

-        [ -h ]                       Print help information
+        [ -h ]                       Print help information

-        [ -d 1|2|3|4 ]               Set this option to create a database directiory within the current working directory and download the following databases for taxonomy assignment:
+        [ -d 1|2|3|4 ]               Set this option to create a database directory within the current working directory and download the following databases for taxonomy assignment:

-            1 - Download the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
-            2 - Download only NCBIs Viral protein RefSeq database
-            3 - Download only the complete NCBI NR protein database
-            4 - Download all three databases
+            1 - Download only the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530)
+            2 - Download only NCBIs Viral protein RefSeq database
+            3 - Download only the complete NCBI NR protein database
+            4 - Download all three databases

-        [ -s ]                       Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
+        [ -s ]                       Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)

+        [ -t ]                       Set this option to download NCBI taxonomy files needed for DIAMOND to assign taxonomic classification to sequences (works with NCBI type databases only, see manual for more information)

-For example, if you would like to install Nextflow, download NCBIs Viral protein RefSeq database, and check/install conda, run:
-    bash vampirus_startup.sh -d 1
+For example, if you would like to install Nextflow, download NCBI's Viral Protein RefSeq database and the NCBI taxonomy files to use the DIAMOND taxonomy assignment feature, and check/install conda, run:
+
+    bash vampirus_startup.sh -d 2 -t

 and if we wanted to do the same thing as above but skip the Conda check/installation, run:

-    bash vampirus_startup.sh -d 1 -s
+    bash vampirus_startup.sh -d 2 -s
+
+NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed. 
-NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed. Then move to the vAMPirus directory and run the test commands.
+**NEW in version 2.0.0** -> the startup script will automatically download annotation information from RVDB to infer Lowest Common Ancestor (LCA) information for hits during taxonomy assignment. You can also use "-t" to download NCBI taxonomy files to infer taxonomy using the DIAMOND taxonomy classification feature.

# Testing vAMPirus installation

@@ -203,11 +221,11 @@ OR

 ### Analyze test =>

-    /path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --stats
+    `/path/to/nextflow run /path/to/vAMPirus.nf -c /path/to/vampirus.config -profile conda,test --Analyze`

 OR

-    nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats
+    `nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze`

 # Running vAMPirus

@@ -215,11 +233,12 @@ OR

 If you have done the setup and confirmed installation success with the test commands, you are good to get going with your own data. Before getting started, edit the configuration file with the parameters and other options you plan to use. Here are some example vAMPirus launch commands:

+
 ### DataCheck pipeline =>

-Example 1. Launching the vAMPirus DataCheck pipeline using conda
+Example 1. Launching the vAMPirus DataCheck pipeline using conda and Shannon Entropy Analysis on ASVs and AminoTypes

-    nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck
+    `nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --asvMED --aminoMED`

 Example 2. Launching the vAMPirus DataCheck pipeline using Singularity and multiple primer removal with the path to the fasta file with the primer sequences set in the launch command

@@ -229,21 +248,19 @@ Example 3. Launching the vAMPirus DataCheck pipeline with primer removal by glob

     nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --GlobTrim 20,26

-
 ### Analyze pipeline =>

 Example 4. Launching the vAMPirus Analyze pipeline with Singularity with ASV and AminoType generation with all accessory analyses (taxonomy assignment, EMBOSS, IQ-TREE, statistics)

-    nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats
+    `nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats`

 Example 5. Launching the vAMPirus Analyze pipeline with conda to perform multiple primer removal and protein-based clustering of ASVs, but skip most of the extra analyses

     nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --pcASV --skipPhylogeny --skipEMBOSS --skipTaxonomy --skipReport

-Example 6. Launching vAMPirus Analyze pipeline with conda to produce only ASV-related results
-
-    nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats
+Example 6. Launching the vAMPirus Analyze pipeline with conda to produce only ASV- and AminoType-based results with Shannon Entropy Analyses, with the nodes on produced phylogenies colored based on taxonomy hit
+
+    `nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --asvMED --aminoMED --nodeCol TAX --stats`

## Resuming analyses =>

@@ -251,8 +268,7 @@ If an analysis is interrupted, you can use Nextflow's "-resume" option that will

 For example, if the analysis launched with the command from Example 6 above was interrupted, all you would need to do is add "-resume" to the end of the command like so:

-    nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats -resume
-
+    `nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --asvMED --aminoMED --nodeCol TAX --stats -resume`

 # Who to cite:

@@ -260,7 +276,7 @@ If you do use vAMPirus for your analyses, please cite the following ->

 1. vAMPirus - Veglia A.J., Rivera Vicéns R.E., Grupstra C.G.B., Howe-Kerr L.I., Correa A.M.S. (2021) vAMPirus: An automated, comprehensive virus amplicon sequencing analysis program (Version v1.0.1). Zenodo. http://doi.org/10.5281/zenodo.4549851

-2. Diamond - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
+2. DIAMOND - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176

 3. FastQC - Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

@@ -272,7 +288,7 @@

 7. ModelTest-NG - Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B., & Flouri, T. (2020). ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Molecular biology and evolution, 37(1), 291-294.

-8. MAFFT - Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
+8. MUSCLE v5 - R.C. Edgar (2021) "MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping" https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full.pdf

 9. vsearch - Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.

@@ -287,3 +303,5 @@

 14. seqtk - Li, H. (2012). seqtk Toolkit for processing sequences in FASTA/Q formats. GitHub, 767, 69.

 15. UNOISE algorithm - R.C. Edgar (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257
+
+16. Oligotyping - A. Murat Eren, Gary G. Borisy, Susan M. Huse, Jessica L. Mark Welch (2014). Oligotyping analysis of the human oral microbiome. 
Proceedings of the National Academy of Sciences Jul 2014, 111 (28) E2875-E2884; DOI: 10.1073/pnas.1409644111 diff --git a/bin/muscle5.0.1278_linux64 b/bin/muscle5.0.1278_linux64 new file mode 100644 index 0000000..53f3e8f Binary files /dev/null and b/bin/muscle5.0.1278_linux64 differ diff --git a/bin/vAMPirus_DC_Report.Rmd b/bin/vAMPirus_DC_Report.Rmd index 3320070..c986bb5 100644 --- a/bin/vAMPirus_DC_Report.Rmd +++ b/bin/vAMPirus_DC_Report.Rmd @@ -4,19 +4,12 @@ date: "Generated on: `r Sys.time()`" output: html_document params: interactive: TRUE - fastpcsv: !r commandArgs(trailingOnly=T)[2] - reads_per_sample_preFilt: !r commandArgs(trailingOnly=T)[3] - read_per_sample_postFilt: !r commandArgs(trailingOnly=T)[4] - preFilt_baseFrequency: !r commandArgs(trailingOnly=T)[5] - postFilt_baseFrequency: !r commandArgs(trailingOnly=T)[6] - preFilt_qualityScore: !r commandArgs(trailingOnly=T)[7] - postFilt_qualityScore: !r commandArgs(trailingOnly=T)[8] - preFilt_averageQuality: !r commandArgs(trailingOnly=T)[9] - postFilt_averageQuaulity: !r commandArgs(trailingOnly=T)[10] - preFilt_length: !r commandArgs(trailingOnly=T)[11] - postFilt_length: !r commandArgs(trailingOnly=T)[12] - number_per_percentage_nucl: !r commandArgs(trailingOnly=T)[13] - number_per_percentage_prot: !r commandArgs(trailingOnly=T)[14] + projtag: !r commandArgs(trailingOnly=T)[1] + skipReadProcessing: !r commandArgs(trailingOnly=T)[2] + skipMerging: !r commandArgs(trailingOnly=T)[3] + skipAdapterRemoval: !r commandArgs(trailingOnly=T)[4] + asvMED: !r commandArgs(trailingOnly=T)[5] + aminoMED: !r commandArgs(trailingOnly=T)[6] --- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, + message = FALSE, + warning = FALSE, + out.width="100%") +``` + +```{r pathways, echo=FALSE} +knitr::include_graphics("vamplogo.png") +``` + +```{r load_libraries, include=FALSE} +library(vegan) +library(tidyverse) +library(scales) +library(cowplot) +library(dplyr) +library(ggtree) +library(plotly) +library(knitr) +library(kableExtra) +library(rmarkdown) +library(processx) +library(ape) +``` + +```{r colors, include=FALSE} +mycol=c('#088da5','#73cdc8','#ff6f61','#7cb8df','#88b04b','#00a199','#6B5B95','#92A8D1','#b0e0e6','#ff7f50','#088d9b','#E15D44','#e19336') +``` +
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +NOTE: Most plots are interactive and you can use the legend to specify samples/treatment of interest. You can also download an .svg version of each figure within this report. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +
+

+## Pre- and Post-Adapter Removal Read Stats

+
+```{r readstats, echo=FALSE} +if (params$skipReadProcessing == "true" || params$skipMerging == "true" ) { + writeLines("\n--------------------------------------------------------------\n") + cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n') + writeLines("\n--------------------------------------------------------------\n") +} else { + if (params$skipAdapterRemoval == "false") { + reads_stats=read.csv("final_reads_stats.csv") + #paged_table(reads_stats,options = list(rows.print = 20)) + knitr::kable(reads_stats, digits = 2, align = 'c', booktabs = TRUE, caption = "Table 1: Read summary stats") %>% + kable_styling(font_size = 12, full_width = F)%>% + scroll_box(width = "100%", height = "100%") + } else { + writeLines("\n--------------------------------------------------------------\n") + cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n') + writeLines("\n--------------------------------------------------------------\n") + } +} +``` +
+
+
+
+### Total number of reads before and after adapter removal
+
+```{r readstats_plot, echo=FALSE}
+# Plot of total reads before and after adapter removal
+if (params$skipReadProcessing == "true" || params$skipMerging == "true" ) {
+  writeLines("\n--------------------------------------------------------------\n")
+  cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n')
+  writeLines("\n--------------------------------------------------------------\n")
+} else {
+  if (params$skipAdapterRemoval == "false") {
+    ptotal <- plot_ly(type="box",marker=list(colors=mycol))
+    ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_before, name="Reads before filtering")
+    ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_after, name="Reads after filtering")
+    #ptotal <- ptotal %>% layout(title=list(text="Number of reads before and after filtering"))
+    ptotal <- ptotal %>% layout(legend = list(x=10,y=.5))
+    ptotal <- ptotal %>% config(toImageButtonOptions=list(format='svg',filename='TotReads_b4_af_adaptrem', height= 500, width= 800, scale= 1))
+    ptotal
+  } else {
+    # adapter removal was skipped; print the read-filtering note file instead
+    writeLines("\n--------------------------------------------------------------\n")
+    cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n')
+    writeLines("\n--------------------------------------------------------------\n")
+  }
+}
+```
+
+
+### Forward (R1) and reverse (R2) read length before and after adapter removal
+
+```{r readstats_plot2, echo=FALSE}
+# Plot of R1 and R2 length before and after adapter removal
+if (params$skipReadProcessing == "true" || params$skipMerging == "true" ) {
+  writeLines("\n--------------------------------------------------------------\n")
+  cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n')
+  writeLines("\n--------------------------------------------------------------\n")
+} else {
+  if (params$skipAdapterRemoval == "false") {
+    pr <- plot_ly(y=reads_stats$R1_before_length, type="box", name="R1 length before")
+    pr <- pr %>% add_boxplot(y=reads_stats$R1_after_length, name="R1 length after")
+    pr <- pr %>% add_boxplot(y=reads_stats$R2_before_length, name="R2 length before")
+    pr <- pr %>% add_boxplot(y=reads_stats$R2_after_length, name="R2 length after")
+    #pr <- pr %>% layout(title = "R1 and R2 Length")
+    pr <- pr %>% layout(legend = list(x=10,y=.5))
+    pr <- pr %>% config(toImageButtonOptions=list(format='svg',filename='readlen_b4_af_adaptrem', height= 500, width= 800, scale= 1))
+    pr
+  } else {
+    # adapter removal was skipped; print the read-filtering note file instead
+    writeLines("\n--------------------------------------------------------------\n")
+    cat(readLines(list.files(pattern="filter_reads.txt")) , sep = '\n')
+    writeLines("\n--------------------------------------------------------------\n")
+  }
+}
+```
+
+
+
+```{bash load_datasets_bash, include=FALSE}
+if [ `ls *_ASV_Grouping.csv | wc -l` -ge 1 ];then
+  cat *_ASV_Grouping.csv >asv_medfile.csv
+  mv *_ASV_Groupingcounts.csv asv_groupcounts.csv
+  cat asv_groupcounts.csv | awk -F "," '{print $1","$2}' >tree_group.csv
+  mv *_ASV_Group_Reps_iq.treefile grouptree.txt
+elif [ `ls *_AminoType_Grouping.csv | wc -l` -ge 1 ];then
+  cat *_AminoType_Grouping.csv >amino_medfile.csv
+  mv *_AminoType_Groupingcounts.csv amino_groupcounts.csv
+  cat amino_groupcounts.csv | awk -F "," '{print $1","$2}' >tree_group.csv
+  mv *_AminoType_Group_Reps_iq.treefile grouptree.txt
+fi
+cat *_counts.csv >counts.csv
+cat *_PercentID.matrix >matrix.txt
+# check for the taxonomy summary file itself (not the PercentID matrix) to
+# decide whether taxonomy was run
+if [ `ls -1 *_summary_for_plot.csv | wc -l` -ge 1 ];then
+  cat *_summary_for_plot.csv >sum.csv
+else
+  echo "Taxonomy analysis was skipped" >tax.txt
+fi
+if [ `ls -1 *_iq.treefile | wc -l` -ge 1 ];then
+  cat *_iq.treefile >tree.txt
+else
+  echo "Phylogeny analysis was skipped" >tree.txt
+fi
+if [ `ls -1 *_quicker_taxbreakdown.csv | wc -l` -ge 1 ];then
+  cat *_quicker_taxbreakdown.csv > quicker_taxbreakdown.csv
+fi
+```
+```{r load_datasets, include=FALSE}
+sample_name="counts.csv"
+sample_metadata=params$metadata
+#sample_metadata="rna_virus_meta.csv"
+data<- read.csv(sample_name, check.names=FALSE)
+data2 <-as.data.frame(t(data))
+data2$sample <- row.names(data2)
+colnames(data2)<- as.matrix(data2[1,])
+data2 <- data2[-1,]
+
+# rename the column holding the sample names (named OTU_ID after the transpose) to "sample"
+data2 <- data2 %>%
+    rename(sample=OTU_ID)
+data2dim <- dim(data2)
+
+##Loading metadata
+samples <- read.csv(sample_metadata, header = TRUE)
+
+##Combining data and metadata
+data3 <- merge(data2, samples, by="sample")
+
+dim_data3 <- dim(data3)
+dim_samples <- dim(samples)
+cols <- dim_data3[2]-dim_samples[2]+1
+first <-colnames(data3)[2]
+last <- colnames(data3)[cols]
+data3[,2:cols] <- lapply(data3[,2:cols], as.character)
+data3[,2:cols] <- lapply(data3[,2:cols], as.numeric)
+
+#Calculate total reads per sample
+data4 <- data3%>%
+    mutate(sum=select(.,2:cols)%>%
+    apply(1, sum, na.rm=TRUE))
+
+```
+
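The chunk above transposes the combined counts table, merges it with the sample metadata, and then row-sums each sample's counts; every downstream plot builds on this shape. Purely as a reading aid, the same reshaping logic on a made-up two-sample counts table (sample names and the `treatment` column are invented):

```r
library(dplyr)

# Toy counts table in the same orientation as counts.csv: one row per sequence,
# one column per sample.
data <- data.frame(OTU_ID = c("ASV1", "ASV2", "ASV3"),
                   S1 = c(10, 0, 5),
                   S2 = c(3, 7, 2))

data2 <- as.data.frame(t(data))            # samples become rows
data2$sample <- row.names(data2)
colnames(data2) <- as.matrix(data2[1, ])   # first row holds the sequence IDs
data2 <- data2[-1, ]
data2 <- data2 %>% rename(sample = OTU_ID) # old header cell names this column

samples <- data.frame(sample = c("S1", "S2"), treatment = c("control", "heat"))
data3 <- merge(data2, samples, by = "sample")

cols <- ncol(data3) - ncol(samples) + 1    # index of the last counts column
data3[, 2:cols] <- lapply(data3[, 2:cols], as.character)
data3[, 2:cols] <- lapply(data3[, 2:cols], as.numeric)

data4 <- data3 %>% mutate(sum = select(., 2:cols) %>% apply(1, sum, na.rm = TRUE))
data4$sum                                  # total reads per sample: 15 12
```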

+## Number of Reads Per Sample

+
+```{r plot, echo=FALSE}
+# sample and count
+con <- plot_ly(data4, x = ~sum, y = ~sample, name = "Sample", type = 'scatter',
+        mode = "markers", marker = list(color = "#088da5"), hovertemplate = paste('Sample: %{y}','<br>Total reads: %{x}','<extra></extra>'))
+con <- con %>% layout(xaxis = list(title = "Total reads"),yaxis = list(title = "Sample"))
+con <- con %>% config(toImageButtonOptions=list(format='svg',filename='Counts_per_sample', height= 500, width= 800, scale= 1))
+con
+```
+
+
+
+```{r filter_data, include=FALSE} +##Filter samples with low reads +nfil=params$minimumCounts +#nfil=1000 +data5 <- data4 %>% + filter(sum>nfil) + #can cause errors +data5dim <-dim(data5) +minreads<-min(data5$sum) + +``` +
+
+
+
+
+ +

+## Rarefaction Curves

+```{r rarefaction, echo=FALSE, cache=FALSE} +##Rarefaction curves +rarefaction <- rarecurve(data5[,2:cols]) + +##rarefied dataset +raredata <- as.data.frame(rrarefy(data5[,2:cols], sample=minreads)) +``` +
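`rarecurve()` draws a species-accumulation curve per sample, and `rrarefy()` subsamples every sample down to the same depth (here the smallest library, `minreads`) so that the diversity indices below are not driven by uneven sequencing effort. A self-contained toy example of the same two calls (counts are invented):

```r
library(vegan)
set.seed(1)

# Invented community matrix: rows are samples, columns are ASVs, values are counts.
comm <- matrix(rpois(40, lambda = 20), nrow = 4,
               dimnames = list(paste0("S", 1:4), paste0("ASV", 1:10)))

rarecurve(comm)                                        # accumulation curve per sample
raredata <- rrarefy(comm, sample = min(rowSums(comm))) # subsample to the smallest library
rowSums(raredata)                                      # every sample now has equal depth
```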
+
+
+
+
+ + +

+## Diversity Analyses Plots

+ +
+
+ +### Shannon diversity + +
+```{r diversity_analysis, echo=FALSE} +metadata <- data5[,(cols+1):data5dim[2]] +metadata$sample <- data5$sample +index <-diversity(raredata, index= "shannon") +shannondata5 <- as.data.frame(index) +shannondata5$sample<- data5$sample +shannondata5_2 <- merge(shannondata5, metadata, by="sample") + +sh <- plot_ly(shannondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) +#sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) +sh <- sh %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) +sh <- sh %>% config(toImageButtonOptions=list(format='svg',filename='ShannonDiv', height= 500, width= 800, scale= 1)) +sh + +if (params$stats == "true" ) { + shannonaov <- aov(index ~ treatment, data= shannondata5_2) + st <- shapiro.test(resid(shannonaov)) + bt <- bartlett.test(index ~ treatment, data= shannondata5_2) + + if (st$p.value > .05 && bt$p.value > .05) { + print("Shapiro Test of normality - data is normal p-value > 0.05") + print(shapiro.test(resid(shannonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") + print(bartlett.test(index ~ treatment, data= shannondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA Results") + print(summary(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 + print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") + print(TukeyHSD(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + } else { + print("Shapiro Test of normality - data is normal if p-value > 0.05") + print(shapiro.test(resid(shannonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") + print(bartlett.test(index ~ treatment, data= shannondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("Data either not normal or variance not homogenous") + print("Kruskal-Wallis Test - test significant if p <.05") + #Kruskal-Wallis test - significant p <.05 + mykt <- kruskal.test(index ~ treatment, data= shannondata5_2) + print(mykt) + writeLines("\n--------------------------------------------------------------\n") + if (mykt$p.value < .05) { + #Pairwise comparison + print("Wilcox.test - pairwise comparison") + print(pairwise.wilcox.test(shannondata5_2$index, shannondata5_2$treatment, p.adjust.method = "BH")) + } else { + print("Data not significant. Skipping pairwise comparison") + } + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") + print(summary(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + } +} else { + print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") +} +``` +
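The same test cascade (Shapiro-Wilk on the model residuals plus Bartlett's test, then either ANOVA with Tukey HSD or a Kruskal-Wallis test with pairwise Wilcoxon) is repeated for Simpson diversity, richness, and the post-MED indices below. Purely as a reading aid, here is that decision tree condensed into one hypothetical helper (not part of the report itself):

```r
# Hypothetical condensation of the repeated block above; `df` must contain the
# columns `index` (numeric) and `treatment` (grouping), as built for each metric.
test_index_by_treatment <- function(df) {
  fit <- aov(index ~ treatment, data = df)
  normal      <- shapiro.test(resid(fit))$p.value > .05             # residual normality
  homoscedast <- bartlett.test(index ~ treatment, data = df)$p.value > .05
  if (normal && homoscedast) {
    print(summary(fit))                                             # one-way ANOVA
    print(TukeyHSD(fit))                                            # pairwise comparisons
  } else {
    kt <- kruskal.test(index ~ treatment, data = df)                # non-parametric fallback
    print(kt)
    if (kt$p.value < .05)
      print(pairwise.wilcox.test(df$index, df$treatment, p.adjust.method = "BH"))
  }
}
```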
+
+
+
+ +### Simpson diversity + +
+```{r diversity_analysis2, echo=FALSE} +index <- diversity(raredata, index= "simpson") +simpsondata5 <- as.data.frame(index) +simpsondata5$sample<- data5$sample +simpsondata5_2 <- merge(simpsondata5, metadata, by="sample") + +s <- plot_ly(simpsondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) +#s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) +s <- s %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) +s <- s %>% config(toImageButtonOptions=list(format='svg',filename='SimpsonDiv', height= 500, width= 800, scale= 1)) +s + +if (params$stats == "true" ) { + simpsonaov <- aov(index ~ treatment, data= simpsondata5_2) + st <- shapiro.test(resid(simpsonaov)) + bt <- bartlett.test(index ~ treatment, data= simpsondata5_2) + + if (st$p.value > .05 && bt$p.value > .05) { + print("Shapiro Test of normality - data is normal p-value > 0.05") + print(shapiro.test(resid(simpsonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") + print(bartlett.test(index ~ treatment, data= simpsondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA Results") + print(summary(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 + print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") + print(TukeyHSD(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + } else { + print("Shapiro Test of normality - data is normal if p-value > 0.05") + print(shapiro.test(resid(simpsonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") + print(bartlett.test(index ~ treatment, data= simpsondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("Data either not normal or variance not homogenous") + print("Kruskal-Wallis Test - test significant if p <.05") + #Kruskal-Wallis test - significant p <.05 + mykt <- kruskal.test(index ~ treatment, data= simpsondata5_2) + print(mykt) + writeLines("\n--------------------------------------------------------------\n") + if (mykt$p.value < .05) { + #Pairwise comparison + print("Wilcox.test - pairwise comparison") + print(pairwise.wilcox.test(simpsondata5_2$index, simpsondata5_2$treatment, p.adjust.method = "BH")) + } else { + print("Data not significant. Skipping pairwise comparison") + } + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") + print(summary(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + } +} else { + print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") +} +``` +
+
+
+
+ +### Richness + +
+```{r diversity_analysis3, echo=FALSE} +mind5<-min(data5$sum) +index <- rarefy(data5[,2:cols], sample=mind5) +rarerichnessdata5 <- as.data.frame(index) +rarerichnessdata5$sample <-data5$sample +richdata5_2 <- merge(rarerichnessdata5, metadata, by="sample") + +ri <- plot_ly(richdata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) +#ri <- ri %>% layout(title = list(text="ASV Richness",y=.99)) +ri <- ri %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) +ri <- ri %>% config(toImageButtonOptions=list(format='svg',filename='SpeciesRich', height= 500, width= 800, scale= 1)) +ri + +if (params$stats == "true" ) { + richaov <- aov(index ~ treatment, data= richdata5_2) + st <- shapiro.test(resid(richaov)) + bt <- bartlett.test(index ~ treatment, data= richdata5_2) + + if (st$p.value > .05 && bt$p.value > .05) { + print("Shapiro Test of normality - data is normal p-value > 0.05") + print(shapiro.test(resid(richaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") + print(bartlett.test(index ~ treatment, data= richdata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA Results") + print(summary(richaov)) + #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 + writeLines("\n--------------------------------------------------------------\n") + print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") + print(TukeyHSD(richaov)) + writeLines("\n--------------------------------------------------------------\n") + } else { + print("Shapiro Test of normality - data is normal if p-value > 0.05") + print(shapiro.test(resid(richaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") + print(bartlett.test(index ~ treatment, data= richdata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("Data either not normal or variance not homogenous") + print("Kruskal-Wallis Test - test significant if p <.05") + #Kruskal-Wallis test - significant p <.05 + mykt <- kruskal.test(index ~ treatment, data= richdata5_2) + print(mykt) + writeLines("\n--------------------------------------------------------------\n") + if (mykt$p.value < .05) { + #Pairwise comparison + print("Wilcox.test - pairwise comparison") + print(pairwise.wilcox.test(richdata5_2$index, richdata5_2$treatment, p.adjust.method = "BH")) + } else { + print("Data not significant. Skipping pairwise comparison") + } + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") + print(summary(richaov)) + writeLines("\n--------------------------------------------------------------\n") + } +} else { + print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") +} +``` +
+
+
+
+
+ +

+## Distance To Centroid

+
+```{r distance, echo=FALSE}
+##Distance
+intermediate <- raredata
+bray.distance <- vegdist(sqrt(intermediate), method="bray")
+
+##Dispersion
+disper <- betadisper(bray.distance, group = metadata$treatment, type="centroid")
+df <- data.frame(Distance_to_centroid=disper$distances,Group=disper$group)
+df$sample <- data5$sample
+df2 <- merge(df, metadata, by="sample")
+
+cen <- plot_ly(df2, x=~treatment, y=~Distance_to_centroid, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5)
+#cen <- cen %>% layout(title = list(text="Distance to centroid",y=.99))
+cen <- cen %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Distance"), xaxis=list(title = "Treatment"))
+cen <- cen %>% config(toImageButtonOptions=list(format='svg',filename='Dispersion', height= 500, width= 800, scale= 1))
+cen
+
+if (params$stats == "true" ) {
+  adn <- adonis(bray.distance~data5$treatment)
+  adn
+} else {
+  print("Stats skipped. To toggle on, add \"--stats\" to the vAMPirus launch command")
+}
+```
+
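`betadisper()` measures each sample's distance to its group centroid in Bray-Curtis space (a homogeneity-of-dispersion check), while `adonis()` then tests for treatment effects with PERMANOVA on the same distances. A toy run on random counts, just to show the shape of the two calls (group labels are invented):

```r
library(vegan)
set.seed(2)

comm  <- matrix(rpois(60, lambda = 15), nrow = 6)   # 6 samples x 10 ASVs, invented
group <- factor(rep(c("control", "heat"), each = 3))

d      <- vegdist(sqrt(comm), method = "bray")      # Bray-Curtis on sqrt counts
disper <- betadisper(d, group = group, type = "centroid")
boxplot(disper)                                     # distance to centroid per group
adonis(d ~ group)                                   # PERMANOVA on the same distances
```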
+
+
+
+
+ +

+## NMDS Plots

+ +
+ +### 2D NMDS + +
+```{r nmds2d, echo=FALSE}
+##NMDS
+datax <- decostand(raredata, method="total") #method 'total' normalizes data to sum up to 1 --data5[,2:cols]
+
+MDS <- metaMDS(sqrt(datax),
+          distance = "bray",autotransform = FALSE,
+          k = 2,
+          maxit = 999,
+          trymax = params$try,
+          wascores = TRUE)
+
+if (MDS$converged == "TRUE") {
+
+data.scores <- as.data.frame(scores(MDS))
+data.scores$sample <- data5$sample
+data.scores.2 <- merge(data.scores, metadata, by="sample")
+
+p <- ggplot(data.scores.2, aes(x=NMDS1, y=NMDS2,color=treatment))+
+    geom_point(size=2)+
+    theme_classic()+
+    theme(legend.title = element_blank())
+
+fff <-plot_ly(data.scores.2, x=~NMDS1, y=~NMDS2, color=~treatment, colors=mycol, text = ~paste("Sample: ", sample))
+fff <- fff %>% layout(legend=list(y=.5))
+fff <- fff %>% config(toImageButtonOptions=list(format='svg',filename='2Dnmds', height= 500, width= 800, scale= 1))
+fff
+
+} else {
+print("NMDS did not converge. Printing PCoA")
+#calculate bray curtis distance
+bcdist <- vegdist(datax, "bray")
+res <- pcoa(bcdist)
+#res$values
+#biplot(res) #to make the boring 2D PCoA
+comp <- as.data.frame(res$vectors)
+comp$sample <- data5$sample
+comp2 <- merge(comp, metadata, by="sample")
+fig <- plot_ly(comp2, x = ~Axis.1, y = ~Axis.2, color= ~treatment, text = ~paste("Sample: ", sample,"<br>Treatment: ",treatment))
+fig <- fig %>% layout(xaxis = list(title = "PCoA 1"),
+         yaxis = list(title = "PCoA 2"))
+fig <- fig %>% layout(legend=list(y=.5))
+fig <- fig %>% config(toImageButtonOptions=list(format='svg',filename='2D_PCoA', height= 500, width= 800, scale= 1))
+fig
+}
+```
+
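Whether the report shows an NMDS or falls back to a PCoA hinges entirely on `MDS$converged`; with few samples `metaMDS()` can fail to converge within `trymax` random restarts. A minimal sketch of that branch on invented data, mirroring the report's check:

```r
library(vegan)
library(ape)
set.seed(3)

comm  <- matrix(rpois(80, lambda = 10), nrow = 8)  # 8 samples x 10 ASVs, invented
datax <- decostand(comm, method = "total")

MDS <- metaMDS(sqrt(datax), distance = "bray", k = 2,
               autotransform = FALSE, maxit = 999, trymax = 30)

if (isTRUE(MDS$converged)) {
  head(scores(MDS))                       # NMDS sample coordinates
} else {
  res <- pcoa(vegdist(datax, "bray"))     # unconstrained PCoA fallback (ape)
  head(res$vectors[, 1:2])                # first two principal coordinates
}
```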
+
+
+
+ +### 3D NMDS + +
+
+```{r nmds3d, echo=FALSE}
+MDS3 <- metaMDS(sqrt(datax),
+          distance = "bray",autotransform = FALSE,
+          k = 3,
+          maxit = 999,
+          trymax = params$try,
+          wascores = TRUE)
+
+if (MDS3$converged == "TRUE") {
+
+data.scores3 <- as.data.frame(scores(MDS3))
+data.scores3$sample <- data5$sample
+data.scores.3 <- merge(data.scores3, metadata, by="sample")
+p3d <- plot_ly(data.scores.3, x=~NMDS1, y=~NMDS2, z=~NMDS3, text=~paste("Sample: ", sample),
+        color=~treatment, colors=mycol,
+        mode = 'markers', symbol = ~treatment, symbols = c('square','circle'),
+        marker = list(opacity = .8,line=list(color = 'darkblue',width = 1))
+        )
+p3d <- p3d %>% layout(legend=list(y=.5))
+p3d <- p3d %>% config(toImageButtonOptions=list(format='svg',filename='3Dnmds', height= 500, width= 800, scale= 1))
+p3d
+
+} else {
+print("NMDS did not converge. Printing PCoA")
+#calculate bray curtis distance
+bcdist <- vegdist(datax, "bray")
+res <- pcoa(bcdist)
+comp <- as.data.frame(res$vectors)
+comp$sample <- data5$sample
+comp2 <- merge(comp, metadata, by="sample")
+fig <- plot_ly(comp2, x = ~Axis.1, y = ~Axis.2, z = ~Axis.3, color= ~treatment, text = ~paste("Sample: ", sample,"<br>Treatment: ",treatment))
+# axis titles of 3D plotly figures must be set inside `scene`
+fig <- fig %>% layout(scene = list(xaxis = list(title = "PCoA 1"),
+         yaxis = list(title = "PCoA 2"),
+         zaxis = list(title = "PCoA 3")))
+fig <- fig %>% layout(legend=list(y=.5))
+fig <- fig %>% config(toImageButtonOptions=list(format='svg',filename='3D_PCoA', height= 500, width= 800, scale= 1))
+fig
+}
+```
+
+
+
+
+
+
+ +

+## Relative Abundance Per Sample

+ +```{r long, echo=FALSE} +dataz <- decostand(data5[,2:cols],method="total") #method 'total' normalizes data to sum up to 1 +dataz$sample <- data5$sample +datay <- merge(dataz, metadata, by="sample") +datalong <- datay %>% + tidyr::gather(first:last, key=hit, value=reads) + +ddd <- plot_ly(datalong, x=~sample, y=~reads, color=~hit, colors=mycol) +ddd <- ddd %>% layout(type='bar',barmode = 'stack') +ddd <- ddd %>% layout(legend = list(x=10,y=.5), xaxis=list(title = "Sample"), yaxis=list(title = "Relative abundance")) +ddd <- ddd %>% config(toImageButtonOptions=list(format='svg',filename='Relative_abundance', height= 500, width= 800, scale= 1)) +ddd +``` + + +
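`decostand(..., method = "total")` simply divides each sample's counts by that sample's total, so every stacked bar sums to 1. For example:

```r
library(vegan)

comm <- matrix(c(10, 30, 60,
                  5,  5, 90),
               nrow = 2, byrow = TRUE)   # two samples, three ASVs (invented)
rel <- decostand(comm, method = "total") # per-row proportions
rel
rowSums(rel)                             # 1 1
```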
+
+
+
+
+ +

+## Sequence Abundance Per Treatment

+ +```{r asv_barplot, echo=FALSE} +datalong <- datalong %>% + filter(reads>0) + +asp2 <- plot_ly(datalong, y=~hit, x=~reads, color=~treatment, colors=mycol, text = ~paste("Sample: ", sample), opacity=.9) +asp2 <- asp2 %>% layout(type='bar', barmode = 'group') +asp2 <- asp2 %>% layout(yaxis = list(title = '', categoryorder = "total ascending"), legend = list(x=10,y=.5)) +asp2 <- asp2 %>% config(toImageButtonOptions=list(format='svg',filename='Most_abundant_hits_per_sample', height= 500, width= 800, scale= 1)) +asp2 +``` + +
+
+
+
+
+ +

+## Pairwise Percent Similarity Heatmap

+ +
+
+```{r heatmap, echo=FALSE} +simmatrix<- read.csv("matrix.txt", header=FALSE) +rownames(simmatrix) <- simmatrix[,1] +simmatrix <- simmatrix[,-1] +colnames(simmatrix) <-rownames(simmatrix) +cols <- dim(simmatrix)[2] +simmatrix$AA <- rownames(simmatrix) +rval=nrow(simmatrix) +simmatrix2 <- simmatrix %>% + gather(1:rval, key=sequence, value=similarity) +x=reorder(simmatrix2$AA,simmatrix2$similarity) +y=reorder(simmatrix2$sequence,simmatrix2$similarity) +similaritymatrix <- ggplot(simmatrix2, aes(x=x, y=y,fill=similarity))+ + geom_raster()+ + scale_fill_distiller(palette="Spectral")+ + theme(axis.text.x = element_text(angle = 90))+ + theme(axis.title.x=element_blank())+ + theme(axis.title.y=element_blank()) + +heat <- ggplotly(similaritymatrix) +heat <- heat %>% config(toImageButtonOptions=list(format='svg',filename='heatmap', height= 500, width= 800, scale= 1)) +heat +``` +
+
+
+
+
+ + +

+## Taxonomy Results Visualization

+ +
+
+
+```{r taxonomy, echo=FALSE}
+if (params$skipTaxonomy == "true") {
+  # taxonomy was skipped; print the placeholder note instead
+  writeLines("\n--------------------------------------------------------------\n")
+  cat(readLines(list.files(pattern="tax.txt")) , sep = '\n')
+  writeLines("\n--------------------------------------------------------------\n")
+} else {
+  tax=read.csv("sum.csv",header=F)
+  tp <- plot_ly(tax, labels = ~V1, values = ~V2)
+  tp <- tp %>% add_pie(marker=list(colors=mycol, line=list(color='#000000', width=.5)), hole = 0.6)
+  tp <- tp %>% layout(title = "Taxonomy distribution", showlegend = F,
+          xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
+          yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
+          )
+  tp <- tp %>% config(toImageButtonOptions=list(format='svg',filename='TaxDonut', height= 500, width= 800, scale= 1))
+  tp
+}
+
+```
+
+
+
+
+
+ +

+## Phylogenetic Tree

+ +
+
+
+```{r tree, echo=FALSE}
+if (params$skipPhylogeny == "true") {
+  # phylogeny was skipped; print the placeholder note instead
+  writeLines("\n--------------------------------------------------------------\n")
+  cat(readLines(list.files(pattern="tree.txt")) , sep = '\n')
+  writeLines("\n--------------------------------------------------------------\n")
+} else if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType") && (params$nodeCol == "MED")) {
+  tree=read.tree("tree.txt")
+  p1 <- ggtree(tree)
+  id <- tree$tip.label
+  dat <- tibble::tibble(id = id)
+  metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id'))
+  treelist=read.csv("tree_group.csv")
+  metat2 <- metat %>% dplyr::inner_join(treelist, c('label' = 'OTU_ID'))
+  p2 <- p1 + geom_point(data = metat2, aes(x = x, y = y, label = id, color = GroupID))
+  gg = ggplotly(p2, tooltip = c("label","GroupID"))
+  gg <- gg %>% config(toImageButtonOptions=list(format='svg',filename='Tree_MED_colors', height= 500, width= 800, scale= 1))
+  gg
+} else if ((params$skipTaxonomy == "false") && (params$nodeCol == "TAX")) {
+  taxMap=read.csv("quicker_taxbreakdown.csv", header=T, check.names = F)
+  tree=read.tree("tree.txt")
+  p1 <- ggtree(tree)
+  id <- tree$tip.label
+  dat <- tibble::tibble(id = id)
+  metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id'))
+  colnames(taxMap) <- c("label","tax")
+  taxDF <- merge(metat, taxMap, by="label")
+  p2 <- p1 + geom_point(data = taxDF, aes(x = x, y = y, label = label, color = tax))
+  gg = ggplotly(p2, tooltip = c("label","tax"))
+  gg <- gg %>% config(toImageButtonOptions=list(format='svg',filename='Tree_Tax_colors', height= 500, width= 800, scale= 1))
+  gg
+} else {
+  #tree=read.newick("vAMPrun_otu.55.raxml.support")
+  tree=read.tree("tree.txt")
+  p1 <- ggtree(tree)
+  id <- tree$tip.label
+  dat <- tibble::tibble(id = id)
+  metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id'))
+  p2 <- p1 + geom_point(data = metat, aes(x = x, y = y, label = id, color = id))
+  gg = ggplotly(p2, tooltip = "label")
+  gg <- gg %>% config(toImageButtonOptions=list(format='svg',filename='Tree_id_colors', height= 500, width= 800, scale= 1))
+  gg
+}
+
+```
+This tree is a maximum likelihood tree made with IQ-TREE 2 and the parameters you specified in the vampirus.config file. Also, this is an interactive tree: you can zoom in and hover on nodes to see the sequence ID. For a better visualization of this tree, you can find the *.treefile with bootstrap support values within the results directory and visualize it using programs like FigTree or iTOL. If you ran the MED analysis, the colors of the nodes correspond to the MED group they were assigned to.
+
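All three branches above share one recipe: draw the tree with `ggtree`, join a two-column lookup (tip label to MED group or taxonomy hit) onto the plot's internal data, then overlay colored tip points. A compact toy version of that join (the tree string and taxa are invented):

```r
library(ggtree)
library(ape)
library(dplyr)

tree   <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")
lookup <- data.frame(label = c("A", "B", "C", "D"),
                     tax   = c("Myoviridae", "Myoviridae", "Podoviridae", "Podoviridae"))

p1   <- ggtree(tree)
tips <- p1$data %>% filter(isTip) %>% inner_join(lookup, by = "label")
p1 + geom_point(data = tips, aes(x = x, y = y, color = tax), size = 2)
```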
+
+
+
+
+ +# Post Minimum Entropy Decomposition (MED) Analyses + +

+## MED Group Breakdown Table

+ +
+
+ +```{bash bash_med_table, include=FALSE} +if [[ $(ls | grep -wc "asv_medfile.csv") -eq 1 ]];then + awk -F "," '{print $2}' asv_medfile.csv | sort | uniq | sort -g >group.list + for x in $(cat group.list);do + echo "${x}" >> ${x}.col + grep -w "$x" asv_medfile.csv | awk -F "," '{print $1}' | sort >> ${x}.col + done + paste -d "," *.col > asv_medtable.csv + rm *.col +else + echo "Not MED" +fi + +if [[ $(ls | grep -wc "amino_medfile.csv") -eq 1 ]];then + awk -F "," '{print $2}' amino_medfile.csv | sort | awk -F "Group" '{print $2}' | sort -n | uniq > group.list + for x in $(cat group.list);do + gr="Group"${x}"" + echo "${gr}" > ${x}.col + grep -w "$gr" amino_medfile.csv | awk -F "," '{print $1}' | sort >> ${x}.col + done + paste -d "," *.col > amino_medtable.csv + rm *.col +else + echo "Not MED" +fi +``` + +```{r med_table, echo=FALSE} +if (params$asvMED == "true" && params$type == "ASV"){ + #med_group=read.csv("asv_medfile.csv", header = F) + #colnames(med_group) <- c("SequenceID", "Group", "Sequence") + # change name of sequence to med peak or something + #paged_table(med_group,options = list(rows.print = 20)) + med_group=read.csv("asv_medtable.csv", header = T) + knitr::kable(med_group, digits = 2, align = 'c', booktabs = TRUE, caption = "Table: ASV MED Table") %>% + kable_styling(font_size = 12, full_width = F)%>% + scroll_box(width = "100%", height = "100%") +} + +if (params$aminoMED == "true" && params$type == "AminoType"){ + #med_group=read.csv("amino_medfile.csv", header = F) + #colnames(med_group) <- c("SequenceID", "Group", "Sequence") + #paged_table(med_group,options = list(rows.print = 20)) + med_group=read.csv("amino_medtable.csv", header = T) + knitr::kable(med_group, digits = 2, align = 'c', booktabs = TRUE, caption = "Table: Aminotype MED Table") %>% + kable_styling(font_size = 12, full_width = F) %>% + scroll_box(width = "100%", height = "100%") +} +``` + +
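The bash loop above pivots the long two-column grouping file (sequence ID, MED group) into one column per group, padding ragged columns as needed. For readers more comfortable in R, a hypothetical equivalent of the same pivot (column names assumed from the CSVs above):

```r
library(dplyr)
library(tidyr)

med <- data.frame(SequenceID = c("ASV1", "ASV2", "ASV3", "ASV4"),
                  Group      = c("Group1", "Group1", "Group2", "Group2"))

# One column per MED group, one member per row (shorter groups padded with NA).
med %>%
  group_by(Group) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Group, values_from = SequenceID) %>%
  select(-row)
```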
+
+
+
+
+ +

+## Post-MED Relative Abundance Plot

+ +
+
+ +```{r med_plot, echo=FALSE} +if (params$asvMED == "true" && params$type == "ASV"){ + #medgroup=read.csv("uVP_AminoType_Groupingcounts.csv", check.names=FALSE) + medgroup=read.csv("asv_groupcounts.csv", check.names=FALSE) + meddata <- medgroup %>% select(-2) + meddata2 <- aggregate(.~GroupID, meddata, sum) + meddata3 <-as.data.frame(t(meddata2)) + meddata3$sample <- row.names(meddata3) + colnames(meddata3)<- as.matrix(meddata3[1,]) + as.data.frame(meddata3) + meddata3 <- meddata3[-1,] + meddata3 <- meddata3 %>% + rename(sample=GroupID) + meddata3dim <- dim(meddata3) + ##Loading metadata + samples <- read.csv(sample_metadata, header = TRUE) + #samples <- read.csv("rna_virus_meta.csv", header = TRUE) + ##Combining data and metadata + meddata4 <- merge(meddata3, samples, by="sample") + #if "meddata4" removed it does not work + #meddata4 + dim_meddata4 <- dim(meddata4) + dim_samples <- dim(samples) + cols <- dim_meddata4[2]-dim_samples[2]+1 + first <-colnames(meddata4)[2] + last <- colnames(meddata4)[cols] + meddata4[,2:cols] <- lapply(meddata4[,2:cols], as.character) + meddata4[,2:cols] <- lapply(meddata4[,2:cols], as.numeric) + #Calculate total reads per sample + meddata5 <- meddata4%>% + mutate(sum=select(.,2:cols)%>% + apply(1, sum, na.rm=TRUE)) + #meddata5 + ##Filter samples with low reads + nfil=params$minimumCounts + #nfil=1000 + meddata5 <- meddata5 %>% + filter(sum>nfil) + #can cause errors + meddata5dim <-dim(meddata5) + minreads<-min(meddata5$sum) + + dataz <- decostand(meddata5[,2:cols],method="total") #method 'total' normalizes data to sum up to 1 + dataz$sample <- meddata5$sample + metadata <- meddata5[,(cols+1):meddata5dim[2]] + metadata$sample <- meddata5$sample + datay <- merge(dataz, metadata, by="sample") + datalong <- datay %>% + tidyr::gather(first:last, key=hit, value=reads) + ddd <- plot_ly(datalong, x=~sample, y=~reads, color=~hit, colors=mycol, type='bar') + ddd <- ddd %>% layout(barmode = 'stack') + ddd <- ddd %>% layout(legend = list(x=10,y=.5), xaxis=list(title = "Sample"), yaxis=list(title = "Relative abundance")) + ddd <- ddd %>% config(toImageButtonOptions=list(format='svg',filename='Relative_abundance', height= 500, width= 800, scale= 1)) + ddd +} +if (params$aminoMED == "true" && params$type == "AminoType"){ + #medgroup=read.csv("uVP_AminoType_Groupingcounts.csv") + medgroup=read.csv("amino_groupcounts.csv", check.names=FALSE) + meddata <- medgroup %>% select(-2) + meddata2 <- aggregate(.~GroupID, meddata, sum) + meddata3 <-as.data.frame(t(meddata2)) + meddata3$sample <- row.names(meddata3) + colnames(meddata3)<- as.matrix(meddata3[1,]) + as.data.frame(meddata3) + meddata3 <- meddata3[-1,] + meddata3 <- meddata3 %>% + rename(sample=GroupID) + meddata3dim <- dim(meddata3) + ##Loading metadata + samples <- read.csv(sample_metadata, header = TRUE) + ##Combining data and metadata + meddata4 <- merge(meddata3, samples, by="sample") + #if "meddata4" removed it does not work + #meddata4 + dim_meddata4 <- dim(meddata4) + dim_samples <- dim(samples) + cols <- dim_meddata4[2]-dim_samples[2]+1 + first <-colnames(meddata4)[2] + last <- colnames(meddata4)[cols] + meddata4[,2:cols] <- lapply(meddata4[,2:cols], as.character) + meddata4[,2:cols] <- lapply(meddata4[,2:cols], as.numeric) + #Calculate total reads per sample + meddata5 <- meddata4%>% + mutate(sum=select(.,2:cols)%>% + apply(1, sum, na.rm=TRUE)) + #meddata5 + ##Filter samples with low reads + nfil=params$minimumCounts + #nfil=1000 + meddata5 <- meddata5 %>% + filter(sum>nfil) + #can cause errors + meddata5dim 
<-dim(meddata5) + minreads<-min(meddata5$sum) + + dataz <- decostand(meddata5[,2:cols],method="total") #method 'total' normalizes data to sum up to 1 + dataz$sample <- meddata5$sample + metadata <- meddata5[,(cols+1):meddata5dim[2]] + metadata$sample <- meddata5$sample + datay <- merge(dataz, metadata, by="sample") + datalong <- datay %>% + tidyr::gather(first:last, key=hit, value=reads) + ddd <- plot_ly(datalong, x=~sample, y=~reads, color=~hit, colors=mycol, type='bar') + ddd <- ddd %>% layout(barmode = 'stack') + ddd <- ddd %>% layout(legend = list(x=10,y=.5), xaxis=list(title = "Sample"), yaxis=list(title = "Relative abundance")) + ddd <- ddd %>% config(toImageButtonOptions=list(format='svg',filename='Relative_abundance', height= 500, width= 800, scale= 1)) + ddd +} +``` + +
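The pivotal step in both branches above is collapsing the per-ASV (or per-AminoType) counts to MED-group totals with `aggregate()` before the usual transpose/merge/normalize routine. In isolation, with invented counts:

```r
meddata <- data.frame(GroupID = c("Group1", "Group1", "Group2"),
                      S1 = c(5, 3, 10),
                      S2 = c(0, 2, 7))

# Sum every sample column within each MED group.
aggregate(. ~ GroupID, meddata, sum)
#   GroupID S1 S2
# 1  Group1  8  2
# 2  Group2 10  7
```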
+
+
+
+
+ +

+## Post-MED Rarefaction Curves

+ +
+
+ +```{r med_rarefaction, echo=FALSE, cache=FALSE} +if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){ +##Rarefaction curves +rarefaction <- rarecurve(meddata5[,2:cols]) + +##rarefied dataset +raredata <- as.data.frame(rrarefy(meddata5[,2:cols], sample=minreads)) +} +``` + +
+
+
+
+
+
+ +

+## Post-MED Diversity Analyses Plots

+ +
+
+ +### Post-MED Shannon diversity + +
+```{r med_diversity_analysis1a, echo=FALSE} +if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){ + metadata <- meddata5[,(cols+1):meddata5dim[2]] + metadata$sample <- meddata5$sample + index <-diversity(raredata, index= "shannon") + shannondata5 <- as.data.frame(index) + shannondata5$sample<- meddata5$sample + shannondata5_2 <- merge(shannondata5, metadata, by="sample") + + sh <- plot_ly(shannondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) + #sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) + sh <- sh %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) + sh <- sh %>% config(toImageButtonOptions=list(format='svg',filename='ShannonDiv', height= 500, width= 800, scale= 1)) + sh +} +``` + +```{r med_diversity_analysis1b, echo=FALSE} +#divided because error +if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){ + if (params$stats == "true" ) { + shannonaov <- aov(index ~ treatment, data= shannondata5_2) + st <- shapiro.test(resid(shannonaov)) + bt <- bartlett.test(index ~ treatment, data= shannondata5_2) + + if (st$p.value > .05 && bt$p.value > .05) { + print("Shapiro Test of normality - data is normal p-value > 0.05") + print(shapiro.test(resid(shannonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") + print(bartlett.test(index ~ treatment, data= shannondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA Results") + print(summary(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 + print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") + print(TukeyHSD(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + } else { + print("Shapiro Test of normality - data is normal if p-value > 0.05") + print(shapiro.test(resid(shannonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") + print(bartlett.test(index ~ treatment, data= shannondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("Data either not normal or variance not homogenous") + print("Kruskal-Wallis Test - test significant if p <.05") + #Kruskal-Wallis test - significant p <.05 + mykt <- kruskal.test(index ~ treatment, data= shannondata5_2) + print(mykt) + writeLines("\n--------------------------------------------------------------\n") + if (mykt$p.value < .05) { + #Pairwise comparison + print("Wilcox.test - pairwise comparison") + print(pairwise.wilcox.test(shannondata5_2$index, shannondata5_2$treatment, p.adjust.method = "BH")) + } else { + print("Data not significant. Skipping pairwise comparison") + } + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") + print(summary(shannonaov)) + writeLines("\n--------------------------------------------------------------\n") + } + } else { + print("Stats skipped. 
To toggle on use \"--stats run\" in vAMPirus launch command") + } +} +``` +
+
+
+
+ +### Post-MED Simpson diversity + +
+```{r med_diversity_analysis2a, echo=FALSE} +if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){ + index <- diversity(raredata, index= "simpson") + simpsondata5 <- as.data.frame(index) + simpsondata5$sample<- meddata5$sample + simpsondata5_2 <- merge(simpsondata5, metadata, by="sample") + + s <- plot_ly(simpsondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) + #s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) + s <- s %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) + s <- s %>% config(toImageButtonOptions=list(format='svg',filename='SimpsonDiv', height= 500, width= 800, scale= 1)) + s +} +``` + +```{r med_diversity_analysis2b, echo=FALSE} +if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){ + if (params$stats == "true" ) { + simpsonaov <- aov(index ~ treatment, data= simpsondata5_2) + st <- shapiro.test(resid(simpsonaov)) + bt <- bartlett.test(index ~ treatment, data= simpsondata5_2) + + if (st$p.value > .05 && bt$p.value > .05) { + print("Shapiro Test of normality - data is normal p-value > 0.05") + print(shapiro.test(resid(simpsonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") + print(bartlett.test(index ~ treatment, data= simpsondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA Results") + print(summary(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 + print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") + print(TukeyHSD(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + } else { + print("Shapiro Test of normality - data is normal if p-value > 0.05") + print(shapiro.test(resid(simpsonaov))) + writeLines("\n--------------------------------------------------------------\n") + print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") + print(bartlett.test(index ~ treatment, data= simpsondata5_2)) + writeLines("\n--------------------------------------------------------------\n") + print("Data either not normal or variance not homogenous") + print("Kruskal-Wallis Test - test significant if p <.05") + #Kruskal-Wallis test - significant p <.05 + mykt <- kruskal.test(index ~ treatment, data= simpsondata5_2) + print(mykt) + writeLines("\n--------------------------------------------------------------\n") + if (mykt$p.value < .05) { + #Pairwise comparison + print("Wilcox.test - pairwise comparison") + print(pairwise.wilcox.test(simpsondata5_2$index, simpsondata5_2$treatment, p.adjust.method = "BH")) + } else { + print("Data not significant. Skipping pairwise comparison") + } + writeLines("\n--------------------------------------------------------------\n") + print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") + print(summary(simpsonaov)) + writeLines("\n--------------------------------------------------------------\n") + } + } else { + print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") + } +} +``` +
+
+
+
+
+### Post-MED Richness
+
+```{r med_diversity_analysis3a, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  mind5 <- min(meddata5$sum)
+  index <- rarefy(meddata5[,2:cols], sample=mind5)
+  rarerichnessdata5 <- as.data.frame(index)
+  rarerichnessdata5$sample <- meddata5$sample
+  richdata5_2 <- merge(rarerichnessdata5, metadata, by="sample")
+
+  ri <- plot_ly(richdata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5)
+  #ri <- ri %>% layout(title = list(text="ASV Richness",y=.99))
+  ri <- ri %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment"))
+  ri <- ri %>% config(toImageButtonOptions=list(format='svg',filename='SpeciesRich', height= 500, width= 800, scale= 1))
+  ri
+}
+```
+
+```{r med_diversity_analysis3b, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  if (params$stats == "true" ) {
+    richaov <- aov(index ~ treatment, data= richdata5_2)
+    st <- shapiro.test(resid(richaov))
+    bt <- bartlett.test(index ~ treatment, data= richdata5_2)
+
+    if (st$p.value > .05 && bt$p.value > .05) {
+      print("Shapiro test of normality - data are normal if p-value > 0.05")
+      print(st)
+      writeLines("\n--------------------------------------------------------------\n")
+      print("Bartlett test of variance homogeneity - variance is homogeneous if p-value > 0.05")
+      print(bt)
+      writeLines("\n--------------------------------------------------------------\n")
+      print("ANOVA Results")
+      print(summary(richaov))
+      writeLines("\n--------------------------------------------------------------\n")
+      #Tukey Honest Significant Differences (pairwise comparison) - significant at p < .05
+      print("Tukey HSD - pairwise comparison - significant differences indicated by p-value < 0.05")
+      print(TukeyHSD(richaov))
+      writeLines("\n--------------------------------------------------------------\n")
+    } else {
+      print("Shapiro test of normality - data are normal if p-value > 0.05")
+      print(st)
+      writeLines("\n--------------------------------------------------------------\n")
+      print("Bartlett test of variance homogeneity - variance is homogeneous if p-value > 0.05")
+      print(bt)
+      writeLines("\n--------------------------------------------------------------\n")
+      print("Data either not normal or variance not homogeneous")
+      print("Kruskal-Wallis test - significant if p < .05")
+      mykt <- kruskal.test(index ~ treatment, data= richdata5_2)
+      print(mykt)
+      writeLines("\n--------------------------------------------------------------\n")
+      if (mykt$p.value < .05) {
+        #Pairwise comparison
+        print("Wilcoxon test - pairwise comparison")
+        print(pairwise.wilcox.test(richdata5_2$index, richdata5_2$treatment, p.adjust.method = "BH"))
+      } else {
+        print("Kruskal-Wallis not significant. Skipping pairwise comparison")
+      }
+      writeLines("\n--------------------------------------------------------------\n")
+      print("ANOVA - one or more of the assumptions not met, take with a grain of salt.")
+      print(summary(richaov))
+      writeLines("\n--------------------------------------------------------------\n")
+    }
+  } else {
+    print("Stats skipped. To turn on, add \"--stats\" to the vAMPirus launch command")
+  }
+}
+```
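Note that `rarefy()` here does something different from the `rrarefy()` call used to build `raredata`: it returns the *expected* richness of each sample if it were subsampled to a common depth, rather than an actual randomly subsampled count table. A small sketch on a made-up count matrix (sample and ASV names are hypothetical):

```r
library(vegan)

# Hypothetical 4-sample x 6-ASV count table with unequal library sizes
set.seed(1)
counts <- matrix(rpois(24, lambda = 20), nrow = 4,
                 dimnames = list(paste0("S", 1:4), paste0("ASV", 1:6)))

# Expected richness if every sample were rarefied to the shallowest depth,
# so richness is not confounded by sequencing effort
rarefy(counts, sample = min(rowSums(counts)))
```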
+
+
+
+
+ +

+## Post-MED Distance To Centroid

+
+```{r med_distance2a, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  ##Distance
+  intermediate <- raredata
+  bray.distance <- vegdist(sqrt(intermediate), method="bray")
+
+  ##Dispersion
+  disper <- betadisper(bray.distance, group = metadata$treatment, type="centroid")
+  df <- data.frame(Distance_to_centroid=disper$distances,Group=disper$group)
+  df$sample <- meddata5$sample
+  df2 <- merge(df, metadata, by="sample")
+
+  cen <- plot_ly(df2, x=~treatment, y=~Distance_to_centroid, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5)
+  cen <- cen %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Distance"), xaxis=list(title = "Treatment"))
+  cen <- cen %>% config(toImageButtonOptions=list(format='svg',filename='Dispersion', height= 500, width= 800, scale= 1))
+  cen
+}
+```
+```{r med_distance2b, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  if (params$stats == "true" ) {
+    adn <- adonis(bray.distance~meddata5$treatment)
+    adn
+  } else {
+    print("Stats skipped. To turn on, add \"--stats\" to the vAMPirus launch command")
+  }
+}
+```
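Two different questions are being asked in these chunks: `betadisper()` measures multivariate *dispersion* (how far samples sit from their group centroid), while the PERMANOVA tests for differences in group *location*. Because PERMANOVA is sensitive to unequal dispersion, the two results are best read together. A toy sketch of the pair; note that `adonis()` has been superseded by `adonis2()` in current vegan releases, so the call below assumes vegan >= 2.6, and all data and names here are fabricated:

```r
library(vegan)

set.seed(7)
comm <- matrix(rpois(60, lambda = 10), nrow = 6,
               dimnames = list(paste0("S", 1:6), paste0("ASV", 1:10)))
grp <- factor(rep(c("control", "heat"), each = 3))

# Square-root transform tempers the influence of very abundant taxa,
# matching the report's vegdist(sqrt(...)) call
d <- vegdist(sqrt(comm), method = "bray")

disp <- betadisper(d, group = grp, type = "centroid")
anova(disp)      # H0: groups are equally dispersed around their centroids

adonis2(d ~ grp) # PERMANOVA: H0: no difference in group location
```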
+
+
+
+
+ +

+## NMDS Plots

+ +
+
+### Post-MED 2D NMDS
+
+```{r med_nmds2d, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  ##NMDS
+  datax <- decostand(raredata,method="total") #method 'total' normalizes data to sum up to 1
+
+  MDS <- metaMDS(sqrt(datax),
+                 distance = "bray",autotransform = FALSE,
+                 k = 2,
+                 maxit = 999,
+                 trymax = params$try,
+                 wascores = TRUE)
+
+  if (MDS$converged == "TRUE") {
+
+    data.scores <- as.data.frame(scores(MDS))
+    data.scores$sample <- meddata5$sample
+    data.scores.2 <- merge(data.scores, metadata, by="sample")
+
+    fff <- plot_ly(data.scores.2, x=~NMDS1, y=~NMDS2, color=~treatment, colors=mycol, text = ~paste("Sample: ", sample))
+    fff <- fff %>% layout(legend=list(y=.5))
+    fff <- fff %>% config(toImageButtonOptions=list(format='svg',filename='2Dnmds', height= 500, width= 800, scale= 1))
+    fff
+
+  } else {
+    print("NMDS did not converge. Plotting PCoA instead")
+    #calculate Bray-Curtis distance
+    bcdist <- vegdist(datax, "bray")
+    res <- pcoa(bcdist)
+    comp <- as.data.frame(res$vectors)
+    comp$sample <- meddata5$sample
+    comp2 <- merge(comp, metadata, by="sample")
+    fig <- plot_ly(comp2, x = ~Axis.1, y = ~Axis.2, color= ~treatment, text = ~paste("Sample: ", sample,"<br>Treatment: ",treatment))
+    fig <- fig %>% layout(xaxis = list(title = "PCoA 1"),
+                          yaxis = list(title = "PCoA 2"))
+    fig <- fig %>% layout(legend=list(y=.5))
+    fig <- fig %>% config(toImageButtonOptions=list(format='svg',filename='2D_PostMED_PCoA', height= 500, width= 800, scale= 1))
+    fig
+  }
+}
+```
+
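This convergence guard is what drives the PCoA fallback added in version 2.0.0: `metaMDS()` records whether any of its random starts converged, and when none do, the report switches to a metric ordination of the same Bray-Curtis distances instead of displaying an unreliable NMDS. A condensed sketch of that control flow on simulated counts (object names are hypothetical, and the tiny toy matrix may trigger vegan warnings):

```r
library(vegan)
library(ape)

set.seed(11)
comm <- matrix(rpois(80, lambda = 8), nrow = 8,
               dimnames = list(paste0("S", 1:8), paste0("ASV", 1:10)))
rel <- decostand(comm, method = "total")   # rows now sum to 1

mds <- metaMDS(sqrt(rel), distance = "bray", k = 2, autotransform = FALSE,
               maxit = 999, trymax = 50, wascores = TRUE)

if (isTRUE(mds$converged)) {
  # Stable NMDS: use the sample (site) scores for plotting
  head(as.data.frame(scores(mds, display = "sites")))
} else {
  # Fallback mirroring the report: PCoA on the same Bray-Curtis matrix
  res <- pcoa(vegdist(rel, method = "bray"))
  head(res$vectors[, 1:2])                 # Axis.1 / Axis.2 per sample
}
```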
+
+
+
+
+### Post-MED 3D NMDS
+
+
+```{r med_nmds3d, echo=FALSE}
+if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")){
+  MDS3 <- metaMDS(sqrt(datax),
+                  distance = "bray",autotransform = FALSE,
+                  k = 3,
+                  maxit = 999,
+                  trymax = params$try,
+                  wascores = TRUE)
+
+  if (MDS3$converged == "TRUE") {
+
+    data.scores3 <- as.data.frame(scores(MDS3))
+    data.scores3$sample <- meddata5$sample
+    data.scores.3 <- merge(data.scores3, metadata, by="sample")
+    p3d <- plot_ly(data.scores.3, x=~NMDS1, y=~NMDS2, z=~NMDS3, text=~paste("Sample: ", sample),
+                   color=~treatment, colors=mycol,
+                   mode = 'markers', symbol = ~treatment, symbols = c('square','circle'),
+                   marker = list(opacity = .8,line=list(color = 'darkblue',width = 1))
+    )
+    p3d <- p3d %>% layout(legend=list(y=.5))
+    p3d <- p3d %>% config(toImageButtonOptions=list(format='svg',filename='3Dnmds', height= 500, width= 800, scale= 1))
+    p3d
+
+  } else {
+    print("NMDS did not converge. Plotting PCoA instead")
+    #calculate Bray-Curtis distance
+    bcdist <- vegdist(datax, "bray")
+    res <- pcoa(bcdist)
+    comp <- as.data.frame(res$vectors)
+    comp$sample <- meddata5$sample
+    comp2 <- merge(comp, metadata, by="sample")
+    fig <- plot_ly(comp2, x = ~Axis.1, y = ~Axis.2, z = ~Axis.3, color= ~treatment, text = ~paste("Sample: ", sample,"<br>Treatment: ",treatment))
+    #axis titles for a 3-D plot must be set inside 'scene'
+    fig <- fig %>% layout(scene = list(xaxis = list(title = "PCoA 1"),
+                                       yaxis = list(title = "PCoA 2"),
+                                       zaxis = list(title = "PCoA 3")))
+    fig <- fig %>% layout(legend=list(y=.5))
+    fig <- fig %>% config(toImageButtonOptions=list(format='svg',filename='3D_PCoA', height= 500, width= 800, scale= 1))
+    fig
+  }
+}
+```
+
+
+
+
+ +

+## MED Group Representatives Phylogenetic Tree

+ +
+
+
+```{r med_tree, echo=FALSE}
+if (params$skipPhylogeny == "true") {
+  #phylogeny was skipped, so print the placeholder message from tree.txt
+  writeLines("\n--------------------------------------------------------------\n")
+  cat(readLines(list.files(pattern="tree.txt")), sep = '\n')
+  writeLines("\n--------------------------------------------------------------\n")
+} else if ((params$asvMED == "true" || params$aminoMED == "true") && (params$type == "ASV" || params$type == "AminoType")) {
+  #tree=read.newick("vAMPrun_otu.55.raxml.support")
+  tree=read.tree("grouptree.txt")
+  p1 <- ggtree(tree)
+  id <- tree$tip.label
+  dat <- tibble::tibble(id = id)
+  metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id'))
+
+  #join the MED group assignments so tips can be colored by group
+  treelist=read.csv("tree_group.csv")
+  colnames(treelist) <- c("GroupID","ID")
+  metat2 <- metat %>% dplyr::inner_join(treelist, c('label' = 'GroupID'))
+
+  p2 <- p1 + geom_point(data = metat2, aes(x = x, y = y, label = ID, color = label))
+  ggplotly(p2, tooltip = "label")
+}
+```
+This is a maximum likelihood tree built with IQ-TREE 2 from the representative sequences of the MED groups formed in the oligotyping analysis, using the parameters you specified in the vampirus.config file. The tree is interactive: you can zoom and hover over nodes to see sequence IDs. For a higher-quality visualization, the *.treefile with bootstrap support values can be found in the results directory and opened with programs like FigTree or iTOL.
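The tree chunk colors tips by joining an external group table onto the `ggtree` plot data. ggtree also provides the `%<+%` operator for exactly this attach-annotation-then-map-aesthetics pattern, which avoids touching `p1$data` directly; a minimal sketch on a random tree, with a hypothetical annotation table standing in for tree_group.csv:

```r
library(ape)
library(ggtree)   # Bioconductor package
library(ggplot2)

set.seed(3)
tree <- rtree(6)  # random 6-tip tree in place of the MED representative tree

# Hypothetical annotation: the first column must match the tip labels
ann <- data.frame(label   = tree$tip.label,
                  GroupID = rep(c("MED1", "MED2"), length.out = 6))

# %<+% attaches the table to the plot data, so aesthetics can use its columns
p <- ggtree(tree) %<+% ann +
  geom_tippoint(aes(color = GroupID), size = 2) +
  geom_tiplab(size = 3)
p
```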
+
+
+
+
+
+
+
+
diff --git a/bin/vAMPirus_ReportA.Rmd b/bin/vAMPirus_ReportA.Rmd deleted file mode 100644 index a3bf8f8..0000000 --- a/bin/vAMPirus_ReportA.Rmd +++ /dev/null @@ -1,705 +0,0 @@ ---- -title: "vAMPirus Analyze Report `r commandArgs(trailingOnly=T)[1]`" -date: "Generated on: `r Sys.time()`" -output: html_document -params: - interactive: TRUE - reads: !r commandArgs(trailingOnly=T)[2] # reads - counts: !r commandArgs(trailingOnly=T)[3] # csv - metadata: !r commandArgs(trailingOnly=T)[4] # metadata - filter: !r commandArgs(trailingOnly=T)[5] # filter min counts - heatmap: !r commandArgs(trailingOnly=T)[6] # heatmap - tax: !r commandArgs(trailingOnly=T)[7] # tax - try: !r commandArgs(trailingOnly=T)[8] # trymax - stats: !r commandArgs(trailingOnly=T)[9] # stats ---- - - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE, - message = FALSE, - warning = FALSE, - out.width="100%") -``` - -```{r pathways, echo=FALSE} -knitr::include_graphics("vamplogo.png") -``` - -```{r load_libraries, include=FALSE} - -library(BiocManager) -library(vegan) -#library(rstatix) -library(tidyverse) -library(scales) -library(cowplot) -library(dplyr) -library(ggtree) -library(plotly) -#library(BiocParallel) -library(knitr) -library(kableExtra) #install.packages("kableExtra") -library(rmarkdown) -library(processx) #install.packages("processx") -#register(MulticoreParam(4)) -``` - -```{r colors, include=FALSE} -mycol=c('#088da5','#73cdc8','#ff6f61','#7cb8df','#88b04b','#00a199','#6B5B95','#92A8D1','#b0e0e6','#ff7f50','#088d9b','#E15D44','#e19336') -``` -
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -NOTE: Most plots are interactive and you can use the legend to specify samples/treatment of interest. You can also download an .svg version of each figure within this report. ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- - -
-

-## Pre- and Post-Adapter Removal Read Stats

-
-```{r readstats, echo=FALSE} -reads_stats=read.csv(params$reads) -#reads_stats=read.csv("PVID_final_reads_stats.csv") -paged_table(reads_stats,options = list(rows.print = 20)) - -#test plotly table -fig <- plot_ly( - type = 'table', - header = list(values=names(reads_stats), - align = c('left'), - line = list(width = 1, color = 'black'), - fill = list(color = 'rgb(45, 100, 230)')), - cells = list(values=unname(reads_stats), - line = list(width = 1, color = 'black'), - fill = list(color = 'white'), - align = c('center'), - font = list(color = '#506784', size = 10)) - ) - -#fig -``` -
-
- - -### Total number of reads before and after adapter removal - -```{r readstats_plot, echo=FALSE} -# Plot of reads before and after -ptotal <- plot_ly(typle="box",marker=list(colors=mycol)) -ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_before, name="Reads before filtering") -ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_after, name="Reads after filtering") -#ptotal <- ptotal %>% layout(title=list(text="Number of reads before and after filtering")) -ptotal <- ptotal %>% layout(legend = list(x=10,y=.5)) -ptotal <- ptotal %>% config(toImageButtonOptions=list(format='svg',filename='TotReads_b4_af_adaptrem', height= 500, width= 800, scale= 1)) -ptotal -``` -
- -### Forward (R1) and reverse (R2) read length before and after adapter removal - -```{r readstats_plot2, echo=FALSE} -# Plot of R1 and R2 before and after -pr <- plot_ly(y=reads_stats$R1_before_length, type="box", name="R1 length before") -pr <- pr %>% add_boxplot(y=reads_stats$R1_after_length, name="R1 length after") -pr <- pr %>% add_boxplot(y=reads_stats$R2_before_length, name="R2 length before") -pr <- pr %>% add_boxplot(y=reads_stats$R2_after_length, name="R2 length after") -#pr <- pr %>% layout(title = "R1 and R2 Length") -pr <- pr %>% layout(legend = list(x=10,y=.5)) -pr <- pr %>% config(toImageButtonOptions=list(format='svg',filename='readlen_b4_af_adaptrem', height= 500, width= 800, scale= 1)) -pr -``` -
-
-
-```{r load_datasets, include=FALSE} -sample_name=params$counts -sample_metadata=params$metadata -#sample_name="vAMPset_nOTU.90_counts.csv" -#sample_metadata="meta.csv" -data<- read.csv(sample_name, check.names=FALSE) -data2 <-as.data.frame(t(data)) -data2$sample <- row.names(data2) -colnames(data2)<- as.matrix(data2[1,]) -as.data.frame(data2) -data2 <- data2[-1,] - -#X.OTU.ID for X.Sequence. -data2 <- data2 %>% - rename(sample=OTU_ID) -data2dim <- dim(data2) - -##Loading metadata -samples <- read.csv(sample_metadata, header=TRUE) - -##Combining data and metadata -data3 <- merge(data2, samples, by="sample") - -dim_data3 <- dim(data3) -dim_samples <- dim(samples) -cols <- dim_data3[2]-dim_samples[2]+1 -first <-colnames(data3)[2] -last <- colnames(data3)[cols] -data3[,2:cols] <- lapply(data3[,2:cols], as.character) -data3[,2:cols] <- lapply(data3[,2:cols], as.numeric) - -#Calculate total reads per sample -data4 <- data3%>% - mutate(sum=select(.,2:cols)%>% - apply(1, sum, na.rm=TRUE)) - -``` - - -
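This `load_datasets` chunk performs the central reshaping for the whole report: the counts CSV arrives with one row per OTU and one column per sample, so it is transposed to sample-per-row, the `OTU_ID` field becomes the `sample` key, metadata is merged in, and per-sample read totals are computed. The same transformation in a compact, self-contained form (the toy tables are hypothetical stand-ins for `params$counts` and `params$metadata`):

```r
library(dplyr)

counts <- data.frame(OTU_ID = c("ASV1", "ASV2", "ASV3"),
                     S1 = c(10, 0, 5),
                     S2 = c(3, 8, 1))
meta <- data.frame(sample    = c("S1", "S2"),
                   treatment = c("control", "heat"))

# Transpose so samples become rows and OTUs become columns
m <- t(as.matrix(counts[, -1]))
colnames(m) <- counts$OTU_ID
wide <- data.frame(sample = rownames(m), m, check.names = FALSE)

# Merge metadata and add the per-sample read total the report calls 'sum'
merged <- merge(wide, meta, by = "sample") %>%
  mutate(sum = rowSums(across(ASV1:ASV3)))
merged
```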

-## Number of Reads Per Sample

- -```{r plot, echo=FALSE} -# sample and count -con <- plot_ly(data4, x = ~sum, y = ~sample, name = "Sample", type = 'scatter', - mode = "markers", marker = list(color = "#088da5"), hovertemplate = paste('Sample: %{y}','
Total reads: %{x}','')) -con <- con %>% layout(xaxis = list(title = "Total reads"), yaxis = list(title = "Sample")) -con <- con %>% config(toImageButtonOptions=list(format='svg',filename='Counts_per_sample', height= 500, width= 800, scale= 1)) -con -``` -
-
-
-```{r filter_data, include=FALSE} -##Filter samples with low reads -nfil=params$filter -#nfil=1000 -data5 <- data4 %>% - filter(sum>nfil) - #can cause errors -data5dim <-dim(data5) -minreads<-min(data5$sum) -``` -
-
-
-
-
- -

-## Rarefaction

-```{r rarefaction, echo=FALSE, cache=FALSE} -##Rarefaction curves -rarefaction <- rarecurve(data5[,2:cols]) - -##rarefied dataset -raredata <- as.data.frame(rrarefy(data5[,2:cols], sample=minreads)) -``` -
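Two distinct vegan tools appear in this short chunk: `rarecurve()` draws richness as a function of subsampled depth (curves that plateau suggest adequate sequencing), while `rrarefy()` draws one random subsample per sample down to the shallowest library, producing the `raredata` table the diversity chunks consume. A quick sketch on fabricated counts:

```r
library(vegan)

set.seed(5)
counts <- matrix(rpois(40, lambda = 15), nrow = 4,
                 dimnames = list(paste0("S", 1:4), paste0("ASV", 1:10)))

rarecurve(counts, step = 5)   # rarefaction curves, one per sample

raredata <- rrarefy(counts, sample = min(rowSums(counts)))
rowSums(raredata)             # every sample now has the same (minimum) depth
```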
-
-
-
-
- - -

-## Diversity Analyses Plots

- -
-
- -### Shannon diversity - -
-```{r diversity_analysis, echo=FALSE} -metadata <- data5[,(cols+1):data5dim[2]] -metadata$sample <- data5$sample -index <-diversity(raredata, index= "shannon") -shannondata5 <- as.data.frame(index) -shannondata5$sample<- data5$sample -shannondata5_2 <- merge(shannondata5, metadata, by="sample") - -#shannonplot <- ggplot(shannondata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Index",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#sh<-ggplotly(shannonplot) -#sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) -#sh - -sh <- plot_ly(shannondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) -sh <- sh %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -sh <- sh %>% config(toImageButtonOptions=list(format='svg',filename='ShannonDiv', height= 500, width= 800, scale= 1)) -sh - -if (params$stats == "true" ) { - shannonaov <- aov(index ~ treatment, data= shannondata5_2) - st <- shapiro.test(resid(shannonaov)) - bt <- bartlett.test(index ~ treatment, data= shannondata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(shannonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= shannondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(shannonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= shannondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= shannondata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(shannondata5_2$index, shannondata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. 
Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
-### Simpson diversity
-
-```{r diversity_analysis2, echo=FALSE} -index <- diversity(raredata, index= "simpson") -simpsondata5 <- as.data.frame(index) -simpsondata5$sample<- data5$sample -simpsondata5_2 <- merge(simpsondata5, metadata, by="sample") - -#simpsonplot <- ggplot(simpsondata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Index",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#s<-ggplotly(simpsonplot) -#s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) -#s - -s <- plot_ly(simpsondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) -s <- s %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -s <- s %>% config(toImageButtonOptions=list(format='svg',filename='SimpsonDiv', height= 500, width= 800, scale= 1)) -s - -if (params$stats == "true" ) { - simpsonaov <- aov(index ~ treatment, data= simpsondata5_2) - st <- shapiro.test(resid(simpsonaov)) - bt <- bartlett.test(index ~ treatment, data= simpsondata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(simpsonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= simpsondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(simpsonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= simpsondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= simpsondata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(simpsondata5_2$index, simpsondata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
- -### Species Richness - -
-```{r diversity_analysis3, echo=FALSE} -mind5<-min(data5$sum) -index <- rarefy(data5[,2:cols], sample=mind5) -rarerichnessdata5 <- as.data.frame(index) -rarerichnessdata5$sample <-data5$sample -richdata5_2 <- merge(rarerichnessdata5, metadata, by="sample") - -#richnessplot <- ggplot(richdata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Richness",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -ri <- plot_ly(richdata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#ri <- ri %>% layout(title = list(text="ASV Richness",y=.99)) -ri <- ri %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -ri <- ri %>% config(toImageButtonOptions=list(format='svg',filename='SpeciesRich', height= 500, width= 800, scale= 1)) -ri - -if (params$stats == "true" ) { - richaov <- aov(index ~ treatment, data= richdata5_2) - st <- shapiro.test(resid(richaov)) - bt <- bartlett.test(index ~ treatment, data= richdata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(richaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= richdata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(richaov)) - writeLines("\n--------------------------------------------------------------\n") - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(richaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(richaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= richdata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= richdata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(richdata5_2$index, richdata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(richaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
- -

-## Distance To Centroid

-
-```{r distance, echo=FALSE} -##Distance -intermediate <- raredata -bray.distance <- vegdist(intermediate, method="bray",autotransform=TRUE) - -##Dispersion -disper <- betadisper(bray.distance, group = metadata$treatment, type="centroid") -#anova(disper) -df <- data.frame(Distance_to_centroid=disper$distances,Group=disper$group) -df$sample <- data5$sample -df2 <- merge(df, metadata, by="sample") - -#p<- ggplot(data=df2,aes(x=treatment,y=Distance_to_centroid,colour=treatment))+ -# geom_boxplot(outlier.alpha = 0)+ -# theme_classic()+ -# geom_point(position=position_dodge(width=0.75))+ -# labs(y="Distance",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#cen <- ggplotly(p) -#cen <- cen %>% layout(title = list(text="Distance to centroid",y=.99)) -#cen - -cen <- plot_ly(df2, x=~treatment, y=~Distance_to_centroid, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#cen <- cen %>% layout(title = list(text="Distance to centroid",y=.99)) -cen <- cen %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Distance"), xaxis=list(title = "Treatment")) -cen <- cen %>% config(toImageButtonOptions=list(format='svg',filename='Dispersion', height= 500, width= 800, scale= 1)) -cen - -if (params$stats == "true" ) { - adn <- adonis(bray.distance~data5$treatment) - adn -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
- -

-## NMDS Plots

- -
- -### 2D NMDS - -
-```{r nmds2d, echo=FALSE} -##NMDS -datax <- decostand(raredata,method="total") #method 'total' normalizes data to sum up to 1 --data5[,2:cols] - -MDS <- metaMDS(sqrt(datax), - distance = "bray",autotransform = FALSE, - k = 2, - maxit = 999, - trymax = params$try, - wascores = TRUE) - -if (MDS$converged == "TRUE") { - -data.scores <- as.data.frame(scores(MDS)) -data.scores$sample <- data5$sample -data.scores.2 <- merge(data.scores, metadata, by="sample") - -p <- ggplot(data.scores.2, aes(x=NMDS1, y=NMDS2,color=treatment))+ - geom_point(size=2)+ - theme_classic()+ - theme(legend.title = element_blank()) - - -#fff <- ggplotly(p) -#fff <- fff %>% layout(legend=list(y=.5)) -#fff - -fff <-plot_ly(data.scores.2, x=~NMDS1, y=~NMDS2, color=~treatment, colors=mycol, text = ~paste("Sample: ", sample)) -fff <- fff %>% layout(legend=list(y=.5)) -fff <- fff %>% config(toImageButtonOptions=list(format='svg',filename='2Dnmds', height= 500, width= 800, scale= 1)) -fff - -} else { - print("No Convergence") -} -``` - -
-
-
-
- -### 3D NMDS - -
- -```{r nmds3d, echo=FALSE} -MDS3 <- metaMDS(sqrt(datax), - distance = "bray",autotransform = FALSE, - k = 3, - maxit = 999, - trymax = params$try, - wascores = TRUE) - -if (MDS3$converged == "TRUE") { - -data.scores3 <- as.data.frame(scores(MDS3)) -data.scores3$sample <- data5$sample -data.scores.3 <- merge(data.scores3, metadata, by="sample") -p3d <- plot_ly(data.scores.3, x= ~ NMDS1, y = ~ NMDS2, z = ~ NMDS3, text = ~paste("Sample: ", sample), - color = ~treatment, colors = mycol, - mode = 'markers', symbol = ~treatment, symbols = c('square','circle'), - marker = list(opacity = .8,line=list(color = 'darkblue',width = 1)) - ) -p3d <- p3d %>% layout(legend=list(y=.5)) -p3d <- p3d %>% config(toImageButtonOptions=list(format='svg',filename='3Dnmds', height= 500, width= 800, scale= 1)) -p3d - -} else { - print("No Convergence") -} -``` -
-
-
-
-
- -

-## OTU Abundance Per Sample

- -```{r long, echo=FALSE} -dataz <- decostand(data5[,2:cols],method="total") #method 'total' normalizes data to sum up to 1 -dataz$sample <- data5$sample -datay <- merge(dataz, metadata, by="sample") -datalong <- datay %>% - tidyr::gather(first:last, key=hit, value=reads) - -##Barplot -## add better colors -#spec_bar <- ggplot(datalong, aes(x=forcats::fct_reorder(timepoint, as.numeric(as.character(timepoint))), -# y=reads, fill=hit))+ #Consider sqrt-transforming data -# geom_bar(aes(), stat="identity", position="fill")+ -# #coord_polar("y", start=0)+ -# theme_classic()+ -# facet_wrap(colony~treatment, nrow=2)+ -# #coord_flip()+ -# labs(x="Timepoint") #+ theme(legend.position = "none") -#spec_bar <- ggplot(datalong, aes(x=sample,y=reads,fill=hit))+geom_bar(aes(), stat="identity", position="fill")+ -# theme(axis.text.x=element_text(angle=90)) -#ggplotly(spec_bar) -#,'#00b300','#00b3b3','#0059b3','#6600ff','#b800e6','#ff3333','#ff8000','#ffff00','#bf8040','#42bcf5','#b428f5','#2e8d7e','#664e7e','#a4c700','#1aa3ff' - -ddd <- plot_ly(datalong, x=~sample, y=~reads, color=~hit, colors=mycol) -ddd <- ddd %>% layout(type='bar', barmode = 'stack') -ddd <- ddd %>% layout(legend = list(x=10,y=.5), xaxis=list(title = "Sample"), yaxis=list(title = "Relative abundance")) -ddd <- ddd %>% config(toImageButtonOptions=list(format='svg',filename='Relative_abundance', height= 500, width= 800, scale= 1)) -ddd -``` -
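For the stacked barplot, counts are first converted to within-sample relative abundances (`decostand(method = "total")` divides each row by its row sum) and then pivoted to long format, one row per sample/hit pair, which is the shape the `plot_ly` stacking expects. A sketch using `pivot_longer()`, the current tidyr spelling of the `gather()` call above (toy matrix and names are hypothetical):

```r
library(vegan)
library(tidyr)

set.seed(9)
counts <- matrix(rpois(30, lambda = 12), nrow = 3,
                 dimnames = list(paste0("S", 1:3), paste0("ASV", 1:10)))

rel <- as.data.frame(decostand(counts, method = "total"))  # rows sum to 1
rel$sample <- rownames(rel)

long <- pivot_longer(rel, cols = -sample,
                     names_to = "hit", values_to = "reads")
head(long)
```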
-
-
-
-
- -

-## OTU Abundance Per Treatment

- -```{r asv_barplot, echo=FALSE} -datalong <- datalong %>% - filter(reads>0) - -#asv_bar <- ggplot(datalong, aes(x=reorder(hit,reads), y=reads, fill=treatment))+ -# geom_bar(stat="identity")+ -# scale_fill_manual(values=c('#088da5','#e19336'))+ #colors not working -# coord_flip()+ -# theme_classic()+ -# theme(axis.title.y = element_blank())+ -# theme(legend.title = element_blank()) - -asp2 <- plot_ly(datalong, y=~hit, x=~reads, color=~treatment, colors=mycol,text = ~paste("Sample: ", sample), opacity=.9) -asp2 <- asp2 %>% layout(type='bar', barmode = 'group') -asp2 <- asp2 %>% layout(yaxis = list(title = '', categoryorder = "total ascending"), legend = list(x=10,y=.5)) -asp2 <- asp2 %>% config(toImageButtonOptions=list(format='svg',filename='Most_abundant_hits_per_sample', height= 500, width= 800, scale= 1)) -asp2 -``` -
-
-
-
-
- -

-## Pairwise Percent-ID Heatmap

- -
-
-```{r heatmap, echo=FALSE} -heatdata=params$heatmap -#heatdata="a.csv" -#heatdata="PVID_vAMPtest1_otu.85_PercentID.matrix" -simmatrix<- read.csv(heatdata, header=FALSE) -rownames(simmatrix) <- simmatrix[,1] -simmatrix <- simmatrix[,-1] -colnames(simmatrix) <-rownames(simmatrix) -cols <- dim(simmatrix)[2] -simmatrix$AA <- rownames(simmatrix) -rval=nrow(simmatrix) -simmatrix2 <- simmatrix %>% - gather(1:rval, key=sequence, value=similarity) -x=reorder(simmatrix2$AA,simmatrix2$similarity) -y=reorder(simmatrix2$sequence,simmatrix2$similarity) -similaritymatrix <- ggplot(simmatrix2, aes(x=x, y=y,fill=similarity))+ - geom_raster()+ - scale_fill_distiller(palette="Spectral")+ - theme(axis.text.x = element_text(angle = 90))+ - theme(axis.title.x=element_blank())+ - theme(axis.title.y=element_blank()) - - -heat <- ggplotly(similaritymatrix) -heat <- heat %>% config(toImageButtonOptions=list(format='svg',filename='heatmap', height= 500, width= 800, scale= 1)) -heat -``` -
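The heatmap chunk reshapes a square pairwise percent-ID matrix into long form, one row per (sequence A, sequence B, similarity) triple, so `geom_raster()` can draw one tile per pair. The same idea in miniature, with a hypothetical 3 x 3 matrix standing in for the `*_PercentID.matrix` file vAMPirus writes:

```r
library(tidyr)
library(ggplot2)

m <- matrix(c(100, 92, 85,
               92, 100, 88,
               85, 88, 100), nrow = 3,
            dimnames = list(c("seqA", "seqB", "seqC"),
                            c("seqA", "seqB", "seqC")))

df <- as.data.frame(m)
df$AA <- rownames(df)

long <- pivot_longer(df, cols = -AA,
                     names_to = "sequence", values_to = "similarity")

ggplot(long, aes(x = AA, y = sequence, fill = similarity)) +
  geom_raster() +
  scale_fill_distiller(palette = "Spectral")
```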
-
-
-
-
- - -

-## Taxonomy Result Visualization

- -
-
- -```{r taxonomy, echo=FALSE} -tax=read.csv(params$tax,header=F) -#tax=read.csv("tax.csv",header=F) -#tax=read.csv("PVID_vAMPtest1_otu.85_summary_for_plot.csv", header=F) -tp <- plot_ly(tax, labels = ~V1, values = ~V2) -tp <- tp %>% add_pie(marker=list(colors=mycol, line=list(color='#000000', width=.5)), hole = 0.6) -tp <- tp %>% layout(title = "Taxonomy distribution", showlegend = F, - xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), - yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE) - ) -tp <- tp %>% config(toImageButtonOptions=list(format='svg',filename='TaxDonut', height= 500, width= 800, scale= 1)) -tp -``` -
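The taxonomy donut is an ordinary plotly pie with a nonzero `hole`: the chunk reads a headerless two-column summary (taxon, sequence count) into the default `V1`/`V2` names and maps them to labels and values. A minimal sketch with fabricated taxa:

```r
library(plotly)

tax <- data.frame(V1 = c("Unclassified", "Herpesvirales", "Picornavirales"),
                  V2 = c(12, 30, 8))

tp <- plot_ly(tax, labels = ~V1, values = ~V2)
tp <- tp %>% add_pie(hole = 0.6)   # hole > 0 turns the pie into a donut
tp <- tp %>% layout(title = "Taxonomy distribution", showlegend = FALSE)
tp
```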
-
-
-
-
-
-
-
-
-
diff --git a/bin/vAMPirus_ReportB.Rmd b/bin/vAMPirus_ReportB.Rmd deleted file mode 100644 index 29b0385..0000000 --- a/bin/vAMPirus_ReportB.Rmd +++ /dev/null @@ -1,725 +0,0 @@ ---- -title: "vAMPirus Analyze Report `r commandArgs(trailingOnly=T)[1]`" -date: "Generated on: `r Sys.time()`" -output: html_document -params: - interactive: TRUE - reads: !r commandArgs(trailingOnly=T)[2] # reads - counts: !r commandArgs(trailingOnly=T)[3] # csv - metadata: !r commandArgs(trailingOnly=T)[4] # metadata - filter: !r commandArgs(trailingOnly=T)[5] # filter min counts - heatmap: !r commandArgs(trailingOnly=T)[6] # heatmap - tax: !r commandArgs(trailingOnly=T)[7] # tax - tree: !r commandArgs(trailingOnly=T)[8] # tree - try: !r commandArgs(trailingOnly=T)[9] # trymax - stats: !r commandArgs(trailingOnly=T)[10] # stats - ---- - - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE, - message = FALSE, - warning = FALSE, - out.width="100%") -``` - -```{r pathways, echo=FALSE} -knitr::include_graphics("vamplogo.png") -``` - -```{r load_libraries, include=FALSE} - -library(vegan) -#library(rstatix) -library(tidyverse) -library(scales) -library(cowplot) -library(dplyr) -library(ggtree) -library(plotly) -#library(BiocParallel) -library(knitr) -library(kableExtra) #install.packages("kableExtra") -library(rmarkdown) -library(processx) #install.packages("processx") -#register(MulticoreParam(4)) -``` - -```{r colors, include=FALSE} -mycol=c('#088da5','#73cdc8','#ff6f61','#7cb8df','#88b04b','#00a199','#6B5B95','#92A8D1','#b0e0e6','#ff7f50','#088d9b','#E15D44','#e19336') -``` -
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -NOTE: Most plots are interactive and you can use the legend to specify samples/treatment of interest. You can also download an .svg version of each figure within this report. ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -
-

-## Pre- and Post-Adapter Removal Read Stats

-
-```{r readstats, echo=FALSE} -reads_stats=read.csv(params$reads) -#reads_stats=read.csv("PVID_final_reads_stats.csv") -paged_table(reads_stats,options = list(rows.print = 20)) - -#test plotly table -fig <- plot_ly( - type = 'table', - header = list(values=names(reads_stats), - align = c('left'), - line = list(width = 1, color = 'black'), - fill = list(color = 'rgb(45, 100, 230)')), - cells = list(values=unname(reads_stats), - line = list(width = 1, color = 'black'), - fill = list(color = 'white'), - align = c('center'), - font = list(color = '#506784', size = 10)) - ) - -#fig -``` -
-
- - -### Total number of reads before and after adapter removal - -```{r readstats_plot, echo=FALSE} -# Plot of reads before and after -ptotal <- plot_ly(type="box",marker=list(colors=mycol)) -ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_before, name="Reads before filtering") -ptotal <- ptotal %>% add_boxplot(y=reads_stats$Total_after, name="Reads after filtering") -#ptotal <- ptotal %>% layout(title=list(text="Number of reads before and after filtering")) -ptotal <- ptotal %>% layout(legend = list(x=10,y=.5)) -ptotal <- ptotal %>% config(toImageButtonOptions=list(format='svg',filename='TotReads_b4_af_adaptrem', height= 500, width= 800, scale= 1)) -ptotal -``` -
- -### Forward (R1) and reverse (R2) read length before and after adapter removal - -```{r readstats_plot2, echo=FALSE} -# Plot of R1 and R2 before and after -pr <- plot_ly(y=reads_stats$R1_before_length, type="box", name="R1 length before") -pr <- pr %>% add_boxplot(y=reads_stats$R1_after_length, name="R1 length after") -pr <- pr %>% add_boxplot(y=reads_stats$R2_before_length, name="R2 length before") -pr <- pr %>% add_boxplot(y=reads_stats$R2_after_length, name="R2 length after") -#pr <- pr %>% layout(title = "R1 and R2 Length") -pr <- pr %>% layout(legend = list(x=10,y=.5)) -pr <- pr %>% config(toImageButtonOptions=list(format='svg',filename='readlen_b4_af_adaptrem', height= 500, width= 800, scale= 1)) -pr -``` -
-
-
-```{r load_datasets, include=FALSE} -sample_name=params$counts -sample_metadata=params$metadata -#sample_name="PVID_vAMPtest1_otu.85_counts.csv" -#sample_metadata="PVID_fiscesTestmeta.csv" -#sample_name="vAMPrun_otu.55_counts.csv" -#sample_metadata="pvid_samples.csv" -data<- read.csv(sample_name, check.names=FALSE) -data2 <-as.data.frame(t(data)) -data2$sample <- row.names(data2) -colnames(data2)<- as.matrix(data2[1,]) -as.data.frame(data2) -data2 <- data2[-1,] - -#X.OTU.ID for X.Sequence. -data2 <- data2 %>% - rename(sample=OTU_ID) -data2dim <- dim(data2) - -##Loading metadata -samples <- read.csv(sample_metadata) - -##Combining data and metadata -data3 <- merge(data2, samples, by="sample") - -dim_data3 <- dim(data3) -dim_samples <- dim(samples) -cols <- dim_data3[2]-dim_samples[2]+1 -first <-colnames(data3)[2] -last <- colnames(data3)[cols] -data3[,2:cols] <- lapply(data3[,2:cols], as.character) -data3[,2:cols] <- lapply(data3[,2:cols], as.numeric) - -#Calculate total reads per sample -data4 <- data3%>% - mutate(sum=select(.,2:cols)%>% - apply(1, sum, na.rm=TRUE)) - -``` - - -

-## Number of Reads Per Sample

- -```{r plot, echo=FALSE} -# sample and count -con <- plot_ly(data4, x = ~sum, y = ~sample, name = "Sample", type = 'scatter', - mode = "markers", marker = list(color = "#088da5"), hovertemplate = paste('Sample: %{y}','
Total reads: %{x}','')) -con <- con %>% layout(xaxis = list(title = "Total reads"),yaxis = list(title = "Sample")) -con <- con %>% config(toImageButtonOptions=list(format='svg',filename='Counts_per_sample', height= 500, width= 800, scale= 1)) -con -``` -
-
-
-```{r filter_data, include=FALSE} -##Filter samples with low reads -nfil=params$filter -data5 <- data4 %>% - filter(sum>nfil) - #can cause errors -data5dim <-dim(data5) -minreads<-min(data5$sum) -``` -
-
-
-
-
- -

-## Rarefaction

-```{r rarefaction, echo=FALSE, cache=FALSE} -##Rarefaction curves -rarefaction <- rarecurve(data5[,2:cols]) - -##rarefied dataset -raredata <- as.data.frame(rrarefy(data5[,2:cols], sample=minreads)) -``` -
-
-
-
-
- - -

-## Diversity Analyses Plots

- -
-
-
-### Shannon diversity
-
-```{r diversity_analysis, echo=FALSE} -metadata <- data5[,(cols+1):data5dim[2]] -metadata$sample <- data5$sample -index <-diversity(raredata, index= "shannon") -shannondata5 <- as.data.frame(index) -shannondata5$sample<- data5$sample -shannondata5_2 <- merge(shannondata5, metadata, by="sample") - -#shannonplot <- ggplot(shannondata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Index",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#sh<-ggplotly(shannonplot) -#sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) -#sh - -sh <- plot_ly(shannondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#sh <- sh %>% layout(title = list(text="Shannon diversty",y=.99)) -sh <- sh %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -sh <- sh %>% config(toImageButtonOptions=list(format='svg',filename='ShannonDiv', height= 500, width= 800, scale= 1)) -sh - -if (params$stats == "true" ) { - shannonaov <- aov(index ~ treatment, data= shannondata5_2) - st <- shapiro.test(resid(shannonaov)) - bt <- bartlett.test(index ~ treatment, data= shannondata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(shannonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= shannondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(shannonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= shannondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= shannondata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(shannondata5_2$index, shannondata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. 
Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(shannonaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
-### Simpson diversity
-
-```{r diversity_analysis2, echo=FALSE} -index <- diversity(raredata, index= "simpson") -simpsondata5 <- as.data.frame(index) -simpsondata5$sample<- data5$sample -simpsondata5_2 <- merge(simpsondata5, metadata, by="sample") - -#simpsonplot <- ggplot(simpsondata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Index",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#s<-ggplotly(simpsonplot) -#s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) -#s - -s <- plot_ly(simpsondata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#s <- s %>% layout(title = list(text="Simpson diversty",y=.99)) -s <- s %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -s <- s %>% config(toImageButtonOptions=list(format='svg',filename='SimpsonDiv', height= 500, width= 800, scale= 1)) -s - -if (params$stats == "true" ) { - simpsonaov <- aov(index ~ treatment, data= simpsondata5_2) - st <- shapiro.test(resid(simpsonaov)) - bt <- bartlett.test(index ~ treatment, data= simpsondata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(simpsonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= simpsondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(simpsonaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= simpsondata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= simpsondata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(simpsondata5_2$index, simpsondata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(simpsonaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
- -### Species Richness - -
-```{r diversity_analysis3, echo=FALSE} -mind5<-min(data5$sum) -index <- rarefy(data5[,2:cols], sample=mind5) -rarerichnessdata5 <- as.data.frame(index) -rarerichnessdata5$sample <-data5$sample -richdata5_2 <- merge(rarerichnessdata5, metadata, by="sample") - -#richnessplot <- ggplot(richdata5_2, aes(x=treatment,color=treatment,y=index))+ -# geom_boxplot()+ -# geom_point()+ -# theme_classic()+ -# labs(y="Richness",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -ri <- plot_ly(richdata5_2, x=~treatment, y=~index, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#ri <- ri %>% layout(title = list(text="ASV Richness",y=.99)) -ri <- ri %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Index"), xaxis=list(title = "Treatment")) -ri <- ri %>% config(toImageButtonOptions=list(format='svg',filename='SpeciesRich', height= 500, width= 800, scale= 1)) -ri - -if (params$stats == "true" ) { - richaov <- aov(index ~ treatment, data= richdata5_2) - st <- shapiro.test(resid(richaov)) - bt <- bartlett.test(index ~ treatment, data= richdata5_2) - - if (st$p.value > .05 && bt$p.value > .05) { - print("Shapiro Test of normality - data is normal p-value > 0.05") - print(shapiro.test(resid(richaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous p-value > 0.05") - print(bartlett.test(index ~ treatment, data= richdata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA Results") - print(summary(richaov)) - #Tukey Honest Significant Differences (pairwise comparison) - significant p <.05 - writeLines("\n--------------------------------------------------------------\n") - print("Tukey HSD - Pairwise comparison - significant differences indicated by p-value < 0.05") - print(TukeyHSD(richaov)) - writeLines("\n--------------------------------------------------------------\n") - } else { - print("Shapiro Test of normality - data is normal if p-value > 0.05") - print(shapiro.test(resid(richaov))) - writeLines("\n--------------------------------------------------------------\n") - print("Bartlett Test variance homogeneity - variance is homogeneous if p-value > 0.05") - print(bartlett.test(index ~ treatment, data= richdata5_2)) - writeLines("\n--------------------------------------------------------------\n") - print("Data either not normal or variance not homogenous") - print("Kruskal-Wallis Test - test significant if p <.05") - #Kruskal-Wallis test - significant p <.05 - mykt <- kruskal.test(index ~ treatment, data= richdata5_2) - print(mykt) - writeLines("\n--------------------------------------------------------------\n") - if (mykt$p.value < .05) { - #Pairwise comparison - print("Wilcox.test - pairwise comparison") - print(pairwise.wilcox.test(richdata5_2$index, richdata5_2$treatment, p.adjust.method = "BH")) - } else { - print("Data not significant. Skipping pairwise comparison") - } - writeLines("\n--------------------------------------------------------------\n") - print("ANOVA - one or more of the assumptions not met, take with a grain of salt.") - print(summary(richaov)) - writeLines("\n--------------------------------------------------------------\n") - } -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
- -

-## Distance To Centroid

-
-```{r distance, echo=FALSE} -##Distance -intermediate <- raredata -bray.distance <- vegdist(sqrt(intermediate), method="bray") - -##Dispersion -disper <- betadisper(bray.distance, group = metadata$treatment, type="centroid") -df <- data.frame(Distance_to_centroid=disper$distances,Group=disper$group) -df$sample <- data5$sample -df2 <- merge(df, metadata, by="sample") - -#p<- ggplot(data=df2,aes(x=treatment,y=Distance_to_centroid,colour=treatment))+ -# geom_boxplot(outlier.alpha = 0)+ -# theme_classic()+ -# geom_point(position=position_dodge(width=0.75))+ -# labs(y="Distance",x="Treatment")+ -# theme(axis.text=element_text(size=12))+ -# theme(legend.position = "none") - -#cen <- ggplotly(p) -#cen <- cen %>% layout(title = list(text="Distance to centroid",y=.99)) -#cen - -cen <- plot_ly(df2, x=~treatment, y=~Distance_to_centroid, color=~treatment, colors=mycol, type="box", boxpoints = "all", pointpos = 0, jitter = 0.5) -#cen <- cen %>% layout(title = list(text="Distance to centroid",y=.99)) -cen <- cen %>% layout(legend = list(x=10,y=.5), yaxis=list(title = "Distance"), xaxis=list(title = "Treatment")) -cen <- cen %>% config(toImageButtonOptions=list(format='svg',filename='Dispersion', height= 500, width= 800, scale= 1)) -cen - -if (params$stats == "true" ) { - adn <- adonis(bray.distance~data5$treatment) - adn -} else { - print("Stats skipped. To toggle on use \"--stats run\" in vAMPirus launch command") -} -``` -
-
-
-
-
- -

-## NMDS Plots

- -
- -### 2D NMDS - -
-```{r nmds2d, echo=FALSE} -##NMDS -datax <- decostand(raredata,method="total") #method 'total' normalizes data to sum up to 1 --data5[,2:cols] - -MDS <- metaMDS(sqrt(datax), - distance = "bray",autotransform = FALSE, - k = 2, - maxit = 999, - trymax = params$try, - wascores = TRUE) - -if (MDS$converged == "TRUE") { - -data.scores <- as.data.frame(scores(MDS)) -data.scores$sample <- data5$sample -data.scores.2 <- merge(data.scores, metadata, by="sample") - -p <- ggplot(data.scores.2, aes(x=NMDS1, y=NMDS2,color=treatment))+ - geom_point(size=2)+ - theme_classic()+ - theme(legend.title = element_blank()) - -#fff <- ggplotly(p) -#fff <- fff %>% layout(legend=list(y=.5)) -#fff - -fff <-plot_ly(data.scores.2, x=~NMDS1, y=~NMDS2, color=~treatment, colors=mycol, text = ~paste("Sample: ", sample)) -fff <- fff %>% layout(legend=list(y=.5)) -fff <- fff %>% config(toImageButtonOptions=list(format='svg',filename='2Dnmds', height= 500, width= 800, scale= 1)) -fff - -} else { - print("No Convergence") -} -``` - -
-
-
-
- -### 3D NMDS - -
- -```{r nmds3d, echo=FALSE} -MDS3 <- metaMDS(sqrt(datax), - distance = "bray",autotransform = FALSE, - k = 3, - maxit = 999, - trymax = params$try, - wascores = TRUE) - -if (MDS3$converged == "TRUE") { - -data.scores3 <- as.data.frame(scores(MDS3)) -data.scores3$sample <- data5$sample -data.scores.3 <- merge(data.scores3, metadata, by="sample") -p3d <- plot_ly(data.scores.3, x=~NMDS1, y=~NMDS2, z=~NMDS3, text=~paste("Sample: ", sample), - color=~treatment, colors=mycol, - mode = 'markers', symbol = ~treatment, symbols = c('square','circle'), - marker = list(opacity = .8,line=list(color = 'darkblue',width = 1)) - ) -p3d <- p3d %>% layout(legend=list(y=.5)) -p3d <- p3d %>% config(toImageButtonOptions=list(format='svg',filename='3Dnmds', height= 500, width= 800, scale= 1)) -p3d - -} else { - print("No Convergence") -} -``` -
-
-
-
-
- -

-## OTU Abundance Per Sample

- -```{r long, echo=FALSE} -dataz <- decostand(data5[,2:cols],method="total") #method 'total' normalizes data to sum up to 1 -dataz$sample <- data5$sample -datay <- merge(dataz, metadata, by="sample") -datalong <- datay %>% - tidyr::gather(first:last, key=hit, value=reads) - -##Barplot -## add better colors -#spec_bar <- ggplot(datalong, aes(x=forcats::fct_reorder(timepoint, as.numeric(as.character(timepoint))), -# y=reads, fill=hit))+ #Consider sqrt-transforming data -# geom_bar(aes(), stat="identity", position="fill")+ -# #coord_polar("y", start=0)+ -# theme_classic()+ -# facet_wrap(colony~treatment, nrow=2)+ -# #coord_flip()+ -# labs(x="Timepoint") #+ theme(legend.position = "none") -#spec_bar <- ggplot(datalong, aes(x=sample,y=reads,fill=hit))+geom_bar(aes(), stat="identity", position="fill")+ -# theme(axis.text.x=element_text(angle=90)) -#ggplotly(spec_bar) -#,'#00b300','#00b3b3','#0059b3','#6600ff','#b800e6','#ff3333','#ff8000','#ffff00','#bf8040','#42bcf5','#b428f5','#2e8d7e','#664e7e','#a4c700','#1aa3ff' - -ddd <- plot_ly(datalong, x=~sample, y=~reads, color=~hit, colors=mycol) -ddd <- ddd %>% layout(type='bar',barmode = 'stack') -ddd <- ddd %>% layout(legend = list(x=10,y=.5), xaxis=list(title = "Sample"), yaxis=list(title = "Relative abundance")) -ddd <- ddd %>% config(toImageButtonOptions=list(format='svg',filename='Relative_abundance', height= 500, width= 800, scale= 1)) -ddd -``` -
-
-
-
-
- -

-## OTU Abundance Per Treatment

- -```{r asv_barplot, echo=FALSE} -datalong <- datalong %>% - filter(reads>0) - -#asv_bar <- ggplot(datalong, aes(x=reorder(hit,reads), y=reads, fill=treatment))+ -# geom_bar(stat="identity")+ -# scale_fill_manual(values=c('#088da5','#e19336'))+ #colors not working -# coord_flip()+ -# theme_classic()+ -# theme(axis.title.y = element_blank())+ -# theme(legend.title = element_blank()) - -asp2 <- plot_ly(datalong, y=~hit, x=~reads, color=~treatment, colors=mycol,text = ~paste("Sample: ", sample), opacity=.9) -asp2 <- asp2 %>% layout(type='bar', barmode = 'group') -asp2 <- asp2 %>% layout(yaxis = list(title = '', categoryorder = "total ascending"), legend = list(x=10,y=.5)) -asp2 <- asp2 %>% config(toImageButtonOptions=list(format='svg',filename='Most_abundant_hits_per_sample', height= 500, width= 800, scale= 1)) -asp2 -``` -
-
-
-
-
- -

  Pairwise Percent-ID Heatmap

- -
-
-```{r heatmap, echo=FALSE} -heatdata=params$heatmap -#heatdata="a.csv" -#heatdata="PVID_vAMPtest1_otu.85_PercentID.matrix" -simmatrix<- read.csv(heatdata, header=FALSE) -rownames(simmatrix) <- simmatrix[,1] -simmatrix <- simmatrix[,-1] -colnames(simmatrix) <-rownames(simmatrix) -cols <- dim(simmatrix)[2] -simmatrix$AA <- rownames(simmatrix) -rval=nrow(simmatrix) -simmatrix2 <- simmatrix %>% - gather(1:rval, key=sequence, value=similarity) -x=reorder(simmatrix2$AA,simmatrix2$similarity) -y=reorder(simmatrix2$sequence,simmatrix2$similarity) -similaritymatrix <- ggplot(simmatrix2, aes(x=x, y=y,fill=similarity))+ - geom_raster()+ - scale_fill_distiller(palette="Spectral")+ - theme(axis.text.x = element_text(angle = 90))+ - theme(axis.title.x=element_blank())+ - theme(axis.title.y=element_blank()) - -heat <- ggplotly(similaritymatrix) -heat <- heat %>% config(toImageButtonOptions=list(format='svg',filename='heatmap', height= 500, width= 800, scale= 1)) -heat -``` -
-
-
-
-
- - -

  Taxonomy Result Visualization

- -
-
- -```{r taxonomy, echo=FALSE} -tax=read.csv(params$tax,header=F) -#tax=read.csv("tax.csv",header=F) -#tax=read.csv("PVID_vAMPtest1_otu.85_summary_for_plot.csv", header=F) -tp <- plot_ly(tax, labels = ~V1, values = ~V2) -tp <- tp %>% add_pie(marker=list(colors=mycol, line=list(color='#000000', width=.5)), hole = 0.6) -tp <- tp %>% layout(title = "Taxonomy distribution", showlegend = F, - xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), - yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE) - ) -tp <- tp %>% config(toImageButtonOptions=list(format='svg',filename='TaxDonut', height= 500, width= 800, scale= 1)) -tp -``` - -
-
-
-
-
- -

  Phylogenetic Tree

- -
-
- -```{r tree, echo=FALSE} -#tree=read.newick("vAMPrun_otu.55.raxml.support") -tree=read.tree(params$tree) -p1 <- ggtree(tree) -id <- tree$tip.label -dat <- tibble::tibble(id = id) -metat <- p1$data %>% dplyr::inner_join(dat, c('label' = 'id')) -p2 <- p1 + geom_point(data = metat, aes(x = x, y = y, label = id)) -ggplotly(p2, tooltip = "label") -``` -This tree is a maximum likelihood tree made with IQTREE2 and the parameters you specified in the vampirus.config file. Also, this is an interactive tree, you can zoom in and hover on nodes to know the sequence ID. For a better visualization of this tree, you can find the *.treefile with bootstrap support values within the results directory and visualize using programs like FigTree or ITOL. -
-
-
-
-
-
-
-
-
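The report chunks removed above simply printed "No Convergence" when metaMDS could not find a stable ordination; as noted in the version 2.0.0 changelog, the Analyze report now falls back to a PCoA in that case. A minimal sketch of what such a fallback can look like (not the shipped report code; it assumes the objects built in the chunks above - datax, data5, metadata, and the color vector mycol):

```r
# Hypothetical PCoA fallback, mirroring the NMDS inputs above.
library(vegan)   # vegdist()
library(plotly)  # plot_ly()

bray <- vegdist(sqrt(datax), method = "bray")   # same Bray-Curtis distances as the NMDS
pcoa <- cmdscale(bray, k = 2, eig = TRUE)       # classical PCoA (metric scaling)

pcoa.scores <- data.frame(PCoA1 = pcoa$points[, 1],
                          PCoA2 = pcoa$points[, 2],
                          sample = data5$sample)
pcoa.scores <- merge(pcoa.scores, metadata, by = "sample")

ppp <- plot_ly(pcoa.scores, x = ~PCoA1, y = ~PCoA2, color = ~treatment,
               colors = mycol, text = ~paste("Sample: ", sample))
ppp <- ppp %>% layout(legend = list(y = .5))
ppp
```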
diff --git a/bin/virtualribosomev2/dna2pep.py b/bin/virtualribosomev2/dna2pep.py old mode 100644 new mode 100755 index 6f69015..73196d6 --- a/bin/virtualribosomev2/dna2pep.py +++ b/bin/virtualribosomev2/dna2pep.py @@ -1,4 +1,6 @@ -#!/usr/bin/env python2 +#!/usr/bin/env python3 + +# Script was modified form original to use python3 # Copyright 2006 Rasmus Wernersson, Technical University of Denmark # @@ -29,60 +31,60 @@ SYNOPSIS dna2pep [options] [input files] [-f outfile] - + DESCRIPTION TRANSLATION: The translation engine of dna2pep has full support for handling degenerate nucleotides (IUPAC definition, e.g. W = A or T, S = G or C). - All translation table defined by the NCBI taxonomy group is included, + All translation table defined by the NCBI taxonomy group is included, and a number of options determining the behaviour of STOP and START codons is avialable. - - INTRON and EXONS: dna2pep natively understands TAB files containing + + INTRON and EXONS: dna2pep natively understands TAB files containing Intron/Exon annotation (gb2tab / FeatureExtract). When translating files containing Intron/Exon structure, dna2pep will annotate the underlying gene-structure in the annotation of the translated sequence. - + Input files can be in FASTA (no Intron/Exon annotation) RAW (single sequence with no header - all non-letters are discarded) or TAB (incluing annotation) FORMAT. The output format will by default be FASTA - for files without annotation and TAB for files including annotation. + for files without annotation and TAB for files including annotation. The file format is autodetected by investigating the first line of the input. If no input files are specified, dna2pep will read from STDIN. - + OPTIONS -F, --outfile - Optional - specify an output file. If no output file is + Optional - specify an output file. If no output file is specified the output will go to STDOUT. - + -O, --outformat - Specify output format (see also the --fasta, --tab, + Specify output format (see also the --fasta, --tab, --report options below): - + FASTA: Fasta format (plain DNA, no sequence annotation) - + TAB: Tab format. Each line contains the following four fields, separated by tabs: name, seq, ann, comment - + See gb2tab (FeatureExtract) for details. - - REPORT: A nice visualization of the results. - - AUTO: [Default] Generate a both a report and sequence output - (use the same format as the one detected from the for + + REPORT: A nice visualization of the results. + + AUTO: [Default] Generate a both a report and sequence output + (use the same format as the one detected from the for the input files). --fasta filename Write output sequences in FASTA format to the specified file. Use '-' to indicate STDOUT. - + --tab filename Write output sequences in TAB format to the specified file. Use '-' to indicate STDOUT. - + --report filename Write report to the specified file. Use '-' to indicate STDOUT. @@ -90,121 +92,121 @@ -m, --matrix tablename/file Use alternative translation matrix instead of the build-in Standard Genetic Code for translation. - - If "tablename" is 1-6,9-16 or 21-23 one of the alternative - translation tables defined by the NCBI taxonomy group will be + + If "tablename" is 1-6,9-16 or 21-23 one of the alternative + translation tables defined by the NCBI taxonomy group will be used. 
- + Briefly, the following tables are defined: ----------------------------------------- - 1: The Standard Code - 2: The Vertebrate Mitochondrial Code - 3: The Yeast Mitochondrial Code - 4: The Mold, Protozoan, and Coelenterate Mitochondrial Code - and the Mycoplasma/Spiroplasma Code - 5: The Invertebrate Mitochondrial Code - 6: The Ciliate, Dasycladacean and Hexamita Nuclear Code - 9: The Echinoderm and Flatworm Mitochondrial Code - 10: The Euplotid Nuclear Code - 11: The Bacterial and Plant Plastid Code - 12: The Alternative Yeast Nuclear Code - 13: The Ascidian Mitochondrial Code - 14: The Alternative Flatworm Mitochondrial Code - 15: Blepharisma Nuclear Code - 16: Chlorophycean Mitochondrial Code - 21: Trematode Mitochondrial Code - 22: Scenedesmus obliquus mitochondrial Code - 23: Thraustochytrium Mitochondrial Code - + 1: The Standard Code + 2: The Vertebrate Mitochondrial Code + 3: The Yeast Mitochondrial Code + 4: The Mold, Protozoan, and Coelenterate Mitochondrial Code + and the Mycoplasma/Spiroplasma Code + 5: The Invertebrate Mitochondrial Code + 6: The Ciliate, Dasycladacean and Hexamita Nuclear Code + 9: The Echinoderm and Flatworm Mitochondrial Code + 10: The Euplotid Nuclear Code + 11: The Bacterial and Plant Plastid Code + 12: The Alternative Yeast Nuclear Code + 13: The Ascidian Mitochondrial Code + 14: The Alternative Flatworm Mitochondrial Code + 15: Blepharisma Nuclear Code + 16: Chlorophycean Mitochondrial Code + 21: Trematode Mitochondrial Code + 22: Scenedesmus obliquus mitochondrial Code + 23: Thraustochytrium Mitochondrial Code + See http://www.ncbi.nlm.nih.gov/Taxonomy [Genetic Codes] for a detailed description. Please notice that the table of start codons is also used (see the --allinternal option below for details). - + If a filename is supplied the translation table is read from - file instead. - + file instead. + The file should contain one line per codon in the format: - + codonaa-single letter code - - All 64 codons must be included. Stop codons is specified + + All 64 codons must be included. Stop codons is specified by "*". T and U is interchangeable. Blank lines and lines starting with "#" are ignored. - + See the "gcMitVertebrate.mtx" file in the dna2pep source distribution for a well documented example. -r x, --readingframe=x Specify the reading frame. For input files in TAB format this options is ignored, and the reading frame is build from the - annotated Intron/Exon structure. - + annotated Intron/Exon structure. + 1: Reading frame 1 (e.g. ATGxxxxxx). DEFAULT. 2: Reading frame 2 (e.g. xATGxxxxx). 3: Reading frame 3 (e.g. xxATGxxxx). - + -1: Reading frame 1 on the minus strand. -2: Reading frame 2 on the minus strand. -3: Reading frame 3 on the minus strand. - - all: Try all reading frames. + + all: Try all reading frames. This option also implies the -x option. - + plus: All positive reading frames. This option also implies the -x option. - + minus: All negative reading frames. This option also implies the -x option. - + -o mode, --orf mode - Report longest ORF in the reading frame(s) specified with the + Report longest ORF in the reading frame(s) specified with the -r option. - - Mode governs which criterias are used to allow the opening of - an ORF. "Strict start codons" => codons _always_ coding for - methione (e.g. ATG in the standard code), "Minor start codons" - => codon only coding for methionine at the start positon - (e.g. TTG in the standard genetic code). - + + Mode governs which criterias are used to allow the opening of + an ORF. 
"Strict start codons" => codons _always_ coding for + methione (e.g. ATG in the standard code), "Minor start codons" + => codon only coding for methionine at the start positon + (e.g. TTG in the standard genetic code). + Mode can be: ------------ strict: Open an ORF at "strict start codons" only. any: Open an ORF at any start codon. - none: Do not use start codons - look for the longest + none: Do not use start codons - look for the longest fragment before a STOP codon. - - The DNA fragment usedfor encoding the ORF will be added to the + + The DNA fragment usedfor encoding the ORF will be added to the comment field (TAB format only). - + -a, --allinternal By default the very first codon in each sequences is assumed to be the initial codon on the transcript. This means certain - non-methionine codons actually codes for metionine at this + non-methionine codons actually codes for metionine at this position. For example "TTG" in the standard genetic code (see above). - - Selecting this option treats all codons as internal codons. - + + Selecting this option treats all codons as internal codons. + -x, --readthroughstop Allow the translation to continue after a stop codon is reached. The stop codon will be marked as "*". - + -p, --plain, --ignoreannotation Ignore annotation for TAB files. If this options is selected TAB files will be treated in same way as FASTA files. - -c, --comment + -c, --comment Preserve the comment field in TAB files. Normally the comment field is silently dropped, since it makes no sense for FASTA files. - + -C, --processcomment Works as the -c option described above, except a bit of intelligent parsing is done on the comment field: If a "/spliced_product" sub-field is found (from TAB files create by FeatureExtract / gb2tab) only the part of the comment field before the DNA specific information - is kept in the comment field. + is kept in the comment field. -e, --exonstructure Default for TAB files. Annotate the underlying exons structure @@ -212,19 +214,19 @@ are fully or partially encoded within the first exon get the annotation character "1", positions in the secon exon get the character "2" etc. - + The hex-decimal system is used, which means up to 15 exons can be uniquely annotated, before the numbering wraps around to "0". - + -i, --intronphase Annotate where an intron interrupted the DNA sequences, and how the intron did cut the readingframe. - + 0 : phase-0 intron (inbetween the previous and current position). 1 : phase-1 intron. 2 : phase-2 intron. - - + + AUTHOR Rasmus Wernersson, raz@cbs.dtu.dk Feb-Mar 2006 @@ -234,11 +236,11 @@ WEB PAGE http://www.cbs.dtu.dk/services/VirtualRibosome/ - + REFERENCE Rasmus Wernersson Virtual Ribosome - Comprehensive DNA translation tool. 
- Submitted to Nucleic Acids Research, 2006 + Submitted to Nucleic Acids Research, 2006 """ import sys, re, mod_translate,string @@ -248,7 +250,7 @@ complDNA = "TAACGRYWSMKVHDBN" allValid = validDNA+validDNA.lower() -transTable = string.maketrans(validDNA+validDNA.lower(),complDNA+complDNA.lower()) +transTable = str.maketrans(validDNA+validDNA.lower(),complDNA+complDNA.lower()) pwidth = 90 @@ -271,41 +273,41 @@ def makePretty(title,vals,labels,max_len): val = vals[j] lab = labels[j] pos = min(max_len,i+pwidth) - if lab: + if lab: spos = "%d" % pos else: spos = "" - + s = "%-2s %s %s\n" % (lab,val[i:pos],spos) l.append(s) l.append("\n") - return l + return l def explodePep(s): l = list(s) return " "+" ".join(l)+" " - + def revCom(dna): l = list(dna) l.reverse() rev_dna = "".join(l) return rev_dna.translate(transTable) - + def dnaComplement(dna): return dna.translate(transTable) - + def revStr(s): l = list(s) l.reverse() return "".join(l) - + def isDNAValid(dna): for c in dna: if not c in allValid: return False - + return True - + def combineToTab(name, seql): seq = "".join(seql) ann= "."*len(seq) @@ -320,19 +322,19 @@ def readRaw(lines): sl = [] for c in line.strip().upper(): if c.isalpha(): sl.append(c) - + seql.append("".join(sl)) - + l.append(combineToTab("Seq1",seql)) - return l - + return l + # Read FASTA format def readFasta(lines): l = [] - + name = "" seql = [] - + for line in lines: line = line.strip() if line.startswith(">"): @@ -342,104 +344,105 @@ def readFasta(lines): seql = [] else: seql.append(line) - + if name: l.append(combineToTab(name,seql)) - + return l -# read TAB format +# read TAB format def readTab(lines): l = [] - + for line in lines: line = line.strip() tokens = line.split("\t") if len(tokens) < 2: continue - + name = tokens[0] seq = tokens[1] - + if len(tokens) > 2: ann = tokens[2] else: ann = "."*len(seq) - + if len(tokens) > 3: com = tokens[3] else: com = "" - + l.append( (name,seq,ann,com) ) - + return l - + def readInput(lines): if not lines: return ([], True) - - line = lines[0] + + line = lines[0] tokens = line.split("\t") - + if line.startswith(">") and len(tokens) < 3: l = readFasta(lines) isFasta = True - + elif len(tokens) == 1: l = readRaw(lines) isFasta = True - + else: l = readTab(lines) isFasta = False - + return (l, isFasta) - + def writeFasta(seqs, outstream): for (name, seq, ann, com) in seqs: - print >> outstream, ">"+name + print(">"+name, file=outstream) for i in range(0,len(seq),60): - print >> outstream, seq[i:i+60] - + print(seq[i:i+60], file=outstream) + def writeTab(seqs, outstream): for tokens in seqs: - print >> outstream, "\t".join(tokens) + print("\t".join(tokens), file=outstream) def openForWriteOrDie(outfile): try: - outstream = file(outfile,"w") + outstream = open(outfile,"w") - except IOError, (strerror): - print >> sys.stderr, "ERROR - cannot write to the specified file %s [%s]" % (outfile,strerror) + except IOError as xxx_todo_changeme: + (strerror) = xxx_todo_changeme + print("ERROR - cannot write to the specified file %s [%s]" % (outfile,strerror), file=sys.stderr) sys.exit(-1) - + return outstream def parseOpts(): - # Quick hack to overrule the -h and --help feature - # build into the optpase module - if "-h" in sys.argv or "--help" in sys.argv: - print __doc__ - sys.exit(0) + # Quick hack to overrule the -h and --help feature + # build into the optpase module + if "-h" in sys.argv or "--help" in sys.argv: + print(__doc__) + sys.exit(0) parser = OptionParser() - + # File handling parser.add_option("-F","--outfile", 
type="string", dest="outfile", default="") parser.add_option("-O","--outformat", type="string", dest="outformat", default="AUTO") parser.add_option("--tab", type="string", dest="tabfile", default="") parser.add_option("--fasta", type="string", dest="fastafile", default="") parser.add_option("--report", type="string", dest="reportfile", default="") - + # Matrix parser.add_option("-m","--matrix", type="string", dest="matrix", default="1") - + # Reading frame parser.add_option("-r","--readingframe", type="string", dest="readingframe", default="1") - - # ORF finding + + # ORF finding parser.add_option("-o","--orf", type="string", dest="orf", default="") - + # Stop and Start codons parser.add_option("-a","--allinternal", action="store_true", dest="allinternal", default = False) parser.add_option("-x","--readthroughstop", action="store_true", dest="readthroughstop", default = False) @@ -453,40 +456,40 @@ def parseOpts(): parser.add_option("-c","--comment", action = "store_true", dest = "keepcomment", default=False) parser.add_option("-C","--processcomment", action = "store_true", dest = "processcomment", default=False) - # Debug - parser.add_option("-d","--debug", action="store_true", dest="debug", default=False) - - (opt, args) = parser.parse_args() - + # Debug + parser.add_option("-d","--debug", action="store_true", dest="debug", default=False) + + (opt, args) = parser.parse_args() + # Check reading frame if not opt.readingframe in ["1", "2", "3", "-1", "-2", "-3", "all", "plus","minus"]: sys.stderr.write("Invalid reading frame [%s]\n" % opt.readingframe) sys.exit(-1) - + if opt.readingframe in ["all","plus","minus"]: opt.readthroughstop = True - + # Chech ORF mode if not opt.orf in ["","strict","any","none"]: - print >> sys.stderr, "Invalid ORF mode [%s]\n" % opt.orf + print("Invalid ORF mode [%s]\n" % opt.orf, file=sys.stderr) sys.exit(-1) - + # Chech output format opt.outformat = opt.outformat.upper() if not opt.outformat in ["AUTO","TAB","FASTA","REPORT"]: - print >> sys.stderr("Invalid output format [%s]\n" % opt.outformat) + print(file=sys.stderr("Invalid output format [%s]\n" % opt.outformat)) sys.exit(-1) - + # Check mutually exclusive options if opt.intronrf: opt.exonann=False - + return (opt, args) if __name__ == "__main__": reports = [] pepseqs = [] opt, args = parseOpts() - + # Initialize translation matrix mtx = mod_translate.parseMatrixFile(opt.matrix) if not mtx: @@ -504,16 +507,16 @@ def parseOpts(): lines = [] for fn in args: try: - lines += (file(fn,"r").readlines()) + lines += (open(fn,"r").readlines()) except: sys.stderr.write("ERROR: Cannot read from file '%s'\n" % fn) sys.exit(-1) - + try: (seq_list, isFasta) = readInput(lines) - except Exception, msg: - print >> sys.stderr, "ERROR parsing input files. Please verify the format (FASTA, RAW or TAB)" - print >> sys.stderr, "[%s]" % (str(msg)) + except Exception as msg: + print("ERROR parsing input files. Please verify the format (FASTA, RAW or TAB)", file=sys.stderr) + print("[%s]" % (str(msg)), file=sys.stderr) sys.exit(-1) # Ignore annotation? 
@@ -525,36 +528,36 @@ def parseOpts(): for (name, dna, ann, com) in seq_list: seq_proc = ann_proc = exnum_proc = "" - + # Test if the DNA sequence is valid if not isDNAValid(dna): sys.stderr.write("Non IUPAC characters is detected in sequence '%s' - skipping this entry\n" %name) #sys.stderr.write("seq: %s\n" % seq) continue - + # Files without intron/exon annotation ------------------------------------------- if isFasta: # reports.append([ORF_ANNOTATION]) - + d_collect = {} if opt.readingframe in ["all"]: rf_list = ["1","2","3","-1","-2","-3"] echo_rf = True - + elif opt.readingframe == "plus": rf_list = ["1","2","3"] echo_rf = True - + elif opt.readingframe == "minus": rf_list = ["-1","-2","-3"] echo_rf = True - + else: rf_list = [opt.readingframe] echo_rf = False - + for rf in rf_list: - + # Find current reading frame if rf == "1": qseq = dna @@ -574,19 +577,19 @@ def parseOpts(): # Do the actual translation pep = mod_translate.translate(qseq,mtx,not opt.allinternal,opt.readthroughstop) pa = mod_translate.annotate(qseq,mtx) - + # The annotation string may be longer that the peptide, if the -x more is not used - pa = pa[:len(pep)] + pa = pa[:len(pep)] # Store translated sequence if echo_rf: cname = name+"_rframe"+rf else: cname = name - + data = ( cname,pep,pa,qseq ) d_collect[rf] = data - + # Do ORF finding? if opt.orf: #Find longest ORF @@ -594,7 +597,7 @@ def parseOpts(): bestspan = (0,0) bestdata = None bestrf = "" - for key in d_collect.keys(): + for key in list(d_collect.keys()): data = d_collect[key] seq = data[1] ann = data[2] @@ -617,52 +620,52 @@ def parseOpts(): j = j_strict else: j = min(j_strict,j_any) - + if j == -1: break - + #Step 2 - find stop m = ann.find("*",j) if m == -1: m = len(ann) - - #print j,m - + + #print j,m + if (m - j) > bestlen: bestlen = m - j bestspan = (j, m) bestdata = data bestrf = key - + #print rf, bestlen #print seq[j:m+1] #print ann[j:m+1] - + j = m # Format the best hit if not bestdata: - bestdata = d_collect.values()[0] + bestdata = list(d_collect.values())[0] bestspan = (0,0) - bestrf = d_collect.keys()[0] - + bestrf = list(d_collect.keys())[0] + msg = "NO ORF FOUND (given the criteria '%s') for sequence '%s'\n\n" % (opt.orf,name) - reports.append([msg]) - + reports.append([msg]) + # clist = list(bestdata[1].lower()) name = bestdata[0] pep = bestdata[1] ann = bestdata[2] dna_work = bestdata[3] - + bpos, epos = bestspan orf_dna = dna_work[bpos*3:epos*3] orf = mod_translate.translate(orf_dna,mtx,True,False) new_pep = " "*bpos + orf + " "*(len(pep)-epos) - + name += "_ORF" - + d_collect = {} d_collect[bestrf] = (name,new_pep,ann,orf_dna) #print d_collect - + # Processing and Pretty printing... 
if opt.readingframe in ["all","plus","minus"] and (not opt.orf): if opt.readingframe == "all": @@ -677,17 +680,17 @@ def parseOpts(): dna_plus = dna dna_minus = revCom(dna) - + #dnapl = list(dna_plus) #dnaml = list(dna_minus) # dnapann = [" "]*len(dna_plus) # dnamann = [" "]*len(dna_minus) dnapann = ["."]*len(dna_plus) dnamann = ["."]*len(dna_minus) - + vals = [] labels = [] - + if doPlus: pep1 = d_collect["1"][1] ann1 = d_collect["1"][2] @@ -720,8 +723,8 @@ def parseOpts(): #dna_plus = "".join(dnapl) vals += [pep3e,pep2e,pep1e,dna_plus,"".join(dnapann)] labels += ["","","","5'",""] - - if doMinus: + + if doMinus: pepm1 = d_collect["-1"][1] annm1 = d_collect["-1"][2] pepm1e = explodePep(pepm1)+" "+" " @@ -741,7 +744,7 @@ def parseOpts(): peps = [pepm1,pepm2,pepm3] maxlen = max(len(pepm1),len(pepm2)) maxlen = max(maxlen,len(pepm3)) - + for i in range(0,maxlen): for j in range(0,3): ann = anns[j] @@ -766,37 +769,37 @@ def parseOpts(): # vals = [pep3,pep2,pep1,dna_plus,revStr(dna_minus),revStr(pepm1),revStr(pepm2),revStr(pepm3)] # labels = ["","","","5'","3'","","",""] title = "%s - reading frame(s): %s" % (name,opt.readingframe) - + l = makePretty(title,vals,labels,len(dna_plus)) - + ###print "".join(l) reports.append(l) - + else: - rf_list = d_collect.keys() + rf_list = list(d_collect.keys()) rf = rf_list[0] - + vals = [] labels = [] title = "%s\nReading frame: %s" % (name,rf) - + #rf_int = abs(int(rf)) - 1 pep = d_collect[rf][1] pepx = explodePep(pep) - + vals = [] labels = [] - + vals.append(pepx) labels.append("") - + pa = d_collect[rf][2] pax = ["."]*len(pa)*3 if rf.startswith("-"): dna_minus = revCom(dna) #print len(dna_minus) dnamann = ["."] * len(dna_minus) - if rf == "-1": + if rf == "-1": pepm = d_collect["-1"][1] annm = d_collect["-1"][2] pepme = explodePep(pepm)+" "+" " @@ -811,9 +814,9 @@ def parseOpts(): annm = d_collect["-3"][2] pepme = " "+explodePep(pepm)+" "+" " j = 2 - + pepme = pepme[:len(dna_minus)] - + for i in range(0,len(pepm)): ac = annm[i] if ac in ["M","m","*"]: @@ -852,42 +855,42 @@ def parseOpts(): if ac in ["M","m","*"]: dnapos = (i*3) + j if ac == "M": c = ">" - elif ac == "m": c = ")" + elif ac == "m": c = ")" else: c = "*" for k in range(dnapos,dnapos+3): dnapann[k] = c - + ann = "".join(dnapann) vals = [pepx,dna_plus,ann] - labels = ["","5'",""] - - + labels = ["","5'",""] + + l = makePretty(title,vals,labels,len(dna)) reports.append(l) ###print "".join(l) - - - for rf in rf_list: + + + for rf in rf_list: name,seq,ann,com = d_collect[rf] new_com = "" - + # Seq may contain leding and trailing spaces if we are in ORF finde mode if opt.orf: new_seq = [] new_ann = [] for i in range(0,len(seq)): - if seq[i] <> " ": + if seq[i] != " ": new_seq.append(seq[i]) new_ann.append(ann[i]) - + seq = "".join(new_seq) ann = "".join(new_ann) - + new_com = '/orf_mode="%s"; /dna="%s";' % (opt.orf,com) - + pepseqs.append( (name,seq,ann,new_com) ) # pepseqs.append( (cname,pep,pa,com.strip()) ) - + # Files with intron/exon annotation --------------------------------------------- else: exon_count = 1 @@ -895,17 +898,17 @@ def parseOpts(): start, end = mo.span() seq_proc += dna[start:end] ann_proc += ann[start:end] - + exnum_chr = "%x" % exon_count exnum_proc += exnum_chr * (end-start) exon_count = (exon_count + 1) % 0x10 - + # DEBUG if opt.debug: - print seq_proc - print ann_proc - print rf_proc - print exnum_proc + print(seq_proc) + print(ann_proc) + print(rf_proc) + print(exnum_proc) # Process comments if opt.keepcomment or opt.processcomment: @@ -928,37 +931,37 @@ def 
parseOpts(): rf_proc = "012" * (len(ann_proc) / 3) rf_proc += "012"[:len(ann_proc) % 3] - for i in range(1,len(pa)*3): # Skip first position + for i in range(1,len(pa)*3): # Skip first position if ann_proc[i] == "(": pa[i/3] = rf_proc[i] # Store translated sequence prot_ann = "".join(pa) pepseqs.append( (name,pep,prot_ann,com.strip()) ) - + # Pretty printing for the report title = "%s - " % (name) if opt.exonann: title += "translation and annotation of the exonic structure" - + elif opt.intronrf: title += "translation and annotation of the position and phase of the introns" - + vals = [pep,prot_ann] labels = ["pep:","ann:"] l = makePretty(title,vals,labels,len(pep)) reports.append(l) - + # Output the results ------------------------------------------------------- # Step 1) Combined results. if opt.outfile: outstrem = openForWriteOrDie(opt.outfile) else: outstream = sys.stdout - + if opt.outformat in ["REPORT","AUTO"]: for l in reports: outstream.writelines(l) - print >> outstream, "//" - + print("//", file=outstream) + if "TAB" == opt.outformat: outFasta = False elif "FASTA" == opt.outformat: @@ -970,24 +973,24 @@ def parseOpts(): writeFasta(pepseqs,outstream) else: writeTab(pepseqs,outstream) - - #outstream.close() + + #outstream.close() # Step 2) Write specifik sub-result if requested if opt.reportfile: if opt.reportfile == "-": outstream = sys.stdout else: outstream = openForWriteOrDie(opt.reportfile) - + for l in reports: outstream.writelines(l) - + if opt.fastafile: if opt.fastafile == "-": outstream = sys.stdout else: outstream = openForWriteOrDie(opt.fastafile) - + writeFasta(pepseqs,outstream) - + if opt.tabfile: if opt.tabfile == "-": outstream = sys.stdout else: outstream = openForWriteOrDie(opt.tabfile) - + writeTab(pepseqs,outstream) diff --git a/bin/virtualribosomev2/mod_translate.py b/bin/virtualribosomev2/mod_translate.py old mode 100644 new mode 100755 index c9017f9..af9af91 --- a/bin/virtualribosomev2/mod_translate.py +++ b/bin/virtualribosomev2/mod_translate.py @@ -1,4 +1,6 @@ -#!/usr/local/python/bin/python +#!/usr/bin/env python3 + +# Script was modified form original to use python3 # Copyright 2002,2003,2004,2005 Rasmus Wernersson, Technical University of Denmark # @@ -35,7 +37,7 @@ def __init__(self): self.description = "" self.d_all = {} self.d_first = {} - + def toString(self): return "Description: %s\nd_all: %s\nd_first: %s" % (self.description,str(self.d_all),str(self.d_first)) @@ -96,7 +98,7 @@ def toString(self): iupac["N"] = "ACGT" #aNy -alphaDNA = "ACGTRYMKWSBDHVN" +alphaDNA = "ACGTRYMKWSBDHVN" alphaDNAStrict = "ACGT" alphaPep ="*ACDEFGHIKLMNPQRSTVWY" @@ -112,19 +114,19 @@ def parseNcbiTable(lines): for line in lines.split("\n"): line = line.strip() #print "!!!"+line - + if line.startswith("name ") and (desc == ""): dRec.description += line.split('"')[1]+" " - + elif line.startswith("id "): tab_id = line.split()[1] - + elif line.startswith("ncbieaa"): aa_all = line.split('"')[1] - + elif line.startswith("sncbieaa"): aa_first = line.split('"')[1] - + c = 0 for b1 in "TCAG": for b2 in "TCAG": @@ -134,7 +136,7 @@ def parseNcbiTable(lines): aaf = aa_first[c] if aaf == "-": aaf = aa_all[c] dRec.d_first[codon] = aaf - + c += 1 result[tab_id] = dRec dRec = TransTableRec() @@ -145,51 +147,51 @@ def parseMatrixLines(iterator): result = {} for line in iterator: line = line.strip() - - if not line: + + if not line: continue # Ignore blank lines - if line.startswith("#"): + if line.startswith("#"): continue # Ignore comment lines - + tokens = line.split() try: codon, aa = 
tokens - + # Skip invalid entries - if len(codon) <> 3: + if len(codon) != 3: badCodon = 1 else: codon = codon.upper().replace("U","T") - for c in codon: + for c in codon: if not c in alphaDNAStrict: badCodon = 1 badCodon = 0 - + if badCodon: raise "Bad codon: %s [%s]" % (codon,line) - - - if len(aa) <> 1: #or (not aa in alphaPep): + + + if len(aa) != 1: #or (not aa in alphaPep): raise "Bad aa: %s [%s]" % (aa,line) - + result[codon] = aa - - except Exception, e: + + except Exception as e: if DEBUG: sys.stderr.write("Matrix Error - %s\n" % e) - + if len(d) != 64 and DEBUG: sys.stderr.write("Matrix Error - size of matrix differs from 64 [%i]\n" % len(d)) - + return result - + def parseMatrixFile(filename): - if d_ncbi_table.has_key(filename): + if filename in d_ncbi_table: dRec = d_ncbi_table[filename] return dRec - + dRec = TransTableRec() - dRec.d_all = parseMatrixLines(open(filename,"r").xreadlines()) + dRec.d_all = parseMatrixLines(open(filename,"r")) dRec.d_first = dRec.d_all dRec.description = "Custom translation table '%s'" % filename return dRec @@ -208,29 +210,29 @@ def trim_old(seq): # Assumption: Degenerate codons are rare - speed is not an issue def decode(codon,dRec,isFirst): if len(codon) != 3: return [] - + #Use the relevant translation table if isFirst: d_gc = dRec.d_first else: d_gc = dRec.d_all - + #Check for the simple case: this is a standard non-degenerate codon - if d_gc.has_key(codon): + if codon in d_gc: return [ d_gc[codon]] - + #The codon is to some degree degenerate - start the whole recursive scheme result = [] - + for i in range(0,3): p = iupac[codon[i]] if (len(p) > 1): for c in p: result += (decode(codon[0:i]+c+codon[i+1:3],dRec,isFirst)) return result - + if len(p) == 0: return [] # Unknown/illegal char #return [d[codon]] - + def condense(lst): result = [] for e in lst: @@ -239,7 +241,7 @@ def condense(lst): def translate(seq,transRec): return translate(seq,transRec,True,True) - + def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon): debug = False if not transRec: @@ -248,14 +250,14 @@ def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon): result = [] seq = trim(seq) if firstIsStartCodon: isFirst = True - else: isFirst = False + else: isFirst = False for i in range(0,len(seq),3): aa = condense(decode(seq[i:i+3],transRec,isFirst)) - if debug: print seq[i:i+3], aa + if debug: print(seq[i:i+3], aa) if aa: - if aa[0] == "*" and not readThroughStopCodon: + if aa[0] == "*" and not readThroughStopCodon: break - + if len(aa) == 1: result.append(aa[0]) else: s = string.join(aa,"") @@ -264,12 +266,12 @@ def translate(seq,transRec,firstIsStartCodon,readThroughStopCodon): else : result.append("X") #Any #print seq[i:i+3] isFirst = False - - pepseq = "".join(result) + + pepseq = "".join(result) if debug: - print pepseq + print(pepseq) return pepseq - + # Annotate all possible start and stop codons def annotate(seq,transRec): debug = False @@ -279,18 +281,18 @@ def annotate(seq,transRec): for i in range(0,len(seq),3): codon = seq[i:i+3] aa = condense(decode(codon,transRec,True)) - if debug: print seq[i:i+3], aa + if debug: print(seq[i:i+3], aa) if aa: - if "*" in aa: + if "*" in aa: result.append("*") elif ["M"] == aa: aa_int = condense(decode(codon,transRec,False)) if aa_int == ["M"]: result.append("M") else: result.append("m") - else: - result.append(".") - - return "".join(result) + else: + result.append(".") + + return "".join(result) #def translate(seq): # translate(seq,None) @@ -298,10 +300,10 @@ def annotate(seq,transRec): try: import 
ncbi_genetic_codes d_ncbi_table = parseNcbiTable(ncbi_genetic_codes.ncbi_gc_table) - + except: pass - + if __name__ == "__main__": for line in sys.stdin.readlines(): - print translate(line,None,True,False) + print(translate(line,None,True,False)) diff --git a/docs/HelpDocumentation.md b/docs/HelpDocumentation.md index 5ec083b..20fb68b 100644 --- a/docs/HelpDocumentation.md +++ b/docs/HelpDocumentation.md @@ -1,56 +1,52 @@ ![vAMPirus logo](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vamplogo.png) -# Table of contents -* [Introduction to vAMPirus](#Introduction-to-vAMPirus) - * [Contact/support](#Contact/support) - * [Who to cite](#Who-to-cite) -* [Getting started with vAMPirus](#Getting-started-with-vAMPirus) - * [Order of operations](#Order-of-operations) - * [Windows OS users](#Windows-OS-users) - * [MacOS users](#MacOS-users) - * [Installing and running the VM](#Installing-and-running-the-VM-on-MacOS) - * [Install Homebrew](#Install-Homebrew) - * [Install Vagrant and Virtual Box](#Install-Vagrant-and-Virtual-Box) - * [Building and starting](#Building-and-starting-your-virtual-environment) - * [Transferring files](#Transferring-files-to-and-from-VM-with-Vagrant-scp) -* [Installing vAMPirus](#Installing-vAMPirus) - * [Cloning the repository](#Cloning-the-repository-(skip-if-you-generated-the-Vagrant-virtual-environment)) - * [Setting up vAMPirus](#Setting-up-vAMPirus-dependencies-and-checking-installation) - * [Databases](#Databases) -* [Running vAMPirus](#Running-vAMpirus) - * [Testing vAMPirus installation](#Testing-vAMPirus-installation) - * [Containers](#Using-Singularity) -* [Extra notes](#Things-to-know-before-running-vAMPirus) - - # Introduction to vAMPirus Viruses are the most abundant biological entities on the planet and with advances in next-generation sequencing technologies, there has been significant effort in deciphering the global virome and its impact in nature (Suttle 2007; Breitbart 2019). A common method for studying viruses in the lab or environment is amplicon sequencing, an economic and effective approach for investigating virus diversity and community dynamics. The highly targeted nature of amplicon sequencing allows in-depth characterization of genetic variants within a specific taxonomic grouping facilitating both virus discovery and screening within samples. Although, the high volume of amplicon data produced combined with the highly variable nature of virus evolution across different genes and virus-types can make it difficult to scale and standardize analytical approaches. Here we present vAMPirus (https://github.com/Aveglia/vAMPirus.git), an automated and easy-to-use virus amplicon sequencing analysis program that is integrated with the Nextflow workflow manager facilitation easy scalability and standardization of analyses. +![vAMPirus general workflow](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vAMPirus_generalflow.png) + The vAMPirus program contains two different pipelines: -1. DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform future analyses +1. 
DataCheck pipeline: provides the user an interactive html report file containing information regarding sequencing success per sample as well as a preliminary look into the clustering behavior of the data which can be leveraged by the user to inform analyses

-![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_datacheckUPDATED.png)
+![vAMPirus DataCheck](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_datacheckV2.png)

 2. Analyze pipeline: a comprehensive analysis of the provided data producing a wide range of results and outputs which includes an interactive report with figures and statistics. NOTE- stats option has changed on 2/19/21; you only need to add "--stats" to the launch command without "run"
-
-![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirus/master/example_data/conf/vampirusflow_analysisUPDATED.png)
-
+![vAMPirus Analyze](https://raw.githubusercontent.com/Aveglia/vAMPirusExamples/main/vampirusflow_analyzeV2.png)

 ## Contact/support

 If you have a feature request or any feedback/questions, feel free to email vAMPirusHelp@gmail.com or you can open an Issue on GitHub.

+## Changes in version 2.0.0
+
+1. (EXPERIMENTAL) Added Minimum Entropy Decomposition analysis using the oligotyping program produced by the Meren Lab. This allows for sequence clustering based on sequence positions of interest (biologically meaningful) or top positions with the highest Shannon's Entropy (read more here: https://merenlab.org/software/oligotyping/ ; and below).
+
+2. Added more useful taxonomic classification of sequences leveraging the RVDB annotation database and/or NCBI taxonomy files (read more below).
+
+3. Replaced the use of MAFFT with muscle v5 (Edgar 2021) for more accurate virus gene alignments (see https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full).
+
+4. Added multiple primer pair removal to deal with multiplexed amplicon libraries.
+
+5. ASV filtering - you can now provide a "filter" and "keep" database to remove certain sequences from the analysis.
+
+6. Reduced redundancy of processes and the volume of generated result files per full run (Example - read processing only done once if running DataCheck then Analyze).
+
+7. Color nodes on phylogenetic trees based on Taxonomy or Minimum Entropy Decomposition results.
+
+8. PCoA plots added to report if NMDS does not converge.
+
+
 ## Who to cite

 If you do use vAMPirus for your analyses, please cite the following ->

 1. vAMPirus - Veglia, A.J., Rivera Vicens, R., Grupstra, C., Howe-Kerr, L., and Correa A.M.S. (2020) vAMPirus: An automated virus amplicon sequencing analysis pipeline. Zenodo. *DOI:*

-2. Diamond - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176
+2. DIAMOND - Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59-60. doi:10.1038/nmeth.3176

 3. FastQC - Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

@@ -62,7 +58,7 @@ If you do use vAMPirus for your analyses, please cite the following ->

 7. ModelTest-NG - Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B., & Flouri, T. (2020). ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Molecular biology and evolution, 37(1), 291-294.

-8. 
MAFFT - Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
+8. muscle v5 - R.C. Edgar (2021) "MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping" https://www.biorxiv.org/content/10.1101/2021.06.20.449169v1.full.pdf

 9. vsearch - Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.

@@ -78,22 +74,24 @@ If you do use vAMPirus for your analyses, please cite the following ->

 15. UNOISE algorithm - R.C. Edgar (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257

+16. Oligotyping - A. Murat Eren, Gary G. Borisy, Susan M. Huse, Jessica L. Mark Welch (2014). Oligotyping analysis of the human oral microbiome. Proceedings of the National Academy of Sciences Jul 2014, 111 (28) E2875-E2884; DOI: 10.1073/pnas.1409644111
+
 # Getting started with vAMPirus

-## Order of operations
+## General order of operations

 1. Clone vAMPirus from github

-2. Before launching the vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases
+2. Before launching the vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases (NOTE: You will need to have the xz program installed before running the startup script when downloading the RVDB database)

-3. Test the vAMPirus installation with the provided test dataset (if you have ran the start up script, you can see STARTUP_HELP.txt for test commands and other examples)
+3. Test the vAMPirus installation with the provided test dataset (if you have run the startup script, you can see EXAMPLE_COMMANDS.txt in the vAMPirus directory for test commands and other examples)

 4. Edit parameters in vampirus.config file

-5. Launch the DataCheck pipeline to get summary information about your dataset
+5. Launch the DataCheck pipeline to get summary information about your dataset (e.g. sequencing success, read quality information, clustering behavior of ASVs or AminoTypes)

-6. Change any parameters in vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length)
+6. Change any parameters in vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length, Shannon entropy analysis results)

 7. Launch the Analyze pipeline to perform a comprehensive analysis with your dataset

@@ -108,7 +106,7 @@ All you will need to do is set up the subsystem with whatever flavor of Linux yo

 Search for Linux in the Microsoft store -> https://www.microsoft.com/en-us/search?q=linux

-It should be noted that vAMPirus was developed on Centos7/8.
+It should be noted that vAMPirus has been mainly tested on Centos7/8.

 Here are some brief instructions for setting up Ubuntu 20.04 LTS on Windows 10 sourced from https://ubuntu.com/tutorials/ubuntu-on-windows#1-overview

@@ -152,7 +150,7 @@ Next we will install Java to be able to run Nextflow -

     sudo apt -y install openjdk-8-jre

-Now you have everything you need to get started as described in the * [Installing vAMPirus](#Installing-vAMPirus) section. However, here are some quick commands to install and set up Conda.
+Now you have everything you need to get started as described in the [Installing vAMPirus](#Installing-vAMPirus) section. However, here are some quick commands to install and set up Miniconda.
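For reference, installing Miniconda3 on Linux typically looks like the following (the installer URL is the standard one published by the Conda project and may change; check https://docs.conda.io for the current link):

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    source ~/.bashrc
    conda --version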
NOTE=> Windows WSL currently can not run Singularity so you will have to install and run vAMPirus with Conda. @@ -175,14 +173,14 @@ You can check/confirm you have conda ready to go -> conda init -Once you have have your Conda ready, you can execute the vAMPirus startup script to install Nextflow and build the vAMPirus conda environment. +Once you have your Conda ready, you can execute the vAMPirus startup script to install Nextflow and build the vAMPirus conda environment. ## MacOS users If you plan to run vAMPirus on a Mac computer, it is recommended that you set up a virtual environment and use Singularity with the vAMPirus Docker image to run your analyses. -You can try to run directly on your system, but there may be errors caused by differences between Apply and GNU versions of tools like "sort". +You can try to run directly on your system, but there may be errors caused by differences between Apple and GNU versions of tools like "sort". vAMPirus was developed on a Centos7/8 operating system so we will go through how to set up a Centos7 Vagrant virtual environment with Virtual Box. @@ -213,7 +211,7 @@ Once done with the full installation, execute: (Information from https://treehouse.github.io/installation-guides/mac/homebrew) -Now we should be good to use Homebrew to install Vagrant and VirtualBox to set up the VPN. +Now we should be good to use Homebrew to install Vagrant and VirtualBox to set up the VM. #### Install Vagrant and Virtual Box @@ -244,7 +242,7 @@ Virtual Box (WILL CAUSE ERROR IF ORACLE NOT GIVEN PERMISSION TO INSTALL DEPENDEN NOTE=> In this part of the setup, you might get an error saying Oracle was denied permission to install programs. You will need to go to System Preferences->Security and Privacy->General and allow Oracle permission to download programs and then rerun the above command. -Alright, if you notice no errors during installation, you should be good to go and create the Centos 7 environment +Alright, if you notice no errors during installation, you should be good to go and create the Centos7 environment #### Building and starting your virtual environment @@ -281,6 +279,7 @@ We will make our own that looks like this: yum -y install epel-release yum -y install htop yum -y install nano + yum -y install xz git clone https://github.com/Aveglia/vAMPirus.git yum -y install singularity SHELL @@ -317,7 +316,7 @@ You should now be in your fresh Centos7 virtual environment and if you ls you wi You can now follow the normal directions for setting up vAMPirus with singularity. -But here is the quick overview of recommended next steps: +But here is the quick overview of recommended next steps (without database/taxonomy install): cd ./vAMPirus; bash vampirus_startup.sh -s @@ -327,7 +326,7 @@ After running the above you should now have Nextflow installed. 
Now, build the S then test the Analyze pipeline with: - ./nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats + ./nextflow run vAMPirus.nf -c vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --asvMED --aminoMED --stats Please check out http://sourabhbajaj.com/mac-setup/Vagrant/README.html and https://www.vagrantup.com/docs/providers/virtualbox for understanding how to use Vagrant commands like "halt", "suspend" or "reload" @@ -393,7 +392,7 @@ You will also need to decide if you plan to use a container engine like Docker ( The startup script provided in the vAMPirus program directory will install Conda for you if you tell it to (see below), however, you will need to install Docker or Singularity separately before running vAMPirus. -## Setting up vAMPirus dependencies and checking installation +### Setting up vAMPirus dependencies and checking installation To set up and install vAMPirus dependencies, simply move to the vAMPirus directory and run the vampirus_startup.sh script. @@ -412,30 +411,33 @@ You can also use the startup script to install different databases to use for vA 2. The proteic version of the Reference Viral DataBase (RVDB) (See https://f1000research.com/articles/8-530) 3. The complete NCBI NR protein database -To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option. -If we look at the script usage: +To use the vampirus_startup.sh script to download any or all of these databases listed above you just need to use the "-d" option and you can download the NCBI taxonomy files with the option "-t" (See below). - General execution: +If we take a look at the vampirus_startup.sh script usage: - vampirus_startup.sh -h [-d 1|2|3|4] [-s] +General execution: - Command line options: +vampirus_startup.sh -h [-d 1|2|3|4] [-s] [-t] - [ -h ] Print help information + Command line options: - [ -d 1|2|3|4 ] Set this option to create a database directiory within the current working directory and download the following databases for taxonomy assignment: + [ -h ] Print help information - 1 - Download the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530) - 2 - Download only NCBIs Viral protein RefSeq database - 3 - Download only the complete NCBI NR protein database - 4 - Download all three databases + [ -d 1|2|3|4 ] Set this option to create a database directiory within the current working directory and download the following databases for taxonomy assignment: - [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container) + 1 - Download only the proteic version of the Reference Viral DataBase (See the paper for more information on this database: https://f1000research.com/articles/8-530) + 2 - Download only NCBIs Viral protein RefSeq database + 3 - Download only the complete NCBI NR protein database + 4 - Download all three databases + [ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container) -For example, if you would like to install Nextflow, download NCBIs Viral protein RefSeq database, and check/install conda, run: + [ -t ] Set this option to download NCBI taxonomy files needed for DIAMOND to assign taxonomic classification to sequences (works with NCBI type databases 
only, see manual for more information)

-For example, if you would like to install Nextflow, download NCBIs Viral protein RefSeq database, and check/install conda, run:

-    bash vampirus_startup.sh -d 2
+
+For example, if you would like to install Nextflow, download NCBIs Viral Protein RefSeq database, the NCBI taxonomy files to use the DIAMOND taxonomy assignment feature, and check/install conda, run:
+
+    bash vampirus_startup.sh -d 2 -t

 and if we wanted to do the same thing as above but skip the Conda check/installation, run:

@@ -443,29 +445,37 @@ and if we wanted to do the same thing as above but skip the Conda check/installa

 NOTE -> if you end up installing Miniconda3 using the script you should close and re-open the terminal window after everything is completed.

+**NEW in version 2.0.0** -> the startup script will automatically download annotation information from RVDB to infer Lowest Common Ancestor (LCA) information for hits during taxonomy assignment. You can also use "-t" to download NCBI taxonomy files to infer taxonomy using the DIAMOND taxonomy classification feature.
+
 ### Databases

-It should be noted, that any protein database can be used, but it needs to be in fasta format and the headers for reference sequences need to match
-one of two patterns:
+Any protein database can be used while running vAMPirus; however, it needs to be in fasta format and the headers for reference sequences need to match one of two patterns:

-RVDB format (default) -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"
+RVDB format -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"

 NCBI NR/RefSeq format -> ">KJX92028.1 hypothetical protein TI39_contig5958g00003 [Zymoseptoria brevis]"

-During Taxonomy Inference, vAMPirus infers results by extracting the information stored in the reference sequence headers. If the database sequence headers do not match these
-patterns, you are bound to see errors in the naming of files created during the Taxonomy Inference phase of vAMPirus.
+To tell vAMPirus which header format the reference database uses, edit line 122 of the vampirus.config file: set "dbtype="NCBI"" for NCBI header format or "dbtype="RVDB"" for RVDB format.
+
+An example of custom headers in RVDB format if you plan to use a custom database:

-The default is that vAMPirus assumes that the database headers are in RVDB format, to change this assumption, you would need to edit the configuration file at line 78 where "refseq=F". You could also signal the use of RefSeq format headers within the launch command with adding "--refseq T".
+
+    `>acc|Custom|VP100000.1|Custom|VP100000|capsid protein [T4 Phage isolate 1]`
+    AMINOACIDSEQUENCE
+    `>acc|Custom|VP100000.2|Custom|VP100000|capsid protein [T4 Phage isolate 2]`
+    AMINOACIDSEQUENCE
+    `>acc|Custom|VP2000.1|Custom|VP2000|capsid protein [T7 phage isolate]`
+    AMINOACIDSEQUENCE

-An example of custom headers if you plan to use a custom database:
+Or in NCBI format the same sequences would be:
+
+    `>VP100000.1 capsid protein [T4 Phage isolate 1]`
+    AMINOACIDSEQUENCE
+    `>VP100000.2 capsid protein [T4 Phage isolate 2]`
+    AMINOACIDSEQUENCE
+    `>VP2000.1 capsid protein [T7 phage isolate]`
+    AMINOACIDSEQUENCE

-    `>acc|Custom|VP100000.1|Custom|VP100000|capsid protein [T4 Phage isolate 1]`
-    AMINOACIDSEQUENCE
-    `>acc|Custom|VP100000.2|Custom|VP100000|capsid protein [T4 Phage isolate 2]`
-    AMINOACIDSEQUENCE
-    `>acc|Custom|VP2000.1|Custom|VP2000| capsid protein [T7 phage isolate]`
-    AMINOACIDSEQUENCE

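A quick way to spot-check which of the two patterns your database follows before setting "dbtype" (a hypothetical one-liner; substitute your own path):

    grep ">" /PATH/TO/database.fasta | head -n 3

Headers beginning with ">acc|" indicate the RVDB format (dbtype="RVDB"); headers beginning with an accession followed by a protein description indicate the NCBI format (dbtype="NCBI").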
@@ -586,7 +596,7 @@ Now that we have an understanding on how to deploy vAMPirus with Nextflow, let's ## The Nextflow monitoring screen -When submitting a launch command to start your vAMPirus run, Nextflow will spit out something that looks like this: +When submitting a launch command to start your vAMPirus run, Nextflow will spit out something that looks like this (example below from older verions): executor > local (57) [8a/75e048] process > Build_database [100%] 1 of 1 ✔ @@ -607,15 +617,33 @@ When submitting a launch command to start your vAMPirus run, Nextflow will spit [26/1143ba] process > combine_csv_DC [100%] 1 of 1 ✔ [2e/e5fea3] process > Report_DataCheck [100%] 1 of 1 ✔ -Nextflow allows for interactive monitoring of submitted workflows, so in this example, we see the left column containing working directories for each process being executed, next to that we see the process name, and the final column on the right contains the status and success of each process. In this example each process has been executed successfully and has been cached. The amazing thing about Nextflow is that say you received an error or decide you would like to change a parameter/add a type of clustering you can use the Nextflow "-resume" option that means you don't re-run already completed processes. +Nextflow allows for interactive monitoring of submitted workflows, so in this example, we see the left column containing working directories for each process being executed, next to that we see the process name, and the final column on the right contains the status and success of each process. In this example each process has been executed successfully and has been cached. + +You can also remotely monitor a run using Nextflow tower (see tower.nf) which will allow you to monitor your run (and even launch new runs) from a portal on your browser. + +## The Nextflow "-resume" feature + +The amazing thing about Nextflow is that it caches previously run processes done with the same samples. For example, in the case that you received an error during a run or run out of walltime on an HPC, you can just add "-resume" to your Nextflow launch command like so: + + nextflow run vAMPirus.nf -c vampirus.config -profile [conda,singularity] --Analyze|DataCheck -resume + +Nextflow will then pick up where the previous run left off. + +You can even use this feature to change a parameter/add a type of clustering or add Minimum Entropy Decomposition analysis, you would just rerun the same command as above with minor changes: + + nextflow run vAMPirus.nf -c vampirus.config -profile [conda,singularity] --Analyze|DataCheck --ncASV --asvMED -resume + +With this command, you will then add nucleotide-level clustering of ASVs and Minimum Entropy Decomposition analyses to your results directory, all without rerunning any processes. ## Understanding the vAMPirus config file and setting parameters ### The configuration file (vampirus.config) -Nextflow deployment of vAMPirus relies on the use of the configuration file (vampirus.config) that is found in the vAMPirus program directory. The configuration file is a great way to store parameters/options used in your analyses. It also makes it pretty easy to set and keep track of multiple parameters as well as storing custom default values that you feel work best for your data. You can also have multiple copies of vAMPirus configuration files with different parameters, you would just have to specify the correct file with the "-c" argument shown in the section before. 
+Nextflow deployment of vAMPirus relies on the use of the configuration file (vampirus.config - it can be renamed to anything as long as it is specified in the launch command) that is found in the vAMPirus program directory. The configuration file is a great way to store parameters/options used in your analyses. It also makes it pretty easy to set and keep track of multiple parameters, as well as storing custom default values that you feel work best for your data. You can also have multiple copies of vAMPirus configuration files with different parameters; you would just have to specify the correct file with the "-c" argument shown in the section before.
+
+Furthermore, the configuration file contains analysis-specific parameters AND resource-specific Nextflow launching parameters. A benefit of Nextflow integration is that you can run the vAMPirus workflow on a large HPC just as easily as you could on your local machine.

-Furthermore, the configuration file contains analysis-specific parameters AND resource-specific Nextflow launching parameters. A benefit of Nextflow integration, is that you can run the vAMPirus workflow on a large HPC just as easily as you could on your local machine. If you look at line 151 and greater in the vampirus.config file, you will see resource-specific parameters that you can alter before any run. Nexflow is capable of submitting jobs automatically using slurm and PBS, check out the Nextflow docs to learn more (https://www.nextflow.io/docs/latest/executor.html)!
+If you look at line 233 and greater in the vampirus.config file, you will see resource-specific parameters that you can alter before any run. Nextflow is capable of submitting jobs automatically using SLURM and PBS; check out the Nextflow docs to learn more (https://www.nextflow.io/docs/latest/executor.html)!

### Setting parameter values

@@ -625,22 +653,20 @@ There are two ways to set parameters with Nextflow and vAMPirus:

Here we have a block from the vampirus.config file that stores information related to your run:

-    // Project/analyses- specific information
+    // Project specific information
+
+    // Project name - Name that will be used as a prefix for naming files by vAMPirus
-    projtag="vAMPrun"
+    projtag="vAMPirusAnalysis"
     // Path to metadata spreadsheet file to be used for plot
-    metadata="/PATH/TO/metadata.csv"
-    // Minimum number of hit counts for a sample to have to be included in the downstream analyses and report generation
-    minimumCounts="1000"
-    // PATH to current working directory
-    mypwd="/PATH/TO/working_directory"
-    email="your_email@web.com"
-    // reads directory
-    reads="/PATH/TO/reads/R{1,2}_001.fastq.gz"
-    // Directory to store output of vAMPirus analyses
+    metadata="/PATH/TO/vampirus_meta.csv"
+    // reads directory, must specify the path with "R{1,2}" for reads to be properly read by Nextflow
+    reads="/PATH/TO/reads/"
+    // PATH to working directory of your choosing, will automatically be set to vAMPirus installation
+    workingdir="VAMPDIR"
+    // Name of directory created to store output of vAMPirus analyses (Nextflow will create this directory in the working directory)
     outdir="results"

-The first one in the block is the project tag or "projtag" which by default, if unchanged, will use the prefix "vAMPrun".
To change this value, and any other parameter value, just edit right in the configuration file so if you wanted to call the run "VirusRun1" you would edit the line to:
+The first one in the block is the project tag or "projtag" which by default, if unchanged, will use the prefix "vAMPirusAnalysis". To change this value, and any other parameter value, just edit it right in the configuration file; so if you wanted to call the run "VirusRun1" you would edit the line to:

    // Project/analyses- specific information
    // Project name - Name that will be used as a prefix for naming files by vAMPirus
@@ -649,19 +675,19 @@ The first one in the block is the project tag or "projtag" which by default, if

2. Set the value within the launch command itself:

-Instead of editing the configuration file directly, you could set parmeters within the launching command itself. So, for example, if we wanted to run the analysis with nucletide-based clustering of ASVs at 95% similarity, you would do so like this:
+Instead of editing the configuration file directly, you could set parameters within the launch command itself. So, for example, if we wanted to run the analysis with nucleotide-based clustering of ASVs at 95% similarity, you would do so like this:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --clusterNuclID .95

-Here we use the "--Analyze" option that tells vAMPirus that we are ready to analyze soem data. Then the "--ncASV" argument with the "--clisterNuclID .95" tells vAMPirus we would like to cluster our ASVs based on 95% nucleotide similarity. The default ID value is stored at line 51 in the vampirus.config file (currently 85%), but as soon as you specify and provide a value in the command, the default value is overwritten.
+Here we use the "--Analyze" option that tells vAMPirus that we are ready to analyze some data. Then the "--ncASV" argument with "--clusterNuclID .95" tells vAMPirus we would like to cluster our ASVs based on 95% nucleotide similarity. The default ID value is stored at line 66 in the vampirus.config file (currently 85%), but as soon as you specify and provide a value in the command, the value within the config file is ignored.

-NOTE: Nextflow also has options in the launch command. To tell them apart, Nextflow options uses a single dash (e.g. -with-conda) while vAMPirus options are always with a double dash (e.g. --Analyze)
+NOTE: Nextflow also has options in the launch command. To tell them apart, Nextflow options use a single dash (e.g. -with-conda or -profile) while vAMPirus options always use a double dash (e.g. --Analyze)

-### Setting computing resource parameters - Edit in lines 151-171 in vampirus.config
+### Setting computing resource parameters - Edit lines 241-261 in vampirus.config

Each process within the vAMPirus workflow is tagged with either "low_cpus", "norm_cpus", or "high_cpus" (see below) which lets Nextflow know the amount of cpus and memory required for each process, which will then be used when Nextflow submits a given job or task. Nextflow actively assesses the amount of available resources on your machine and will submit tasks only when the proper amount of resources can be requested.

-From line 203-217 in the vAMPirus.config file is where you can edit these values for whichever machine you plan to run the workflow on.
+Lines 241-261 of the vAMPirus.config file are where you can edit these values for whichever machine you plan to run the workflow on.
process {
    withLabel: low_cpus {

@@ -694,7 +720,7 @@ As stated before, you can launch vAMPirus on either your personal laptop OR a la

To specify certain parts of the vAMPirus workflow to perform in a given run, you can use skip options to have vAMPirus ignore certain processes. Here are the current skip options you can specify within the launch command:

        // Skip options
-        // Skip all Read Processing
+        // Skip all Read Processing steps
        skipReadProcessing=false
        // Skip quality control processes only
        skipFastQC = false
@@ -712,6 +738,8 @@ To specify certain parts of the vAMPirus workflow to perform in a given run, you
        skipEMBOSS = false
        // Skip Reports
        skipReport = false
+        // Skip Merging steps
+        skipMerging = false

To utilize these skip options, just add them to the launch command like so:

@@ -734,7 +762,7 @@ With this launch command, vAMPirus will perform ASV generation and nucleotide-ba

Once you have everything set up and you have edited the parameters of interest in your configuration file, you can run the following launch command for a full analysis:

-    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats

This launch command will run all aspects of the vAMPirus workflow on your data and spit out final reports for each clustering %ID and technique.

@@ -742,7 +770,7 @@ This launch command will run all aspects of the vAMPirus workflow on your data a

### Sequencing reads

-Input can be raw or processed compressed or non-compressed fastq files with names containing "\_R1" or "\_R2". You can specify the directory containing your reads in line 25 of the vampirus.config file.
+Input can be raw or processed fastq files, compressed or uncompressed, with names containing "\_R1" or "\_R2". You can specify the directory containing your reads in line 20 of the vampirus.config file.

NOTE: Sample names are extracted from read library names by using the string to the left of the "\_R" in the filename automatically.

@@ -781,7 +809,7 @@ Usage example:

The DataCheck feature of vAMPirus is meant to give the user some information about their data so they can tailor their final analysis appropriately. In DataCheck mode, vAMPirus performs all read processing operations, then generates ASVs and performs nucleotide- and protein-based clustering at 24 different clustering percentages ranging from 55-99% ID. vAMPirus then generates an html report that displays and visualizes read processing and clustering stats. It is recommended that before running any dataset through vAMPirus, you run the data through the DataCheck.

-Here is how Nextflow will display the list of processes vAMPirus will execute during DataCheck (executed with the launch command above):
+Here is how Nextflow will display the list of processes vAMPirus will execute during DataCheck (executed with the launch command above; below is an example from an older version of vAMPirus):

    executor > local (57)
    [8a/75e048] process > Build_database [100%] 1 of 1 ✔
@@ -805,6 +833,8 @@ Here is how Nextflow will display the list of processes vAMPirus will execute du

Every time you launch vAMPirus with Nextflow, you will see this kind of output that refreshes with the status of the different processes during the run.

+**NOTE**: Add "--asvMED" or "--aminoMED" to the launch command above to get Shannon entropy analysis results for ASVs and AminoTypes.
+
2.
"--Analyze" @@ -901,6 +931,7 @@ Here is what the Nextflow output would look like for this launch command: You can see that there are a few more processes now compared to the output of the previous launch command which is what we expect since we are asking vAMPirus to do a little bit more work for us :). + # Breaking it down: The vAMPirus workflow ## Read processing @@ -921,21 +952,21 @@ This is the default action of vAMPirus if no primer sequences are provided, to s nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --GlobTrim 23,26 -The command above is telling vAMPirus to have bbduk.sh remove primers by trimming 23 bases from the forward reads and 26 bases from the reverse reads. The other way to initiate this method of primer removal is to add the same information at line 38 in the configuration file: +The command above is telling vAMPirus to have bbduk.sh remove primers by trimming 23 bases from the forward reads and 26 bases from the reverse reads. The other way to initiate this method of primer removal is to add the same information at lime 38 in the configuration file: // Primer Removal parameters // If not specifying primer sequences, forward and reverse reads will be trimmed by number of bases specified using --GlobTrim #basesfromforward,#basesfromreverse GlobTrim="23,26" -By adding the information to line 38, vAMPirus will automatically use this method and these parameters for primer removal until told otherwise. +By adding the information to lime 38, vAMPirus will automatically use this method and these parameters for primer removal until told otherwise. If you want to change the number of bases without editing the configuration file, all you would need to do is then specify in the launch command with "--GlobTrim 20,27" and vAMPirus will ignore the "23,26" in the configuration file. -NOTE: Specifying global trimming by editing line 38 in the config file or using "--GlobTrim" in the launch command will also override the use of primer sequences for removal if both are specified +NOTE: Specifying global trimming by editing lime 38 in the config file or using "--GlobTrim" in the launch command will also override the use of primer sequences for removal if both are specified 2. Primer removal by specifying primer sequences - -You can tell vAMPirus to have bbduk.sh search for and remove either a single primer paire or multiple. +You can tell vAMPirus to have bbduk.sh search for and remove either a single primer pair or multiple. In the case where you are using a single primer pair, similar to the previous method, you could edit the configuration file or specify within the launch command. 
@@ -952,11 +983,11 @@ The primer sequences could also be stored in the configuration file in lines 43-

    // Reverse primer sequence
    rev="REVPRIMER"

-If you have multiple primer sequences to be removed, all you need to do is provide a fasta file with your primer sequences and signla multiple primer removal using the "--multi" option like so:
+If you have multiple primer sequences to be removed, all you need to do is provide a fasta file with your primer sequences and signal multiple primer removal using the "--multi" option like so:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --multi --primers /path/to/primers.fa

-You can set the path to the primer sequence fasta file within the launch command above or you can have it in the configuration file at line 44:
+You can set the path to the primer sequence fasta file within the launch command above or you can have it in the configuration file at line 44:

    // Path to fasta file with primer sequences to remove (need to specify if using --multi option )
    primers="/PATH/TO/PRIMERS.fasta"

@@ -975,26 +1006,31 @@ There are also a few other options you can change to best match what your data w

### Read merging and length filtering

-Read merging in the vAMPirus workflow is performed by vsearch and afterwards, reads are trimmed to the expected amplicon length (--maxLen) and any reads with lengths below the user specified minimum read length (--minLen). There are three parameters that you can edit to influence this segment of vAMPirus. If we look at lines 26-33:
+Read merging in the vAMPirus workflow is performed by vsearch and afterwards, reads are trimmed to the expected amplicon length (--maxLen) and any reads with lengths below the user specified minimum read length (--minLen) are discarded. There are five parameters that you can edit to influence this segment of vAMPirus. If we look at lines 26-33:

-    // Merged read length filtering parameters
-    // Minimum merged read length - reads below the specified maximum read length will be used for counts only
-    minLen="400"
-    // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs
-    maxLen="422"
-    // Maximum expected error for vsearch merge command
-    maxEE="1"
+    // Merged read length filtering parameters
+
+    // Minimum merged read length - reads with lengths greater than minLen and below the specified maximum read length will be used for counts only
+    minLen="400"
+    // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs (safe to set at expected amplicon size to start)
+    maxLen="420"
+    // Maximum expected error for vsearch merge command - vsearch will discard sequences with more than the specified number of expected errors
+    maxEE="3"
+    // Maximum number of non-matching nucleotides allowed in overlap region
+    diffs="20"
+    // Maximum number of "N"'s in a sequence - if above the specified value, sequence will be discarded
+    maxn="20"

The user can edit the minimum length (--minLen) for reads to be used for counts table generation, the maximum length (--maxLen) for reads used to generate uniques and subsequent ASVs, and the expected error rate (--maxEE) for the overlapping region of reads during read merging with vsearch. The values above are the defaults and should be edited before running your data with --Analyze.
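+As a sketch, these same merging parameters can also be overridden at launch instead of in the config file (the values below are illustrative only):
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --minLen 400 --maxLen 420 --maxEE 3 --stats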
This is where the DataCheck report is very useful: you can review the report to see the number of reads that merge per library, and you can edit the expected error value to be less stringent if needed. The DataCheck report also contains a read length distribution that you can use to select an ideal maximum/minimum read length.

-## Amplicon Sequence Variants, AminoTypes and Operational Taxonomic Units
+## Amplicon Sequence Variants, AminoTypes and Clustering

The goal of vAMPirus was to make it easy for the user to analyze their data in many different ways to potentially reveal patterns that would have been missed if limited to one method/pipeline.

-A major and sometimes difficult step in analyzing virus amplicon sequence data is deciding the method to use for identifying or defining different viral "species" in the data. To aid this process, vAMPirus has the DataCheck mode discussed above and has several different options for sequence clustering/analysis for the user to decide between.
+A major, and sometimes difficult, step in analyzing virus amplicon sequence data is deciding the method to use for identifying or defining different viral "species" in the data. To aid this process, vAMPirus has the DataCheck mode discussed above and has several different options for sequence clustering/analysis for the user to decide between.

vAMPirus relies on vsearch using the UNOISE3 algorithm to generate Amplicon Sequence Variants (ASVs) from dereplicated amplicon reads. ASVs are always generated by default and there are two parameters that the user can specify either in the launch command or by editing the configuration file at lines 45-49:

@@ -1010,14 +1046,47 @@ Launch command to produce only ASV-related analyses:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --skipAminoTyping

-Now, onto clustering ASVs into clustered ASVs (cASVs). vAMPirus is able to use two different techniques for generating cASVs for the user:
+### ASV filtering (experimental)
+
+New to version 2, you can now filter ASVs to remove sequences that belong to taxonomic groups that are not of interest for a given run.
+
+A great example of when this feature is useful comes from Prodinger et al. 2020 (https://www.mdpi.com/2076-2607/8/4/506). In this study, they looked to amplify and analyze Mimiviridae polB sequences; however, polB is also found in cellular genomes like bacteria. In this case, Prodinger et al. looked to avoid including any bacterial polB in their final results and thus used a filtering step to remove microbial sequences. The ASV filtering feature can be used to do exactly this type of filtering: you provide paths to a "filter database" containing sequences belonging to non-target groups (e.g. microbial polB) and a "keep database" containing sequences belonging to the target group (e.g. Mimiviridae polB). Any ASVs that match non-target sequences will then be filtered from the ASV file prior to running the DataCheck or Analyze pipeline.
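+A minimal sketch of enabling this filtering at launch, with placeholder database paths (the "--filter" toggle is named in the config comments below, and filtDB/keepDB can be passed on the command line like any other vAMPirus parameter):
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck --filter --filtDB /path/to/nontarget_sequences.fasta --keepDB /path/to/target_sequences.fasta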
+
+Here are the options stored within the configuration file:
+
+    // ASV filtering parameters - You can set the filtering to run with the command --filter
+
+
+    // Path to database containing sequences that if ASVs match to, are then removed prior to any analyses
+    filtDB=""
+    // Path to database containing sequences that if ASVs match to, are kept for final ASV file to be used in subsequent analyses
+    keepDB=""
+    // Keep any sequences without hits - for yes, set keepnohit="true"
+    keepnohit="true"
+
+    // Parameters for diamond command for filtering
+
+    // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment
+    filtminID="80"
+    // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment
+    filtminaln="30"
+    // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive)
+    filtsensitivity="ultra-sensitive"
+    // Set the max e-value for best hit to be recorded
+    filtevalue="0.001"
+
+
+### Clustering options
+
+Depending on the virus type/marker gene, ASV-level results can be noisy; to combat this, vAMPirus has three different approaches to clustering ASV sequences:

1. AminoTyping -

vAMPirus by default, unless the --skipAminoTyping option is set, will generate unique amino acid sequences or "AminoTypes" from generated ASVs. These AminoTypes, barring any skip options set, will run through all the same analyses as ASVs.

-vAMPirus will translate the ASVs with Virtual Ribosome and relies on the user to specify the expected or minimum amino acid sequence length (--minAA) to be used for AminoTyping and pcASV generation (discussed below). For example, if you amplicon size is ~422 bp long, you would expect the amino acid translations to be ~140. Thus, you would either edit the --minAA value to be 140 in the configuration file (line 69) or in the launch command.
+vAMPirus will translate the ASVs with Virtual Ribosome and relies on the user to specify the expected or minimum amino acid sequence length (--minAA) to be used for AminoTyping and pcASV generation (discussed below). For example, if your amplicon is ~422 bp long, you would expect the amino acid translations to be ~140 residues. Thus, you would either edit the --minAA value to be 140 in the configuration file (line 69) or set it in the launch command.

You can make it shorter if you would like, but based on personal observation, a shorter translation is usually the result of stop codons being present, which would usually be removed from subsequent analyses. If there are any sequences below the minimum amino acid sequence length, the problematic sequence(s) and their translations will be stored in a directory for you to review.

@@ -1071,10 +1140,55 @@ Example launch command:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --pcASV --clusterAAIDlist .85,.90,.96 --stats

+## Minimum Entropy Decomposition (EXPERIMENTAL) - Oligotyping - https://merenlab.org/2012/05/11/oligotyping-pipeline-explained/
+
+In vAMPirus v2, we added the ability for the user to use the oligotyping program employing the Minimum Entropy Decomposition (MED) algorithm developed by Eren et al. 2015 (read more about MED here - https://www.nature.com/articles/ismej2014195#citeas) to cluster ASV or AminoType sequences.
+
+The MED algorithm provides an alternative way of clustering marker gene sequences using "information theory-guided decomposition" - "By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation." -Eren et al. 2015
+
+When you run the DataCheck pipeline with your dataset, the report will include a figure and table that break down the Shannon entropy analysis results for both ASVs and AminoTypes. The figure visualizes entropy values per sequence position, revealing positions or regions of high entropy. The table beneath the figure breaks down the number of positions with entropy values above "0.x". However, if you know the positions on your sequence that have the potential to contain biologically or ecologically meaningful mutations, you can specify decomposition based on these positions.
+
+If you decide to use MED, vAMPirus will run all the same analyses that would be done with the ASV or AminoType sequences (e.g. diversity analyses, statistics) and append the results to the ASV or AminoType report. The ASV or AminoType sequence nodes on the phylogenetic tree will also be colored based on which MED group they were assigned to.
+
+To add MED analysis to either the DataCheck or Analyze run, you must add "--asvMED" and/or "--aminoMED" to the launch command (see examples below).
+
+There are two ways to utilize MED within the vAMPirus pipeline:
+
+    (1) Decomposition based on all sequence positions with an entropy value above "0.x" - a useful approach to preliminarily test the influence of MED on your sequences
+
+    Example -> The entropy value table from the DataCheck report shows I have 23 ASV sequence positions with Shannon entropy values above 0.1 and I would like to oligotype using all of these high entropy positions.
+
+    To use these 23 positions for MED clustering of ASVs, all I need to do is add the options "--asvMED" (signals use of MED on ASV sequences) and "--asvC 23" (specifies the number of high entropy positions to be used for MED - could also be done by editing "asvC="23"" in the config file) to the launch command:
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --asvMED --asvC 23
+
+    After this run completes successfully and I move the ASV report file to a safe place for review, if I wanted to see what happens when just the top 5 high entropy positions are used instead, I can use the "-resume" option and run:
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --asvMED --asvC 5 -resume
+
+
+    (2) Decomposition based on specific sequence positions that may contain biologically/ecologically meaningful differences.
+
+    Example -> I know that amino acid differences at certain positions on my AminoTypes are ecologically meaningful (e.g. correlate with host range) and I would like to perform MED with these positions only.
+
+    To do this, similar to the example above, I will add "--aminoMED" and "--aminoC" to the launch command:
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --aminoMED --aminoC 2,3,4,25,34,64 -resume
+
+MED related options within the configuration file:
+
+    // Minimum Entropy Decomposition (MED) parameters for clustering (https://merenlab.org/2012/05/11/oligotyping-pipeline-explained/)
+
+    // If you plan to do MED on ASVs using the option "--asvMED" you can set here the number of entropy peak positions or a comma separated list of biologically meaningful positions (e.g. 35,122,21) for oligotyping to take into consideration. vAMPirus will automatically detect a comma separated list of positions; however, if you want to use a single specific position, set "asvSingle="true"".
+    asvC=""
+    asvSingle=""
+    // If you plan to do MED on AminoTypes using the option "--aminoMED" you can set here the number of entropy peak positions or a comma separated list of biologically meaningful positions (e.g. 35,122,21) for oligotyping to take into consideration. vAMPirus will automatically detect a comma separated list of positions; however, if you want to use a single specific position, set "aminoSingle="true"".
+    aminoC=""
+    aminoSingle=""

## Counts tables and percent ID matrices

-vAMPirus generate nucleotide-based counts tables using vsearch and protein-based counts tables using DIAMOND and a custom bash script. Counts tables and percent ID matrices are always produced for each ASV, AminoType and all cASV fasta files produced.
+vAMPirus generates nucleotide-based counts tables using vsearch and protein-based counts tables using DIAMOND and a custom bash script. Counts tables and percent ID matrices are always produced for each ASV, AminoType and all cASV fasta files produced.

Here are the parameters you can edit at lines 61-70:

@@ -1093,13 +1207,19 @@ The "--asvcountID" is the percent ID during global alignment that vsearch would

Protein-based counts file generation has a few more parameters the user can alter: "--ProtCountsBit" is the minimum bitscore for an alignment to be recorded, "--ProtCountID" is the minimum percent amino acid similarity an alignment needs to have to be recorded, and "--ProtCountsLength" is the minimum alignment length for a hit to be recorded.

-
## Phylogenetic analysis and model testing

-Phylogenetic trees are produced automatically for ASVs (unless --ncASV specified), ncASVs, pcASVs and AminoTypes using IQ-TREE. All produced sequence fastas are aligned using the MAFFT algorithm then alignments are trimmed automatically using TrimAl.
+Phylogenetic trees are produced automatically for ASVs (unless --ncASV is specified), ncASVs, pcASVs and AminoTypes using IQ-TREE. All produced sequence fastas are aligned using the MAFFT algorithm and the alignments are then trimmed automatically using TrimAl.

Post alignment and trimming, there is some flexibility in this process where you can specify a few different aspects of the analysis:

+### Coloring nodes on produced trees in the Analyze report
+
+You can tell vAMPirus to color nodes on produced phylogenies based on taxonomy or Minimum Entropy Decomposition Group ID. Edit the option below in the config file.
+
+    // Color nodes on phylogenetic tree in Analyze report with MED Group information (nodeCol="MED") or taxonomy (nodeCol="TAX") hit. If you would like nodes colored by sequence ID, leave nodeCol="" below.
+    nodeCol=""
+
### Substitution model testing

ModelTest-NG is always run to determine the best substitution model and all of its output is stored for the user's review.

@@ -1115,12 +1235,13 @@ By default IQTREE will determine the best model to use with ModelFinder Plus.

### Bootstrapping

-IQ-TREE is capable of performing parametric or non-parametric bootstrapping. You can specify which one using "--parametric" or "--nonparametric" and to set how many boostraps to perform, you would use "--boots #ofbootstraps" or edit line 114 in the vampirus.config file.
+IQ-TREE is capable of performing parametric or non-parametric bootstrapping. You can specify which one using "--parametric" or "--nonparametric" and to set how many bootstraps to perform, you would use "--boots #ofbootstraps" or edit line 114 in the vampirus.config file.

-Here is an example for creating a tree using the model determined by ModelTest-NG, non-parametric boostrapping and 500 bootstraps:
+Here is an example of creating a tree using the model determined by ModelTest-NG, non-parametric bootstrapping and 500 bootstraps:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --pcASV --clusterAAIDlist .85,.90,.96 --ModelTnt --ModelTaa --nonparametric --boots 500

+
### Custom IQ-TREE command

The default IQ-TREE command looks like this:

@@ -1160,29 +1281,73 @@ AminoType IQTREE command ->

    iqtree -s aminotype_alignment.fasta --prefix TestRun --redo -T auto -option1 A -option2 B -option3 C

-
## Taxonomy Inference

-vAMPirus uses Diamond blastx/blastp and the provided protein database to assign taxonomy to ASVs/cASVs/AminoTypes. There are summary files generated, one as a phyloseq object and the other as a .tsv with information in a different
- arrangement. Results are also visualized as a donut graph in the final reports. You can adjust the following parameters:
+vAMPirus uses DIAMOND blastx/blastp and the provided protein database to infer the taxonomy of amplicons (ASVs/cASVs/AminoTypes). There are summary files generated, one in a format compatible with phyloseq and the other as a .tsv with the information in a different arrangement. Results are also visualized as a donut graph in the final reports.
+
+First, let's take a look at the taxonomy section of the configuration file:
+
+Block 1 -
+
+    // Taxonomy inference parameters
+
+    // Parameters for diamond command
+    // Set minimum bitscore for best hit in taxonomy assignment
+    bitscore="50"
+    // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment
+    minID="40"
+    // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment
+    minaln="30"
+    // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive)
+    sensitivity="ultra-sensitive"
+
+The first block of options is related to the DIAMOND command.
+
+Block 2 -
+
+    // Database information
+    // Specify name of database to use for analysis
+    dbname="DATABASENAME"
+    // Path to Directory where database is being stored - vAMPirus will look here to make sure the database with the name provided above is present and built
+    dbdir="DATABASEDIR"
+    // Set database type (NCBI or RVDB). Lets vAMPirus know which sequence header format is being used and must be set to NCBI when using RefSeq or Non-Redundant databases.
-> dbtype="NCBI" to toggle use of RefSeq header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers (see manual)
+    dbtype="TYPE"

-    // Taxonomy inference parameters
-    // Specify name of database to use for analysis
-    dbname="DATABASENAME.FASTA"
-    // Path to Directory where database is being stored
-    dbdir="/PATH/TO/DATABASE/DIRECTORY"
-    // Toggle use of RefSeq header format; default is Reverence Viral DataBase (RVDB)
-    refseq="F"
+The second block of options concerns the database that will be used for the analysis. "dbname" should be the name of the reference fasta file; for example, if using the RVDB, dbname would = "U-RVDBv21.0-prot.fasta". The "dbtype" option signals which sequence header format is being used in the reference database. To use the DIAMOND taxonomy assignment feature (see below), you must be using NCBI-style sequence headers.

-This was mentioned in an earlier section of the docs, but before running vAMPirus without the "--skipTaxonomy" option set, you should add the name and path of the database you would like to use at lines 74 and 76 in the config file,
- respectively. The database, currently, needs to be protein sequences and be in fasta format. The database can be a custom database but for proper reporting of the results, the headers need to follow either RVDB or RefSeq
- header formats:
+The database, currently, needs to contain protein sequences and be in fasta format. The database can be a custom database, but for proper reporting of the results, the headers need to follow either RVDB or NCBI/RefSeq header formats:

1. RVDB format (default) -> ">acc|GENBANK|AYD68780.1|GENBANK|MH171300|structural polyprotein [Marine RNA virus BC-4]"

2. NCBI NR/RefSeq format -> ">KJX92028.1 hypothetical protein TI39_contig5958g00003 [Zymoseptoria brevis]"

-NOTE: By default, vAMPirus assumes the headers are in RVDB format, to trigger the use of NCBI RefSeq format, edit the "F" to "T" at line 78 in the config file or add "--refseq T" to the launch command.
+Block 3 -
+
+    // Classification settings - if planning on inferring LCA from RVDB annotation files OR using NCBI taxonomy files, confirm options below are accurate.
+    // Path to directory containing the RVDB hmm annotation .txt file - see manual for information on this. Leave as is if not planning on using RVDB LCA.
+    dbanno="DATABASEANNOT"
+    // Set lca="T" if you would like to add "Lowest Common Ancestor" classifications to taxonomy results using information provided by RVDB annotation files (works when using NCBI or RVDB databases) - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"
+    lca="LCAT"
+    // DIAMOND taxonomy inference using NCBI taxmap files (can be downloaded using the startup script with the option -t); set to "true" for this to run (ONLY WORKS WITH dbtype="NCBI")
+    ncbitax="false"
+
+The third block of options concerns the two different methods that can be used to get putative taxonomic classifications for your sequences:
+
+1. Grabbing "Lowest Common Ancestor" (LCA) information from the annotation files associated with the Reference Viral DataBase (RVDB; https://rvdb-prot.pasteur.fr/).
+
+By default, these annotation files are downloaded when you use the startup script to download any of the three possible databases. The "dbanno" variable refers to the path to the annotation files; if using the startup script, this will be set automatically.
+
+The LCA feature works by searching the RVDB annotation files for the accession number of the aligned-to reference sequence, so this feature can be used when "dbtype" equals either "NCBI" or "RVDB". Please note, however, that the aligned-to reference sequence might not be found in the annotation files, and some sequences may not have LCA information available within their annotation entries. This is a quick way to assign some degree of classification that might help in future binning or figure making.
+
+You can turn on this feature of the pipeline by editing line 121 in the vampirus.config file, setting "lca="T"", or you can set this in the launch command like so:
+
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --stats --lca T
+
+2. Using the NCBI taxonomy files (https://www.ncbi.nlm.nih.gov/taxonomy ; https://www.ncbi.nlm.nih.gov/books/NBK53758/) and the DIAMOND taxonomy assignment feature.
+
+You can tell vAMPirus to use the DIAMOND taxonomy assignment feature, which leverages the taxonomy identifier (TaxId) of aligned-to reference sequences. It will then add putative taxonomic information to your sequences in the *quick_TaxBreakdown.csv file stored in the vAMPirus results directory.
+
+NOTE=> To use the DIAMOND taxonomy feature and the NCBI taxonomy files you must be using the NCBI header format. So, this feature will only be used when "dbtype="NCBI"" and "ncbitax="true"" within the configuration file.

## EMBOSS Analyses

@@ -1208,19 +1373,15 @@ NOTE=> Be sure that there is more than 1 sample in each treatment category or th

# vAMPirus output

-There are several files created throughout the vAMPirus pipeline that are stored and organized in directories within the specified results/output directory (ex. ${working_directory}/results; line 27 in the configuration file). We will go through the structure of the output directory and where to find which files here:
+There are several files created throughout the vAMPirus pipeline that are stored and organized in directories within the specified results/output directory (ex. ${working_directory}/results; line 24 in the configuration file). We will go through the structure of the output directory and where to find which files here:

-## Pipeline performance information - ${working_directory}/results/PipelinePerformance/
+## Pipeline performance information - ${working_directory}/${outdir}/PipelinePerformance/

Nextflow produces a couple of files that break down how and which parts of the vAMPirus pipeline were run. The first file is a report that contains information on how the pipeline performed, along with other pipeline performance-related information like how much memory or how many CPUs were used by certain processes. You can use this information to alter how many resources you would request for a given task in the pipeline. For example, the counts table generation process may take a long time with the current amount of resources requested; you can see this in the report and then edit the resources requested at lines 144-183 in the vAMPirus configuration file. The second file produced by Nextflow is just the visualization of the workflow and analyses run, as a flowchart.

-## Output of "--DataCheck" - ${working_directory}/results/DataCheck
-
-The DataCheck performed by vAMPirus includes "ReadProcessing", "Clustering", and "Report" generation.
Here again is the launch command to run the DataCheck mode:
-
-    `nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+## Output of read processing - ${working_directory}/${outdir}/ReadProcessing

-### ReadProcessing - ${working_directory}/results/DataCheck/ReadProcessing
+### ReadProcessing - ${working_directory}/${outdir}/ReadProcessing

Within the ReadProcessing directory you will find all files related to each step of read processing:

@@ -1240,7 +1401,13 @@ Similar to the adapter removal directory, here you have the clean read libraries

There is a little bit more going on in this directory compared to the others. The first major file to pay attention to here is the file \*\_merged_clean_Lengthfiltered.fastq. This is the "final" merged read file that contains all merged reads from each sample and is used to identify unique sequences and then ASVs. "Pre-filtered" and "pre-cleaned" combined merged read files can be found in "./LengthFiltering". If you would like to review or use the separate merged read files per sample, these fastq files are found in the "./Individual" directory. Finally, a fasta file with unique sequences is found in the "./Uniques" directory and the "./Histograms" directory is full of several different sequence property (length, per base quality, etc.) histogram files which can be visualized manually and reviewed in the DataCheck report.

-### Clustering - ${working_directory}/results/DataCheck/Clustering
+## Output of "--DataCheck" - ${working_directory}/${outdir}/DataCheck
+
+The DataCheck performed by vAMPirus includes "ReadProcessing", "Clustering", and "Report" generation. Here again is the launch command to run the DataCheck mode:
+
+    `nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+
+### Clustering - ${working_directory}/${outdir}/DataCheck/Clustering

As the name would suggest, the files within this directory are related to the clustering process of "--DataCheck". There isn't too much in here, but here is the breakdown anyway:

1. ASVs - ${working_directory}/results/DataCheck/Clustering/ASVs

In this directory, there is the fasta file with the generated ASV sequences and there is another directory "./ChimeraCheck" where the pre-chimera-filtered ASV fasta sits.

2. Nucleotide - ${working_directory}/results/DataCheck/Clustering/Nucleotide

-This directory stores a .csv file that shows the number of clusters or ncASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report.
+This directory stores a .csv file that shows the number of clusters or ncASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report. In this directory you will also find Shannon Entropy analysis results files.

3. Aminoacid - ${working_directory}/results/DataCheck/Clustering/Aminoacid

-Similar to Nucleotide, the Aminoacid directory contained the .csv that shows the number of clusters or pcASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report.
+Similar to Nucleotide, the Aminoacid directory contains the .csv that shows the number of clusters or pcASVs per clustering percentage. The file can be visualized manually or can be reviewed in the DataCheck report. In this directory you will also find Shannon Entropy analysis results files.

-### Report - ${working_directory}/results/DataCheck/Report
+### Report - ${working_directory}/${outdir}/DataCheck/Report

In this directory, you will find a .html DataCheck report that can be opened in any browser.
The report contains the following information and is meant to allow the user to tailor their vAMPirus pipeline run to their data (i.e. maximum read length, clustering percentage, etc.):

@@ -1275,59 +1442,39 @@ In this section of the report, vAMPirus is showing the number of nucleotide- and

NOTE: Most, if not all, plots in vAMPirus reports are interactive, meaning you can select and zoom on certain parts of the plot or you can use the legend to remove certain samples.

-## Output of "--Analyze" - ${working_directory}/results/Analyze
+## Output of "--Analyze" - ${working_directory}/${outdir}/Analyze

Depending on which optional arguments you add to your analyses (e.g. --pcASV, --ncASV, skip options), you will have different files produced; here we will go through the output of the full analysis stemming from this launch command:

    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats

-### ReadProcessing - ${working_directory}/results/Analyze/ReadProcessing
-
-Very similar to the "ReadProcessing" directory created in DataCheck, you will find the following:
-
-1. FastQC - ${working_directory}/results/Analyze/ReadProcessing/FastQC
-
-In this directory you will find FastQC html reports for pre-cleaned and post-cleaned individual read libraries.
-
-2. AdapterRemoval - ${working_directory}/results/Analyze/ReadProcessing/AdapterRemoval
-
-Here we have resulting fastq files with adapter sequences removed. Fastp also generates its own reports which can also be found in "./fastpOut".
-
-3. PrimerRemoval - ${working_directory}/results/Analyze/ReadProcessing/PrimerRemoval
-
-Similar to the adapter removal directory, here you have the clean read libraries that have had adapter and primer sequences removed.
-
-4. ReadMerging - ${working_directory}/results/Analyze/ReadProcessing/ReadMerging
-
-There is a little bit more going on in this directory compared to the others. The first major file to pay attention to here is the file \*\_merged_clean_Lengthfiltered.fastq. This is the "final" merged read file that contains all merged reads from each samples and is used to identify unique sequences and then ASVs. "Pre-filtered" and "pre-cleaned" combined merged read files can be found in "./LengthFiltering". If you would like to review or use the separate merged read files per sample, these fastq files are found in the "./Individual" directory. Finally, a fasta file with unique sequences are found in the "./Uniques" directory and the "./Histograms" directory is full of several different sequence property (length, per base quality, etc.) histogram files which can be visualized manually and reviewed in the DataCheck report if ran before Analyze.
-
-### Clustering - ${working_directory}/results/Analyze/Clustering
+### Clustering - ${working_directory}/${outdir}/Analyze/Clustering

The clustering directory will contain all files produced for whichever clustering technique you specified (with the launch command above, all are specified):

-1. ASVs - ${working_directory}/results/Analyze/Clustering/ASVsIn this directory, there is the fasta file with the generated ASV sequences and there is another directory "./ChimeraCheck" where the pre-chimera filered ASV fasta sits.
+1. ASVs -- ${working_directory}/results/Analyze/Clustering/ASVs -- In this directory, there is the fasta file with the generated ASV sequences and there is another directory "./ChimeraCheck" where the pre-chimera-filtered ASV fasta sits.
In this directory, if --asvMED was set to run, you will see a MED/ directory containing all output files from the oligotyping analyses.

-2. AminoTypes - ${working_directory}/results/Analyze/Clustering/AminoTypesThe AminoTypes directory has a few different subdirectories, in the main directory, however, is the fasta file with the AminoTypes used in all subsequent analyses. The first subdirectory is called "Translation" which includes the raw ASV translation file along with a report spit out by VirtualRibosome. The next subdirectory is "Problematic", where any translations that were below the given "--minAA" length will be reported, if none were deemed "problematic" then the directory will be empty. All problematic amino acid sequence AND their corresponding ASVs are stored in fasta files for you to review. The final subdirectory is "SummaryFiles" where you can find a "map" of sorts to track which ASVs contributed to which AminoTypes and a .gc file containing information on length of translated sequences.
+2. AminoTypes -- ${working_directory}/results/Analyze/Clustering/AminoTypes -- The AminoTypes directory has a few different subdirectories; in the main directory, however, is the fasta file with the AminoTypes used in all subsequent analyses. The first subdirectory is called "Translation", which includes the raw ASV translation file along with a report spit out by VirtualRibosome. The next subdirectory is "Problematic", where any translations that were below the given "--minAA" length will be reported; if none were deemed "problematic" then the directory will be empty. All problematic amino acid sequences AND their corresponding ASVs are stored in fasta files for you to review. The final subdirectory is "SummaryFiles" where you can find a "map" of sorts to track which ASVs contributed to which AminoTypes and a .gc file containing information on the length of translated sequences. In this directory, if --aminoMED was set to run, you will see a MED/ directory containing all output files from the oligotyping analyses.

-3. ncASV - ${working_directory}/results/Analyze/Clustering/ncASVIn this directory, you will find the fasta files corresponding to the clustering percentage(s) you specified for the run.
+3. ncASV -- ${working_directory}/results/Analyze/Clustering/ncASV -- In this directory, you will find the fasta files corresponding to the clustering percentage(s) you specified for the run.

-4. pcASV - ${working_directory}/results/Analyze/Clustering/pcASVLooking in this directory, you probably notice some similar subdirectories. The pcASV directory also contains the Summary, Problematic, and Translation subsirectories we saw in the AminoType directory. The other important files in this directory is the nucleotide and amino acid versions of the pcASVs generated for whichever clustering percentage(s) specified.An important note for when creating pcASVs is that the subsequent analyses (phylogenies, taxonomy assignment, etc.) are run on both nucloetide and amino acid pcASV fastas. To create these files, vAMPirus translates the ASVs, checks for problematic sequences, then clusters the translated sequences by the given percentage(s). After clustering, vAMPirus will go pcASV by pcASV extracting the nucleotide sequences of the ASVs that clustered within a given pcASV. The extracted nucleotide sequences are then used to generate a consensus nucleotide sequence(s) per pcASV.
+4. pcASV -- ${working_directory}/results/Analyze/Clustering/pcASV -- Looking in this directory, you probably notice some similar subdirectories.
The pcASV directory also contains the Summary, Problematic, and Translation subdirectories we saw in the AminoType directory. The other important files in this directory are the nucleotide and amino acid versions of the pcASVs generated for whichever clustering percentage(s) were specified. An important note for when creating pcASVs is that the subsequent analyses (phylogenies, taxonomy assignment, etc.) are run on both the nucleotide and amino acid pcASV fastas. To create these files, vAMPirus translates the ASVs, checks for problematic sequences, then clusters the translated sequences by the given percentage(s). After clustering, vAMPirus will go pcASV by pcASV, extracting the nucleotide sequences of the ASVs that clustered within a given pcASV. The extracted nucleotide sequences are then used to generate a consensus nucleotide sequence(s) per pcASV.

-### Analyses - ${working_directory}/results/Analyze/Analyses
+### Analyses - ${working_directory}/${outdir}/Analyze/Analyses

For each clustering technique (i.e. ASVs, AminoTypes, ncASVs and pcASVs) performed in a given run, resulting taxonomic unit fastas will go through the following analyses (unless skip options are used):

-1. Counts - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/CountsThe Counts directory is where you can find the counts tables as .csv files (and .biome as well for nucleotide counts tables).
+1. Counts -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Counts -- The Counts directory is where you can find the counts tables as .csv files (and .biome as well for nucleotide counts tables).

-2. Phylogeny - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/PhylogenyUnless told otherwise, vAMPirus will produce phylogenetic trees for all taxonomic unit fastas using IQ-TREE. The options for this analysis was discussed in a previous section of the docs. In the phylogeny output directory, you will find three subdirectories: (i) ./Alignment - contains trimmed MAFFT alignment used for tree, (ii) ./ModelTest - contains output files from substitution model prediction with ModelTest-NG, and (iii) ./IQ-TREE - where you can find all output files from IQ-TREE with the file of (usual) interest is the ".treefile".
+2. Phylogeny -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Phylogeny -- Unless told otherwise, vAMPirus will produce phylogenetic trees for all taxonomic unit fastas using IQ-TREE. The options for this analysis were discussed in a previous section of the docs. In the phylogeny output directory, you will find three subdirectories: (i) ./Alignment - contains the trimmed MAFFT alignment used for the tree, (ii) ./ModelTest - contains output files from substitution model prediction with ModelTest-NG, and (iii) ./IQ-TREE - where you can find all output files from IQ-TREE, with the file of (usual) interest being the ".treefile".

-3.
Taxonomy -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Taxonomy -- vAMPirus uses DIAMOND blastp/x and the supplied PROTEIN database for taxonomy assignment of sequences. In the Taxonomy directory, you will find (i) a subdirectory called "DIAMONDOutput" which contains the original output file produced by DIAMOND, (ii) a fasta file that has taxonomy assignments within the sequence headers, and (iii) three different summary files (one being a phyloseq object with taxonomic information, a tab-separated summary file for review by the user, and a summary table looking at abundance of specific hits).

-4. Matrix - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/MatrixThe Matric directory is where you can find all Percent Identity matrices for produced ASV/cASV/AmintoType fastas.
+4. Matrix -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/Matrix -- The Matrix directory is where you can find all Percent Identity matrices for produced ASV/cASV/AminoType fastas.

-5. EMBOSS - ${working_directory}/results/Analyze/Analyses/${clustertechnique}/EMBOSSSeveral different protein physiochemical properties for all amino acid sequences are assessed using EMBOSS scripts (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/groups.html). There are four different subdirectories within EMBOSS, these include (i) ./ProteinProperties - contains files and plots regarding multiple different physiochemical properties (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/pepstats.html), (ii) ./IsoelectricPoint - contains a text file and a .svg image with plots showing the isoelectric point of protein (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/iep.html), (iii) ./HydrophobicMoment - information related to hydrophobic moments of amino acid sequences (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/hmoment.html), and (iv) ./2dStructure - information about 2D structure of proteins (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/protein_2d_structure_group.html).
+5. EMBOSS -- ${working_directory}/results/Analyze/Analyses/${clustertechnique}/EMBOSS -- Several different protein physicochemical properties for all amino acid sequences are assessed using EMBOSS scripts (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/groups.html). There are four different subdirectories within EMBOSS; these include (i) ./ProteinProperties - contains files and plots regarding multiple different physicochemical properties (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/pepstats.html), (ii) ./IsoelectricPoint - contains a text file and a .svg image with plots showing the isoelectric point of each protein (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/iep.html), (iii) ./HydrophobicMoment - information related to hydrophobic moments of amino acid sequences (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/hmoment.html), and (iv) ./2dStructure - information about the 2D structure of proteins (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/protein_2d_structure_group.html).

-### FinalReport - ${working_directory}/results/Analyze/FinalReport
+### FinalReport - ${working_directory}/${outdir}/Analyze/FinalReports

vAMPirus produces final reports for all taxonomic unit fastas produced in the run. These reports contain the following information:

@@ -1343,7 +1490,7 @@ This is a plot that looks at number of reads per sample, similar to what is seen

4.
Diversity analyses box plots

-The plots in order are (i) Shannon Diversity, (ii) Simpson Diversity, (iii) Species Richness.
+The plots in order are (i) Shannon Diversity, (ii) Simpson Diversity, (iii) Richness.

Stats tests included with "--stats":

@@ -1358,11 +1505,11 @@ Stats tests included with "--stats":

5. Distance to centroid box plot

-6. NMDS plots (2D and 3D)
+6. NMDS plots (2D and 3D) -- PCoA (2D and 3D) if NMDS does not converge

-7. Relative ASV/cASV abundance per sample bar chart
+7. Relative sequence abundance per sample bar chart

-8. Absolute ASV/cASV abundance per treatment bar chart
+8. Absolute sequence abundance per treatment bar chart

9. Pairwise percent ID heatmap

@@ -1370,12 +1517,14 @@ Stats tests included with "--stats":

11. Visualized phylogenetic tree

+12. Post-Minimum Entropy Decomposition Analyses (combination of above)
+

# All of the options

-Usage:
+Usage:

-    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV]
+    nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV] [--asvMED] [--aminoMED] [--stats]

--Help options--

@@ -1399,6 +1548,12 @@ Usage:

    --pcASV Set this option to have vAMPirus cluster nucleotide and translated ASVs into protein-based operational taxonomic units (pcASVs) - See options below to define a single percent similarity or a list

+--Minimum Entropy Decomposition arguments--
+
+    --asvMED Set this option to perform Minimum Entropy Decomposition on ASV sequences, see manual for more information. You will need to set a value for --asvC to perform this analysis
+
+    --aminoMED Set this option to perform Minimum Entropy Decomposition on AminoType sequences, see manual for more information. You will need to set a value for --aminoC to perform this analysis
+
--Skip options--

    --skipReadProcessing Set this option to skip all read processing steps in the pipeline

    --skipAdapterRemoval Set this option to skip adapter removal in the pipeline

-    --skipPrimerRemoval Set this option to skup Skip primer removal process
+    --skipPrimerRemoval Set this option to skip the primer removal process
+
+    --skipMerging Set this option to skip read merging

    --skipAminoTyping Set this option to skip AminoTyping processes

@@ -1415,6 +1572,10 @@ Usage:

    --skipPhylogeny Set this option to skip phylogeny processes

+    --skipEMBOSS Set this option to skip EMBOSS processes
+
+    --skipReport Set this option to skip html report generation
+
**NOTE** Most options below can be set using the configuration file (vampirus.config) to avoid a lengthy launch command.

--Project/analysis information--

@@ -1438,6 +1599,12 @@ Usage:

    --maxEE Use this option to set the maximum expected error rate for vsearch merging. Default is 1.

+    --diffs Maximum number of non-matching nucleotides allowed in the overlap region.
+
+    --maxn Maximum number of "N"'s in a sequence - if above the specified value, the sequence will be discarded.
+
+    --minoverlap Minimum length of overlap required for sequence merging to occur for a given pair.
+
--Primer removal--

@@ -1449,7 +1616,7 @@ Usage:

    --minkmer                     Minimum kmer length for primer removal (default = 3)

-    --minilen                     Minimum read length after adapter and primer removal (default = 200)
+    --minilen                     Minimum non-merged read length after adapter and primer removal (default = 100)

Single primer set removal-

@@ -1470,18 +1637,23 @@ Usage:

    --alpha                       Alpha value for denoising - the higher the alpha, the higher the chance of false positives in ASV generation (1 or 2)

-    --minSize                     Minimum size or representation for sequence to be considered in ASV generation
+    --minSize                     Minimum size or representation in the dataset for a sequence to be considered in ASV generation

    --clusterNuclID               With --ncASV set, use this option to set a single percent similarity to cluster nucleotide ASV sequences into ncASVs [ Example: --clusterNuclID .97 ]

-    --clusterNuclIDlist           With --ncASV set, use this option to perform nucleotide-based clustering of ASVs with a comma separated list of percent similarities [ Example: --clusterNuclIDlist .95,.96,.97,.98 ]
+    --clusterNuclIDlist           With --ncASV set, use this option to perform nucleotide clustering with a comma separated list of percent similarities [ Example: --clusterNuclIDlist .95,.96,.97,.98 ]

-    --clusterAAID                 With --pcASV set, use this option to set a single percent similarity for protein-based ASV clustering to generate pcASVs[ Example: --clusterAAID .97 ]
+    --clusterAAID                 With --pcASV set, use this option to set a single percent similarity for protein-based ASV clustering to generate pcASVs [ Example: --clusterAAID .97 ]

    --clusterAAIDlist             With --pcASV set, use this option to perform protein-based ASV clustering to generate pcASVs with a comma separated list of percent similarities [ Example: --clusterAAIDlist .95,.96,.97,.98 ]

    --minAA                       With --pcASV set, use this option to set the expected or minimum amino acid sequence length of open reading frames within your amplicon sequences

+--Minimum Entropy Decomposition--
+
+    --asvC                        Number of high entropy positions to use for ASV MED analysis and generate "Groups"
+
+    --aminoC                      Number of high entropy positions to use for AminoType MED analysis and generate "Groups"

--Counts table generation--

@@ -1500,7 +1672,11 @@ Usage:

    --dbdir                       Path to directory where the database is being stored

-    --refseq                      Set "--refseq T" to toggle use of RefSeq header format; default is "F" to use Reverence Viral DataBase (RVDB) header
+    --headers                     Set the taxonomy database header format -> set to "NCBI" to toggle use of the NCBI header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers
+
+    --dbanno                      Path to the directory containing the HMM annotation .txt file - see the manual for information on this. Leave as is if not planning on using it.
+
+    --lca                         Set --lca T if you would like to add taxonomic classification to taxonomy results - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"

    --bitscore                    Set minimum bitscore to allow for best hit in taxonomy assignment
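Taken together, the new taxonomy options replace the old --refseq toggle. A hedged launch sketch combining them (the database name and paths are placeholders, and value syntax should be double-checked against the manual):

```
# Placeholder database name/path; --headers takes "NCBI" or "RVDB", --lca takes T/F.
nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze \
         --dbname mydatabase.fasta --dbdir /path/to/databases \
         --headers RVDB --lca T
```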
@@ -1539,22 +1715,20 @@ Usage:

    --trymax                      Maximum number of iterations performed by metaMDS

-
# Usage examples

Here are some example launch commands:

## Running the --DataCheck

-There is really only one launch command to run for --DataCheck:

-`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck`
+`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --DataCheck [--asvMED] [--aminoMED]`

-Just submit this launch command with the correct paths and vAMPirus will run the DataCheck and produce a report for you to review. Parameters like --minAA or --maxLen apply to the --DataCheck.
+Just submit this launch command with the correct paths and vAMPirus will run the DataCheck and produce a report for you to review. Parameters like --minAA or --maxLen also apply to the --DataCheck.

## Running the --Analyze

-### Run it all with a list of cluster IDs!
+### Run it all with a list of cluster IDs

`nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --minLen 400 --maxLen 420 --clusterNuclIDlist .91,.92,.93 --clusterAAIDlist .91,.93,.95,.98`

diff --git a/example_data/conf/ex_reports/example_Analyze_Report.html b/example_data/conf/ex_reports/example_Analyze_Report.html
deleted file mode 100644
index bc65412..0000000
--- a/example_data/conf/ex_reports/example_Analyze_Report.html
+++ /dev/null
@@ -1,882 +0,0 @@
[882 deleted lines of rendered HTML trimmed: the bundled example report "vAMPirus Analyze Report vAMPtest_ASVs". The report noted that most plots are interactive and downloadable as .svg, and contained these sections: Pre- and Post-Adapter Removal Read Stats (total reads and R1/R2 read lengths before and after adapter removal); Number of Reads Per Sample; Rarefaction; Diversity Analyses Plots (Shannon diversity, Simpson diversity, and Species Richness box plots, each with Shapiro normality, Bartlett variance-homogeneity, ANOVA, and Tukey HSD test output); Distance To Centroid (with adonis/PERMANOVA output); 2D and 3D NMDS Plots (both reporting "No Convergence" for the example data); OTU Abundance Per Sample; OTU Abundance Per Treatment; Pairwise Percent-ID Heatmap; Taxonomy Result Visualization; and an interactive maximum likelihood Phylogenetic Tree made with IQTREE2, with the *.treefile and bootstrap support values available in the results directory for viewing in FigTree or ITOL.]
diff --git a/example_data/conf/ex_reports/example_DataCheck_Report.html b/example_data/conf/ex_reports/example_DataCheck_Report.html
deleted file mode 100644
index c903911..0000000
--- a/example_data/conf/ex_reports/example_DataCheck_Report.html
+++ /dev/null
@@ -1,391 +0,0 @@
[391 deleted lines of rendered HTML trimmed: the bundled example report "vAMPirus DataCheck Report: vAMPtest". The report contained these sections: Pre- and Post-Adapter Removal Read Stats (total reads and R1/R2 read lengths before and after adapter removal); Post-Merging Read Stats (pre- and post-filtering reads per sample, base frequency per position, mean quality score per position, read GC-content with summary statistics, reads per quality score, and reads per length); and Clustering Statistics (number of ncASVs per clustering percentage, where "1" on the x-axis marks the number of ASVs identified by vsearch, and number of pcASVs per clustering percentage, where "1" marks the number of AminoTypes, i.e. unique amino acid sequences in the dataset).]
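The next hunk updates the bundled test configuration to exercise the new v2.0.0 features (whole-number cluster IDs, MED with asvC/aminoC, and boolean stats). As a hedged sketch only, since the officially supported test commands live in the manual, a run against this config might look like:

```
# Hypothetical invocation of the bundled test configuration; confirm the
# supported test commands in the vAMPirus manual before relying on this.
nextflow run vAMPirus.nf -c example_data/conf/test.config -profile conda --DataCheck
```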
diff --git a/example_data/conf/test.config b/example_data/conf/test.config
index 774f905..cbc63e1 100644
--- a/example_data/conf/test.config
+++ b/example_data/conf/test.config
@@ -3,8 +3,7 @@ Test Config File vAMPirus
 ========================================================================================
                      Virus Amplicon Analysis Pipeline
- Author: Alex J.Veglia
- Version: 1.0 (dev)
+ Author: Alex J.Veglia, Ramón E. Rivera Vicéns
----------------------------------------------------------------------------------------
*/

@@ -15,9 +14,14 @@ params {
     outdir="vAMPtestresults"
     workingdir="${projectDir}/example_data/"
     projtag="vAMPtest"
-    clusterNuclID=".85"
-    clusterAAID=".85"
+    clusterNuclID="85"
+    clusterAAID="85"
     dbname="test_db.fasta"
     dbdir="${projectDir}/example_data/conf/"
-    stats="run"
+    dbtype="RVDB"
+    aminoC=5
+    asvC=5
+    stats = true
+    aminoMED = true
+    asvMED = true
 }

diff --git a/vAMPirus.nf b/vAMPirus.nf
index abef3e5..78d02bb 100644
--- a/vAMPirus.nf
+++ b/vAMPirus.nf
@@ -33,7 +33,7 @@ def helpMessage() {

        Usage:

-            nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV]
+            nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV] [--asvMED] [--aminoMED] [--stats]

        --Help options--

@@ -57,6 +57,12 @@ def helpMessage() {

        --pcASV                       Set this option to have vAMPirus cluster nucleotide and translated ASVs into protein-based operational taxonomic units (pcASVs) - See options below to define a single percent similarity or a list

+        --Minimum Entropy Decomposition arguments--
+
+        --asvMED                      Set this option to perform Minimum Entropy Decomposition on ASV sequences; see the manual for more information. You will need to set a value for --asvC to perform this analysis
+
+        --aminoMED                    Set this option to perform Minimum Entropy Decomposition on AminoType sequences; see the manual for more information. You will need to set a value for --aminoC to perform this analysis
+
        --Skip options--

        --skipReadProcessing          Set this option to skip all read processing steps in the pipeline

@@ -65,7 +71,9 @@ def helpMessage() {

        --skipAdapterRemoval          Set this option to skip adapter removal in the pipeline

-        --skipPrimerRemoval           Set this option to skup Skip primer removal process
+        --skipPrimerRemoval           Set this option to skip the primer removal process
+
+        --skipMerging                 Set this option to skip read merging

        --skipAminoTyping             Set this option to skip AminoTyping processes

@@ -73,6 +81,10 @@ def helpMessage() {

        --skipPhylogeny               Set this option to skip phylogeny processes

+        --skipEMBOSS                  Set this option to skip EMBOSS processes
+
+        --skipReport                  Set this option to skip html report generation
+
        **NOTE** Most options below can be set using the configuration file (vampirus.config) to avoid a lengthy launch command.

        --Project/analysis information--

@@ -96,6 +108,12 @@ def helpMessage() {

        --maxEE                       Use this option to set the maximum expected error rate for vsearch merging. Default is 1.

+        --diffs                       Maximum number of non-matching nucleotides allowed in the overlap region.
+
+        --maxn                        Maximum number of "N"'s in a sequence - if above the specified value, the sequence will be discarded.
+
+        --minoverlap                  Minimum length of overlap for sequence merging to occur for a pair.
+
        --Primer removal--

@@ -107,7 +125,7 @@ def helpMessage() {

        --minkmer                     Minimum kmer length for primer removal (default = 3)

-        --minilen                     Minimum read length after adapter and primer removal (default = 200)
+        --minilen                     Minimum non-merged read length after adapter and primer removal (default = 100)

        Single primer set removal-

@@ -128,7 +146,7 @@ def helpMessage() {

        --alpha                       Alpha value for denoising - the higher the alpha, the higher the chance of false positives in ASV generation (1 or 2)

-        --minSize                     Minimum size or representation for sequence to be considered in ASV generation
+        --minSize                     Minimum size or representation in the dataset for a sequence to be considered in ASV generation

        --clusterNuclID               With --ncASV set, use this option to set a single percent similarity to cluster nucleotide ASV sequences into ncASVs [ Example: --clusterNuclID .97 ]

@@ -140,6 +158,11 @@ def helpMessage() {

        --minAA                       With --pcASV set, use this option to set the expected or minimum amino acid sequence length of open reading frames within your amplicon sequences

+        --Minimum Entropy Decomposition--
+
+        --asvC                        Number of high entropy positions to use for ASV MED analysis and generate "Groups"
+
+        --aminoC                      Number of high entropy positions to use for AminoType MED analysis and generate "Groups"

        --Counts table generation--

@@ -158,7 +181,11 @@ def helpMessage() {

        --dbdir                       Path to directory where the database is being stored

-        --refseq                      Set "--refseq T" to toggle use of RefSeq header format; default is "F" to use Reverence Viral DataBase (RVDB) header
+        --headers                     Set the taxonomy database header format -> set to "NCBI" to toggle use of the NCBI header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers
+
+        --dbanno                      Path to the directory containing the HMM annotation .txt file - see the manual for information on this. Leave as is if not planning on using it.
+
+        --lca                         Set --lca T if you would like to add taxonomic classification to taxonomy results - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"

        --bitscore                    Set minimum bitscore to allow for best hit in taxonomy assignment

@@ -205,186 +232,213 @@ def fullHelpMessage() {
      THIS IS A LONGER HELP WITH USAGE EXAMPLES vAMPirus v${workflow.manifest.version}
==============================================================================================================================================================================================

        Steps:
            1- Before launching the vAMPirus.nf, be sure to run the vampirus_startup.sh script to install dependencies and/or databases

            2- Test the vAMPirus installation with the provided test dataset (if you have run the startup script, you can see STARTUP_HELP.txt for test commands and other examples)

            3. Edit parameters in vampirus.config file

            4. Launch the DataCheck pipeline to get summary information about your dataset

            5. Change any parameters in vampirus.config file that might aid your analysis (e.g. clustering ID, maximum merged read length)

            6. Launch the Analyze pipeline to perform a comprehensive analysis with your dataset
            7. Explore results directories and produced final reports


        Usage:

            nextflow run vAMPirus.nf -c vampirus.config -profile [conda|singularity] --[Analyze|DataCheck] [--ncASV] [--pcASV]

        --Help options--

        --help                        Print help information

        --fullHelp                    Print even more help information

        --Mandatory arguments (choose one)--

        --Analyze                     Run absolutely everything

        --DataCheck                   Assess how your data performs during processing and clustering

        --ASV clustering arguments--

        --ncASV                       Set this option to have vAMPirus cluster nucleotide amplicon sequence variants (ASVs) into nucleotide-based operational taxonomic units (ncASVs) - See options below to define a single percent similarity or a list

        --pcASV                       Set this option to have vAMPirus cluster nucleotide and translated ASVs into protein-based operational taxonomic units (pcASVs) - See options below to define a single percent similarity or a list

        --Minimum Entropy Decomposition arguments--

        --asvMED                      Set this option to perform Minimum Entropy Decomposition on ASV sequences; see the manual for more information. You will need to set a value for --asvC to perform this analysis

        --aminoMED                    Set this option to perform Minimum Entropy Decomposition on AminoType sequences; see the manual for more information.
                                      You will need to set a value for --aminoC to perform this analysis

        --Skip options--

        --skipReadProcessing          Set this option to skip all read processing steps in the pipeline

        --skipFastQC                  Set this option to skip FastQC steps in the pipeline

        --skipAdapterRemoval          Set this option to skip adapter removal in the pipeline

        --skipPrimerRemoval           Set this option to skip the primer removal process

        --skipMerging                 Set this option to skip read merging

        --skipAminoTyping             Set this option to skip AminoTyping processes

        --skipTaxonomy                Set this option to skip taxonomy assignment processes

        --skipPhylogeny               Set this option to skip phylogeny processes

        --skipEMBOSS                  Set this option to skip EMBOSS processes

        --skipReport                  Set this option to skip html report generation

        **NOTE** Most options below can be set using the configuration file (vampirus.config) to avoid a lengthy launch command.

        --Project/analysis information--

        --projtag                     Set project name to be used as a prefix for output files

        --metadata                    Set path to metadata spreadsheet file to be used for report generation (must be defined if generating report)

        --reads                       Path to directory containing read libraries, must have *R{1,2}* in the library names

        --workingdir                  Path to working directory where Nextflow will put all Nextflow and vAMPirus generated output files

        --outdir                      Name of results directory containing all output from the chosen pipeline (will be made within the working directory)
        --Merged read length filtering--

        --minLen                      Minimum merged read length - reads below the specified minimum length will be used for counts only

        --maxLen                      Maximum merged read length - reads with length equal to the specified max read length will be used to identify unique sequences and for subsequent Amplicon Sequence Variant (ASV) analysis

        --maxEE                       Use this option to set the maximum expected error rate for vsearch merging. Default is 1.

        --diffs                       Maximum number of non-matching nucleotides allowed in the overlap region.

        --maxn                        Maximum number of "N"'s in a sequence - if above the specified value, the sequence will be discarded.

        --minoverlap                  Minimum length of overlap for sequence merging to occur for a pair.

        --Primer removal--

        General primer removal parameters

        --primerLength                Use this option to set the max primer length to restrict bbduk.sh primer trimming to the first x number of bases

        --maxkmer                     Maximum kmer length for bbduk.sh to use for primer detection and removal (must be shorter than your primer length; default = 13)

        --minkmer                     Minimum kmer length for primer removal (default = 3)

        --minilen                     Minimum non-merged read length after adapter and primer removal (default = 100)

        Single primer set removal-

        --GlobTrim                    Set this option to perform global trimming to reads to remove primer sequences.
                                      Example usage "--GlobTrim #basesfromforward,#basesfromreverse"

        --fwd                         Forward primer sequence for reads to be detected and removed from reads (must specify reverse sequence if providing forward)

        --rev                         Reverse primer sequence for reads to be detected and removed from reads (must specify forward sequence if providing reverse)

        Multiple primer set removal-

        --multi                       Use this option to signal multiple primer sequence removal within the specified pipeline

        --primers                     Use this option to set the path to a fasta file with all of the primer sequences to be detected and removed from reads

        --Amplicon Sequence Variant (ASV) generation and clustering--

        --alpha                       Alpha value for denoising - the higher the alpha, the higher the chance of false positives in ASV generation (1 or 2)

        --minSize                     Minimum size or representation in the dataset for a sequence to be considered in ASV generation

        --clusterNuclID               With --ncASV set, use this option to set a single percent similarity to cluster nucleotide ASV sequences into ncASVs [ Example: --clusterNuclID .97 ]

        --clusterNuclIDlist           With --ncASV set, use this option to perform nucleotide clustering with a comma separated list of percent similarities [ Example: --clusterNuclIDlist .95,.96,.97,.98 ]

        --clusterAAID                 With --pcASV set, use this option to set a single percent similarity for protein-based ASV clustering to generate pcASVs [ Example: --clusterAAID .97 ]

        --clusterAAIDlist             With --pcASV set, use this option to perform protein-based ASV clustering to generate pcASVs with a comma separated list of percent similarities [ Example: --clusterAAIDlist .95,.96,.97,.98 ]

        --minAA                       With --pcASV set, use this option to set the expected or minimum amino acid sequence length of open reading frames within your amplicon sequences

        --Minimum Entropy Decomposition--

        --asvC                        Number of high entropy positions to use for ASV MED analysis and generate "Groups"

        --aminoC                      Number of high entropy positions to use for AminoType MED analysis and
                                      generate "Groups"

        --Counts table generation--

        --asvcountID                  Similarity ID to use for ASV counts

        --ProtCountID                 Minimum amino acid sequence similarity for hit to count

        --ProtCountsLength            Minimum alignment length for hit to count

        --ProtCountsBit               Minimum bitscore for hit to be counted

        --Taxonomy inference parameters--

        --dbname                      Specify name of database to use for analysis

        --dbdir                       Path to directory where the database is being stored

        --headers                     Set the taxonomy database header format -> set to "NCBI" to toggle use of the NCBI header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers

        --dbanno                      Path to the directory containing the HMM annotation .txt file - see the manual for information on this. Leave as is if not planning on using it.
        --lca                         Set --lca T if you would like to add taxonomic classification to taxonomy results - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"

        --bitscore                    Set minimum bitscore to allow for best hit in taxonomy assignment

        --minID                       Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment

        --minaln                      Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment

        --Phylogeny analysis parameters--

        Setting custom options for IQ-TREE (Example: "-option1 A -option2 B -option3 C -option4 D") - might be easier to set in the vampirus.config file at lines 108/109

        --iqCustomnt                  Use option to set custom options to use in all IQTREE analyses with nucleotide sequences

        --iqCustomaa                  Use option to set custom options to use in all IQTREE analyses with amino acid sequences

        The options below you can set at the command line, for example, to use the model determined by ModelTest-NG with parametric bootstrapping: --ModelTnt --ModelTaa --parametric

        --ModelTnt=false              Signal for IQ-TREE to use the model determined by ModelTest-NG for all IQTREE analyses with nucleotide sequences (default is for IQ-TREE to do automatic model testing with ModelFinder Plus)

        --ModelTaa=false              Signal for IQ-TREE to use the model determined by ModelTest-NG for all IQTREE analyses with amino acid sequences

        --parametric                  Set to use parametric bootstrapping in IQTREE analyses

        --nonparametric               Set to use non-parametric bootstrapping in IQTREE analyses

        --boots                       Number of bootstraps (recommended 1000 for parametric and 100 for non-parametric)

        --Statistics options--

        --stats                       Set "--stats" to signal statistical tests to be performed and included in the final report

        --minimumCounts               Minimum number of hit counts for a sample to have to be included in the downstream statistical analyses and report generation

        --trymax                      Maximum number of iterations performed by metaMDS

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

@@ -397,9 +451,9 @@ def fullHelpMessage() {

        DataCheck pipeline =>

-        Example 1. Launching the vAMPirus DataCheck pipeline using conda
+        Example 1. Launching the vAMPirus DataCheck pipeline with MED analyses using conda

-            nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck
+            nextflow run vAMPirus.nf -c vampirus.config -profile conda --DataCheck --asvMED --aminoMED
        Example 2. Launching the vAMPirus DataCheck pipeline using Singularity and multiple primer removal with the path to the fasta file with the primer sequences set in the launch command

@@ -414,7 +468,7 @@ def fullHelpMessage()

        Example 4. Launching the vAMPirus Analyze pipeline with Singularity with ASV and AminoType generation with all accessory analyses (taxonomy assignment, EMBOSS, IQTREE, statistics)

-            nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats run
+            nextflow run vAMPirus.nf -c vampirus.config -profile singularity --Analyze --stats

        Example 5. Launching the vAMPirus Analyze pipeline with conda to perform multiple primer removal and protein-based clustering of ASVs, but skip most of the extra analyses

@@ -422,7 +476,11 @@ def fullHelpMessage()

        Example 6. Launching vAMPirus Analyze pipeline with conda to produce only ASV-related results

-            nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats run
+            nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats
+
+        Example 7. Launching vAMPirus Analyze pipeline with conda to perform ASV analyses with Minimum Entropy Decomposition to form "Groups"
+
+            nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --skipAminoTyping --stats --asvMED --asvC 24

        Resuming analyses =>

@@ -445,17 +503,37 @@ if (params.help) {
     exit 0
 }

-// This will be printed to the user in each run. here thy can check if the values the selected are fine
 log.info """\
================================================================================================================================================
                               vAMPirus v${workflow.manifest.version} - Virus Amplicon Sequencing Analysis Pipeline
================================================================================================================================================
-        Project name:          ${params.projtag}
-        Working directory:     ${params.workingdir}
-        Results directory:     ${params.outdir}
-        Database directory:    ${params.dbdir}
-        Database name:         ${params.dbname}
-        Metadata file:         ${params.metadata}
+
+        -------------------------------------------------Project details---------------------------------------------
+        Project name:                ${params.projtag}
+        Working directory:           ${params.workingdir}
+        Results directory:           ${params.outdir}
+        Metadata file:               ${params.metadata}
+
+        ---------------------------------------------------Run details------------------------------------------------
+        Maximum merged read length:  ${params.maxLen}
+        ASV filtering:               ${params.filter}
+        Database directory:          ${params.dbdir}
+        Database name:               ${params.dbname}
+        Database type:               ${params.dbtype}
+        ncASV:                       ${params.ncASV}
+        pcASV:                       ${params.pcASV}
+        ASV MED:                     ${params.asvMED}
+        AminoType MED:               ${params.aminoMED}
+        Skip FastQC:                 ${params.skipFastQC}
+        Skip read processing:        ${params.skipReadProcessing}
+        Skip adapter removal:        ${params.skipAdapterRemoval}
+        Skip primer removal:         ${params.skipPrimerRemoval}
+        Skip read merging:           ${params.skipMerging}
+        Skip AminoTyping:            ${params.skipAminoTyping}
+        Skip Taxonomy:               ${params.skipTaxonomy}
+        Skip phylogeny:              ${params.skipPhylogeny}
+        Skip EMBOSS:                 ${params.skipEMBOSS}
+        Skip Reports:                ${params.skipReport}
""".stripIndent()
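One practical note on the hunk just below: the added cluster-ID handling splits --clusterNuclIDlist / --clusterAAIDlist on commas and casts each entry with `it as int`, which matches the test.config change from clusterNuclID=".85" to "85" and suggests whole-number percent IDs. A hedged launch sketch in that form (values are illustrative only):

```
# Whole-number percent-similarity IDs (95 = 95%), matching the new integer parsing.
nextflow run vAMPirus.nf -c vampirus.config -profile conda --Analyze --ncASV --clusterNuclIDlist 95,96,97
```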
if (params.readsTest) {
    Channel
        .fromFilePairs(params.readsTest)
        .ifEmpty{ exit 1, "params.readTest was empty - no input files supplied" }
-        .into{ reads_ch; reads_qc_ch }
+        .into{ reads_ch; reads_qc_ch; reads_processing }
} else {
    println("\n\tEverything ready for launch.\n")
    Channel
        .fromFilePairs("${params.reads}", checkIfExists: true)
-        .into{ reads_ch; reads_qc_ch }
+        .into{ reads_ch; reads_qc_ch; reads_processing }
+}
+if (params.clusterNuclIDlist == "") {
+    a=params.clusterNuclID
+    slist=[a]
+    nnuc=slist.size()
+} else {
+    msize=params.clusterNuclIDlist
+    slist=msize.split(',').collect{it as int}
+    nnuc=slist.size()
+}
+if (params.clusterAAIDlist == "") {
+    b=params.clusterAAID
+    slist2=[b]
+    naa=slist2.size()
+} else {
+    msize2=params.clusterAAIDlist
+    slist2=msize2.split(',').collect{it as int}
+    naa=slist2.size()
 }

-if (params.Analyze) {
+if (params.DataCheck || params.Analyze) {

-    println("\n\tRunning vAMPirus Analyze pipeline - This might take a while, check out Nextflow tower (tower.nf) to remotely monitor the run.\n")
+    println("\n\tRunning vAMPirus - This might take a while depending on the mode and dataset size, check out Nextflow tower (tower.nf) to remotely monitor the run.\n")

-    if (!params.skipTaxonomy) {
+    if (!params.skipTaxonomy && params.Analyze) {

        process Database_Check {

            script:
@@ -500,7 +596,10 @@ if (params.Analyze) {
                cp ${params.vampdir}/Databases/${params.dbname}* ${params.dbdir}/
                if [ ! -e ${params.dbdir}/${params.dbname}.dmnd ];then
                    echo "It needs to be built up, doing it now"
-                    diamond makedb --in ${params.dbdir}/${params.dbname} -d ${params.dbdir}/${params.dbname}
+                    if [[ ${params.ncbitax} == "true" && ${params.dbtype} == "NCBI" ]]
+                    then    diamond makedb --in ${params.dbdir}/${params.dbname} -d ${params.dbdir}/${params.dbname} --taxonmap ${params.dbdir}/NCBItaxonomy/prot.accession2taxid.FULL --taxonnodes ${params.dbdir}/NCBItaxonomy/nodes.dmp --taxonnames ${params.dbdir}/NCBItaxonomy/names.dmp
+                    else    diamond makedb --in ${params.dbdir}/${params.dbname} -d ${params.dbdir}/${params.dbname}
+                    fi
                    export virdb=${params.dbdir}/${params.dbname}
                else
                    echo "Database looks to be present and built."
@@ -530,7 +629,7 @@ if (params.Analyze) {
        }
    }

-    if (!params.skipReadProcessing) {
+    if (!params.skipReadProcessing || !params.skipMerging ) {

        if (!params.skipFastQC) {

@@ -540,13 +639,13 @@ if (params.Analyze) {

                tag "${sample_id}"

-                publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/FastQC/PreClean", mode: "copy", overwrite: true
+                publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/FastQC/PreClean", mode: "copy", overwrite: true

                input:
                    tuple sample_id, file(reads) from reads_qc_ch

                output:
-                    tuple sample_id, file("*_fastqc.{zip,html}") into fastqc_results_OAS
+                    tuple sample_id, file("*_fastqc.{zip,html}") into fastqc_results

                script:
                    """
@@ -563,8 +662,8 @@ if (params.Analyze) {

                tag "${sample_id}"

-                publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/AdapterRemoval", mode: "copy", overwrite: true, pattern: "*.filter.fq"
-                publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/AdapterRemoval/fastpOut", mode: "copy", overwrite: true, pattern: "*.fastp.{json,html}"
+                publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/AdapterRemoval", mode: "copy", overwrite: true, pattern: "*.filter.fq"
+                publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/AdapterRemoval/fastpOut", mode: "copy", overwrite: true, pattern: "*.fastp.{json,html}"

                input:
                    tuple sample_id, file(reads) from reads_ch

@@ -572,7 +671,7 @@ if (params.Analyze) {
                output:
                    tuple sample_id, file("*.fastp.{json,html}") into fastp_results
                    tuple sample_id, file("*.filter.fq") into reads_fastp_ch
-                    file("*.csv") into fastp_csv
+                    file("*.csv") into ( fastp_csv_in1, fastp_csv_in2 )

                script:
                    """
@@ -589,6 +688,7 @@ if (params.Analyze) {
            reads_ch
                .set{ reads_fastp_ch }
            fastp_results = Channel.empty()
+
        }

        if (!params.skipPrimerRemoval) {

@@ -599,7 +699,7 @@ if (params.Analyze) {

                tag "${sample_id}"

-                publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/PrimerRemoval", mode: "copy", overwrite: true
+                publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/PrimerRemoval", mode: "copy", overwrite: true

                input:
                    tuple sample_id, file(reads) from reads_fastp_ch

@@ -621,7 +721,7 @@ if (params.Analyze) {
                    RTRIM=\$( echo ${GlobTrim} | cut -f 2 -d "," )
                    bbduk.sh in=${reads[0]} out=${sample_id}_bb_R1.fastq.gz ftl=\${FTRIM} t=${task.cpus}
                    bbduk.sh in=${reads[1]} out=${sample_id}_bb_R2.fastq.gz ftl=\${RTRIM} t=${task.cpus}
-                    repair.sh in1=${sample_id}_bb_R1.fastq.gz in2=${sample_id}_bb_R2.fastq.gz out1=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz outs=sing.fq repair
+                    repair.sh in1=${sample_id}_bb_R1.fastq.gz in2=${sample_id}_bb_R2.fastq.gz out1=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz outs=sing.fq repair
                    """
                } else if ( params.multi && params.primers ) {
                    """
@@ -636,9 +736,10 @@ if (params.Analyze) {
        } else {
            reads_fastp_ch
                .set{ reads_bbduk_ch }
+
        }

-        if (!params.skipFastQC) {
+        if (!params.skipFastQC && !params.skipPrimerRemoval) {

            process QualityCheck_2 {

@@ -646,13 +747,13 @@ if (params.Analyze) {

                tag "${sample_id}"

-                publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/FastQC/PostClean", mode: "copy", overwrite: true
+                publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/FastQC/PostClean", mode: "copy", overwrite: true

                input:
                    tuple sample_id, file(reads) from readsforqc2

                output:
-                    tuple sample_id, file("*_fastqc.{zip,html}") into fastqc2_results_OAS
+                    tuple sample_id, file("*_fastqc.{zip,html}") into fastqc2_results

                script:
                    """
@@ -660,38 +761,50 @@ if (params.Analyze) {
(params.Analyze) { """ } } + } else { + reads_ch + .set{ reads_bbduk_ch } + } - process Read_Merging { + if (!params.skipMerging) { - label 'norm_cpus' + process Read_Merging { - tag "${sample_id}" + label 'norm_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/Individual", mode: "copy", overwrite: true, pattern: "*mergedclean.fastq" - publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/Individual/notmerged", mode: "copy", overwrite: true, pattern: "*notmerged*.fastq" + tag "${sample_id}" - input: - tuple sample_id, file(reads) from reads_bbduk_ch + publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/Individual", mode: "copy", overwrite: true, pattern: "*mergedclean.fastq" + publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/Individual/notmerged", mode: "copy", overwrite: true, pattern: "*notmerged*.fastq" - output: - file("*_mergedclean.fastq") into reads_vsearch1_ch - file("*.name") into names - file("*notmerged*.fastq") into notmerged + input: + tuple sample_id, file(reads) from reads_bbduk_ch - script: - """ - vsearch --fastq_mergepairs ${reads[0]} --reverse ${reads[1]} --threads ${task.cpus} --fastqout ${sample_id}_mergedclean.fastq --fastqout_notmerged_fwd ${sample_id}_notmerged_fwd.fastq --fastqout_notmerged_rev ${sample_id}_notmerged_rev.fastq --fastq_maxee ${params.maxEE} --relabel ${sample_id}. - echo ${sample_id} > ${sample_id}.name - """ + output: + file("*_mergedclean.fastq") into reads_vsearch1_ch + file("*.name") into names + file("*notmerged*.fastq") into notmerged + + script: + """ + vsearch --fastq_mergepairs ${reads[0]} --reverse ${reads[1]} --threads ${task.cpus} --fastqout ${sample_id}_mergedclean.fastq --fastqout_notmerged_fwd ${sample_id}_notmerged_fwd.fastq --fastqout_notmerged_rev ${sample_id}_notmerged_rev.fastq --fastq_maxdiffs ${params.diffs} --fastq_maxns ${params.maxn} --fastq_allowmergestagger --fastq_maxee ${params.maxEE} --fastq_minovlen ${params.minoverlap} --relabel ${sample_id}. 
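+                # --fastq_maxdiffs/--fastq_maxns/--fastq_minovlen cap mismatches, Ns, and the minimum overlap
+                # during pair merging; --fastq_allowmergestagger additionally merges staggered (read-through)
+                # pairs, whose overhangs vsearch trims off.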
+                echo ${sample_id} > ${sample_id}.name
+                """
+
+        }
+    } else {
+        reads_bbduk_ch
+            .set{ reads_vsearch1_ch }
    }

-        process Compile_Reads {
+
+    process Filtering_Prep1 {

            label 'low_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true

            input:
                file(reads) from reads_vsearch1_ch

@@ -706,11 +819,11 @@ if (params.Analyze) {
                """
        }

-        process Compile_Names {
+    process Filtering_Prep2 {

            label 'low_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging", mode: "copy", overwrite: true
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging", mode: "copy", overwrite: true

            input:
                file(names) from names

@@ -726,39 +839,72 @@ if (params.Analyze) {

        }

-        process Length_Filtering { //changed
+    process Length_Filtering {

            label 'norm_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true, pattern: "*_merged_preFilt*.fasta"
-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging", mode: "copy", overwrite: true, pattern: "*Lengthfiltered.fastq"
-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/Histograms/pre_length_filtering", mode: "copy", overwrite: true, pattern: "*preFilt_*st.txt"
-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/Histograms/post_length_filtering", mode: "copy", overwrite: true, pattern: "*postFilt_*st.txt"
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true, pattern: "*_merged_preFilt*.fasta"
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging", mode: "copy", overwrite: true, pattern: "*Lengthfiltered.fastq"
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/Histograms/pre_length_filtering", mode: "copy", overwrite: true, pattern: "*preFilt_*st.txt"
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/Histograms/post_length_filtering", mode: "copy", overwrite: true, pattern: "*postFilt_*st.txt"

            input:
                file(reads) from collect_samples_ch

            output:
-                file("*_merged_preFilt_clean.fasta") into ( nuclCounts_mergedreads_ch, pcASV_mergedreads_ch )
-                file("*_merged_clean_Lengthfiltered.fastq") into reads_vsearch2_ch
                file("*_merged_preFilt_clean.fastq") into ( mergeforprotcounts, mergeforpcASVaacounts )
-                file("**hist.txt") into histos
+
+                file("*_merged_preFilt_clean.fasta") into ( nuclCounts_mergedreads_asv_ch, nuclCounts_mergedreads_ncasv_ch, pcASV_mergedreads_ch )
+                file("*_merged_clean_Lengthfiltered.fastq") into reads_vsearch2_ch
+
+                file("*preFilt_preClean_baseFrequency_hist.csv") into prefilt_basefreq
+                file("*preFilt_preClean_qualityScore_hist.csv") into prefilt_qualityscore
+                file("*preFilt_preClean_gcContent_hist.csv") into prefilt_gccontent
+                file("*preFilt_preClean_averageQuality_hist.csv") into prefilt_averagequality
+                file("*preFilt_preClean_length_hist.csv") into prefilt_length
+
+                file("*postFilt_baseFrequency_hist.csv") into postFilt_basefreq
+                file("*postFilt_qualityScore_hist.csv") into postFilt_qualityscore
+                file("*postFilt_gcContent_hist.csv") into postFilt_gccontent
+                file("*postFilt_averageQuality_hist.csv") into postFilt_averagequality
+                file("*postFilt_length_hist.csv") into postFilt_length
+
+                file("reads_per_sample_preFilt_preClean.csv") into reads_per_sample_preFilt
+                file("read_per_sample_postFilt_postClean.csv") into reads_per_sample_postFilt

            script:
                """
-                bbduk.sh in=${reads} bhist=${params.projtag}_all_merged_preFilt_preClean_baseFrequency_hist.txt qhist=${params.projtag}_all_merged_preFilt_preClean_qualityScore_hist.txt gchist=${params.projtag}_all_merged_preFilt_preClean_gcContent_hist.txt aqhist=${params.projtag}_all_merged_preFilt_preClean_averageQuality_hist.txt lhist=${params.projtag}_all_merged_preFilt__preClean_length_hist.txt gcbins=auto
+                # from DC
+                bbduk.sh in=${reads} bhist=${params.projtag}_all_merged_preFilt_preClean_baseFrequency_hist.txt qhist=${params.projtag}_all_merged_preFilt_preClean_qualityScore_hist.txt gchist=${params.projtag}_all_merged_preFilt_preClean_gcContent_hist.txt aqhist=${params.projtag}_all_merged_preFilt_preClean_averageQuality_hist.txt lhist=${params.projtag}_all_merged_preFilt_preClean_length_hist.txt gcbins=auto
+                for x in *preFilt*hist.txt;do
+                    pre=\$(echo \$x | awk -F ".txt" '{print \$1}')
+                    cat \$x | tr "\t" "," > \${pre}.csv
+                    rm \$x
+                done
+                reformat.sh in=${reads} out=${params.projtag}_preFilt_preclean.fasta t=${task.cpus}
+                echo "sample,reads" >> reads_per_sample_preFilt_preClean.csv
+                grep ">" ${params.projtag}_preFilt_preclean.fasta | awk -F ">" '{print \$2}' | awk -F "." '{print \$1}' | sort --parallel=${task.cpus} | uniq -c | sort -brg --parallel=${task.cpus} | awk '{print \$2","\$1}' >> reads_per_sample_preFilt_preClean.csv
+                rm ${params.projtag}_preFilt_preclean.fasta
                fastp -i ${reads} -o ${params.projtag}_merged_preFilt_clean.fastq -b ${params.maxLen} -l ${params.minLen} --thread ${task.cpus} -n 1
                reformat.sh in=${params.projtag}_merged_preFilt_clean.fastq out=${params.projtag}_merged_preFilt_clean.fasta t=${task.cpus}
                bbduk.sh in=${params.projtag}_merged_preFilt_clean.fastq out=${params.projtag}_merged_clean_Lengthfiltered.fastq minlength=${params.maxLen} maxlength=${params.maxLen} t=${task.cpus}
                bbduk.sh in=${params.projtag}_merged_clean_Lengthfiltered.fastq bhist=${params.projtag}_all_merged_postFilt_baseFrequency_hist.txt qhist=${params.projtag}_all_merged_postFilt_qualityScore_hist.txt gchist=${params.projtag}_all_merged_postFilt_gcContent_hist.txt aqhist=${params.projtag}_all_merged_postFilt_averageQuality_hist.txt lhist=${params.projtag}_all_merged_postFilt_length_hist.txt gcbins=auto
-                """
+                for x in *postFilt*hist.txt;do
+                    pre=\$(echo \$x | awk -F ".txt" '{print \$1}')
+                    cat \$x | tr "\t" "," > \${pre}.csv
+                    rm \$x
+                done
+                reformat.sh in=${params.projtag}_merged_clean_Lengthfiltered.fastq out=${params.projtag}_merged_clean_Lengthfiltered.fasta t=${task.cpus}
+                echo "sample,reads" >> read_per_sample_postFilt_postClean.csv
+                grep ">" ${params.projtag}_merged_clean_Lengthfiltered.fasta | awk -F ">" '{print \$2}' | awk -F "." '{print \$1}' | sort --parallel=${task.cpus} | uniq -c | sort -brg --parallel=${task.cpus} | awk '{print \$2","\$1}' >> read_per_sample_postFilt_postClean.csv
+                """
        }

-        process Extract_Uniques {
+    process Extracting_Uniques {

            label 'low_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/ReadProcessing/ReadMerging/Uniques", mode: "copy", overwrite: true
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ReadMerging/Uniques", mode: "copy", overwrite: true

            input:
                file(reads) from reads_vsearch2_ch

@@ -776,7 +922,7 @@ if (params.Analyze) {

            label 'norm_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/ASVs/ChimeraCheck", mode: "copy", overwrite: true
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/Clustering/ASVs/ChimeraCheck", mode: "copy", overwrite: true

            input:
                file(reads) from reads_vsearch3_ch

@@ -794,13 +940,14 @@ if (params.Analyze) {

            label 'low_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/ASVs", mode: "copy", overwrite: true
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/Clustering/ASVs", mode: "copy", overwrite: true

            input:
                file(fasta) from reads_vsearch4_ch

            output:
-                file("*ASVs.fasta") into ( reads_vsearch5_ch, nucl2aa, asvsforAminotyping, asvfastaforcounts, asvaminocheck )
+                file("*ASVs.fasta") into asvforfilt
+

            script:
                """
@@ -809,134 +956,648 @@ if (params.Analyze) {
        }
        // UNTIL HERE DEFAULT

+    if (params.filter) {
+
+        process ASV_Filtering {
+
+            label 'norm_cpus'
+
+            publishDir "${params.workingdir}/${params.outdir}/ReadProcessing/ASVFiltering", mode: "copy", overwrite: true
+
+            input:
+                file(asv) from asvforfilt
+
+            output:
+                file("*ASV.fasta") into ( reads_vsearch5_ch, asv_med, nucl2aa, asvsforAminotyping, asvfastaforcounts, asvaminocheck )
+                file("*.csv") into ( nothing )
+                file("*diamondfilter.out") into ( noth )
+
+            script:
+                """
+                cp ${params.vampdir}/bin/rename_seq.py .
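+                # Strategy: build one DIAMOND database from the renamed "filter" and "keep" FASTAs, blastx
+                # every ASV against it, then retain ASVs whose best hit is in the keep set (plus no-hit ASVs
+                # when params.keepnohit is true) and relabel the survivors ASV1..ASVn.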
+
+                #create and rename filter database
+                grep ">" ${params.filtDB} | sed 's/ //g' | awk -F ">" '{print \$2}' >> filt.head
+                j=1
+                for y in \$( cat filt.head );do
+                    echo ">Filt"\$j"" >> filt.headers
+                    j=\$(( \${j}+1 ))
+                done
+                ./rename_seq.py ${params.filtDB} filt.headers filterdatabaserenamed.fasta
+                cat filterdatabaserenamed.fasta >> combodatabase.fasta
+                paste -d',' filt.head filt.headers > filtername_map.csv
+
+                #create and rename keep database
+                grep ">" ${params.keepDB} | sed 's/ //g' | awk -F ">" '{print \$2}' >> keep.head
+                d=1
+                for y in \$( cat keep.head );do
+                    echo ">keep"\$d"" >> keep.headers
+                    d=\$(( \${d}+1 ))
+                done
+                ./rename_seq.py ${params.keepDB} keep.headers keepdatabaserenamed.fasta
+                cat keepdatabaserenamed.fasta >> combodatabase.fasta
+                paste -d',' keep.head keep.headers > keepername_map.csv
+                rm filterdatabaserenamed.fasta
+                #index database
+                diamond makedb --in combodatabase.fasta --db combodatabase.fasta
+                #run diamond_db
+                diamond blastx -q ${asv} -d combodatabase.fasta -p ${task.cpus} --id ${params.filtminID} -l ${params.filtminaln} -e ${params.filtevalue} --${params.filtsensitivity} -o ${params.projtag}_diamondfilter.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident --max-target-seqs 1 --max-hsps 1
+                #get asvs
+                grep ">" ${asv} | awk -F ">" '{print \$2}' > asv.list
+                for x in \$(cat asv.list)
+                do #check for a hit
+                    if [[ \$(grep -c "\$x" ${params.projtag}_diamondfilter.out) -eq 1 ]]
+                    then #check if hit is to filter
+                        hit=\$(grep "\$x" ${params.projtag}_diamondfilter.out | awk '{print \$3}')
+                        if [[ \$(grep -c "\$hit" filt.headers) -eq 1 ]]
+                        then echo "\$x,\$hit" >> filtered_asvs_summary.csv
+                        elif [[ \$(grep -c "\$hit" keep.headers) -eq 1 ]]
+                        then echo "\$x" >> kep.list
+                        fi
+                    else echo \$x >> nohit.list
+                    fi
+                done
+                if [[ ${params.keepnohit} == "true" ]]
+                then cat nohit.list >> kep.list
+                    cat kep.list | sort >> keep.list
+                    seqtk subseq ${asv} keep.list > kept.fasta
+                    u=1
+                    for y in \$( cat keep.list );do
+                        echo ">ASV\${u}" >> asvrename.list
+                        u=\$(( \${u}+1 ))
+                    done
+                    ./rename_seq.py ${asv} asvrename.list ${params.projtag}_ASV.fasta
+                else
+                    cat kep.list | sort > keep.list
+                    seqtk subseq ${asv} keep.list > kept.fasta
+                    u=1
+                    for y in \$( cat keep.list );do
+                        echo ">ASV\${u}" >> asvrename.list
+                        u=\$(( \${u}+1 ))
+                    done
+                    ./rename_seq.py ${asv} asvrename.list ${params.projtag}_ASV.fasta
+                fi
+                paste -d',' keep.list asvrename.list > ASV_rename_map.csv
+                """
+        }
+
+    } else {
+        asvforfilt
+            .into{ reads_vsearch5_ch; asv_med; nucl2aa; asvsforAminotyping; asvfastaforcounts; asvaminocheck }
+    }

-    if (params.ncASV) {
+    if (params.DataCheck) {

-        process NucleotideBased_ASV_clustering {
+        process NucleotideBased_ASV_clustering_DC {

            label 'norm_cpus'

-            publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/ncASV", mode: "copy", overwrite: true, pattern: '*ncASV*.fasta'
+            publishDir "${params.workingdir}/${params.outdir}/DataCheck/ClusteringTest/Nucleotide", mode: "copy", overwrite: true, pattern: '*{.csv}'

            input:
                file(fasta) from reads_vsearch5_ch

            output:
-                tuple file("*_ncASV*.fasta"), file("*ASV.fasta") into ( nuclFastas_forDiamond_ch, nuclFastas_forCounts_ch, nuclFastas_forMatrix_ch)
-                tuple file("*_ncASV*.fasta"), file("*ASV.fasta") into nuclFastas_forphylogeny
+                file("number_per_percentage_nucl.csv") into number_per_percent_nucl_plot

            script:
-                if (params.clusterNuclIDlist) {
+                if (params.datacheckntIDlist) {
                """
-                cp ${fasta} ./${params.projtag}_ASV.fasta
-                for id in `echo ${params.clusterNuclIDlist} | tr "," "\\n"`;do
-                    vsearch --cluster_fast ${params.projtag}_ASV.fasta --centroids ${params.projtag}_ncASV\${id}.fasta --threads ${task.cpus} --relabel ncASV --id \${id}
+                for id in `echo ${params.datacheckntIDlist} | tr "," "\\n"`;do
+                    vsearch --cluster_fast ${fasta} --centroids ${params.projtag}_ncASV\${id}.fasta --threads ${task.cpus} --relabel OTU --id \${id}
                done
+                for x in *ncASV*.fasta;do
+                    id=\$( echo \$x | awk -F "_ncASV" '{print \$2}' | awk -F ".fasta" '{print \$1}')
+                    numb=\$( grep -c ">" \$x )
+                    echo "\${id},\${numb}" >> number_per_percentage_nucl.csv
+                done
+                yo=\$(grep -c ">" ${fasta})
+                echo "1.0,\${yo}" >> number_per_percentage_nucl.csv
                """
-                } else if (params.clusterNuclID) {
+                }
+        }
+
+        process Translation_For_ProteinBased_Clustering_DC {
+
+            label 'norm_cpus'
+
+            publishDir "${params.workingdir}/${params.outdir}/DataCheck/ClusteringTest/Aminoacid/translation", mode: "copy", overwrite: true
+
+            input:
+                file(fasta) from nucl2aa
+
+            output:
+                file("*ASVprotforclust.fasta") into clustering_aa
+                file("*_translation_report") into reportaa_VR
+                file("*_ASV_all.fasta") into asvfastaforaaclust
+
+            script:
                """
-                cp ${fasta} ./${params.projtag}_ASV.fasta
-                id=${params.clusterNuclID}
-                vsearch --cluster_fast ${params.projtag}_ASV.fasta --centroids ${params.projtag}_ncASV\${id}.fasta --threads ${task.cpus} --relabel ncASV --id \${id}
+                ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASVprotforclust.fasta --report ${params.projtag}_translation_report
+                cp ${fasta} ${params.projtag}_ASV_all.fasta
+                """
+        }
+
+        process Protein_clustering_DC {
+
+            label 'norm_cpus'
+
+            publishDir "${params.workingdir}/${params.outdir}/DataCheck/ClusteringTest/Aminoacid", mode: "copy", overwrite: true, pattern: '*{.csv}'
+
+            input:
+                file(fasta) from clustering_aa
+                file(asvs) from asvfastaforaaclust
+
+            output:
+                file("number_per_percentage_prot.csv") into number_per_percent_prot_plot
+                file("*aminoacid_pcASV1.0_noTaxonomy.fasta") into amino_med

            script:
+                // add awk script to count seqs
+                """
+                set +e
+                cp ${params.vampdir}/bin/rename_seq.py .
+                for id in `echo ${params.datacheckaaIDlist} | tr "," "\\n"`;do
+                    if [ \${id} == ".55" ];then
+                        word=3
+                    elif [ \${id} == ".65" ];then
+                        word=4
+                    else
+                        word=5
+                    fi
+                    awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${fasta} > ${params.projtag}_filtered_proteins.fasta
+                    cd-hit -i ${params.projtag}_filtered_proteins.fasta -n \${word} -c \${id} -o ${params.projtag}_pcASV\${id}.fasta
+                    sed 's/>Cluster />Cluster_/g' ${params.projtag}_pcASV\${id}.fasta.clstr >${params.projtag}_pcASV\${id}.clstr
+                    grep ">Cluster_" ${params.projtag}_pcASV\${id}.clstr >temporaryclusters.list
+                    y=\$(grep -c ">Cluster_" ${params.projtag}_pcASV\${id}.clstr)
+                    echo ">Cluster_"\${y}"" >> ${params.projtag}_pcASV\${id}.clstr
+                    t=1
+                    b=1
+                    for x in \$(cat temporaryclusters.list);do
+                        echo "Extracting \$x"
+                        name="\$( echo \$x | awk -F ">" '{print \$2}')"
+                        clust="pcASV"\${t}""
+                        echo "\${name}"
+                        awk '/^>'\${name}'\$/,/^>Cluster_'\${b}'\$/' ${params.projtag}_pcASV\${id}.clstr > "\${name}"_"\${clust}"_tmp.list
+                        t=\$(( \${t}+1 ))
+                        b=\$(( \${b}+1 ))
+                    done
+                    ls *_tmp.list
+                    u=1
+                    for x in *_tmp.list;do
+                        name="\$(echo \$x | awk -F "_p" '{print \$1}')"
+                        echo "\${name}"
+                        cluster="\$(echo \$x | awk -F "_" '{print \$3}')"
+                        echo "\${cluster}"
+                        grep "ASV" \$x | awk -F ", " '{print \$2}' | awk -F "_" '{print \$1}' | awk -F ">" '{print \$2}' > \${name}_\${cluster}_seqs_tmps.list
+                        seqtk subseq ${asvs} \${name}_\${cluster}_seqs_tmps.list > \${name}_\${cluster}_nucleotide_sequences.fasta
+                        vsearch --cluster_fast \${name}_\${cluster}_nucleotide_sequences.fasta --id 0.2 --centroids \${name}_\${cluster}_centroids.fasta
+                        grep ">" \${name}_\${cluster}_centroids.fasta >> \${name}_\${cluster}_tmp_centroids.list
+                        for y in \$( cat \${name}_\${cluster}_tmp_centroids.list );do
+                            echo ">\${cluster}_type"\$u"" >> \${name}_\${cluster}_tmp_centroid.newheaders
+                            u=\$(( \${u}+1 ))
+                        done
+                        u=1
+                        ./rename_seq.py \${name}_\${cluster}_centroids.fasta \${name}_\${cluster}_tmp_centroid.newheaders \${cluster}_types_labeled.fasta
+                    done
+                    cat *_types_labeled.fasta >> ${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta
+                    grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$3}' | awk -F "." '{print \$1}' >tmphead.list
+                    grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list
+                    paste -d"," temporaryclusters.list tmphead.list >tmp.info.csv
+                    grep ">" ${params.projtag}_pcASV\${id}.fasta >lala.list
+                    j=1
+                    for x in \$(cat lala.list);do
+                        echo ">${params.projtag}_pcASV\${j}" >>${params.projtag}_aminoheaders.list
+                        echo "\${x},>${params.projtag}_pcASV\${j}" >>tmpaminotype.info.csv
+                        j=\$(( \${j}+1 ))
+                    done
+                    rm lala.list
+                    awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list
+                    for x in \$(cat tmporder.list);do
+                        grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list
+                    done
+                    paste -d "," temporaryclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_pcASVCluster\${id}_summary.csv
+                    ./rename_seq.py ${params.projtag}_pcASV\${id}.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta
+                    stats.sh in=${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_aminoacid_clustered.gc gcformat=4 overwrite=true
+                    stats.sh in=${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_nucleotide_clustered.gc gcformat=4 overwrite=true
+                    awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${fasta} >${params.projtag}_pcASV\${id}_problematic_translations.fasta
+                    if [ `wc -l ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then
+                        grep ">" ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list
+                        seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_pcASV\${id}_problematic_nucleotides.fasta
+                    else
+                        rm ${params.projtag}_pcASV\${id}_problematic_translations.fasta
+                    fi
+                    rm *.list
+                    rm Cluster*
+                    rm *types*
+                    rm *tmp*
+                    rm ${params.projtag}_pcASV\${id}.fast*
+                done
+                for x in *aminoacid*noTaxonomy.fasta;do
+                    id=\$( echo \$x | awk -F "_noTax" '{print \$1}' | awk -F "pcASV" '{print \$2}')
+                    numb=\$( grep -c ">" \$x)
+                    echo "\${id},\${numb}" >> number_per_percentage_protz.csv
+                done
+                yesirr=\$( wc -l number_per_percentage_protz.csv | awk '{print \$1}')
+                tail -\$(( \${yesirr}-1 )) number_per_percentage_protz.csv > number_per_percentage_prot.csv
+                head -1 number_per_percentage_protz.csv >> number_per_percentage_prot.csv
+                rm number_per_percentage_protz.csv
                """
-                }
        }

-        if (params.ncASV) {
+        if (params.asvMED) {

+            process ASV_Shannon_Entropy_Analysis {
+
+                label 'norm_cpus'
+
+                publishDir "${params.workingdir}/${params.outdir}/DataCheck/ClusteringTest/Nucleotide/ShannonEntropy", mode: "copy", overwrite: true
+
+                input:
+                    file(asvs) from asv_med
+
+                output:
+
+                    file("*_ASV_entropy_breakdown.csv") into asv_entro_csv
+                    file("*Aligned_informativeonly.fasta-ENTROPY") into asv_entropy
+                    file("*ASV*") into entrop
+
+                script:
+                    """
+                    set +e
+                    #alignment
+                    ${tools}/muscle5.0.1278_linux64 -in ${asvs} -out ${params.projtag}_ASVs_muscleAlign.fasta -threads ${task.cpus} -quiet
+                    #trimming
+                    trimal -in ${params.projtag}_ASVs_muscleAlign.fasta -out ${params.projtag}_ASVs_muscleAligned.fasta -keepheader -fasta -automated1
+                    rm ${params.projtag}_ASVs_muscleAlign.fasta
+                    o-trim-uninformative-columns-from-alignment ${params.projtag}_ASVs_muscleAligned.fasta
+                    mv ${params.projtag}_ASVs_muscleAligned.fasta-TRIMMED ./${params.projtag}_ASVs_Aligned_informativeonly.fasta
+                    #entropy analysis
+                    entropy-analysis ${params.projtag}_ASVs_Aligned_informativeonly.fasta
+                    #summarize entropy peaks
+                    awk '{print \$2}' ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY >> tmp_value.list
+                    for x in \$(cat tmp_value.list)
+                    do echo "\$x"
+                        if [[ \$(echo "\$x > 0.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.0-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.1"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.1-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.2"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.2-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.3"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.3-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.4"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.4-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.5"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.5-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.6"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.6-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.7"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.7-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.8"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.8-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.9"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.9-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.0-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.1"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.1-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.2"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.2-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.3"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.3-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.4"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.4-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.5"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.5-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.6"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.6-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.7"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.7-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.8"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.8-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.9"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.9-.list
+                        fi
+                        if [[ \$(echo "\$x > 2.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-2.0-.list
+                        fi
+                    done
+                    echo "Entropy,Peaks_above" >> ${params.projtag}_ASV_entropy_breakdown.csv
+                    for z in above*.list;
+                    do entrop=\$(echo \$z | awk -F "-" '{print \$2}')
+                        echo ""\$entrop", "\$(wc -l \$z | awk '{print \$1}')"" >> ${params.projtag}_ASV_entropy_breakdown.csv
+                    done
+                    rm above*
+                    mv ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY ./tmp2.tsv
+                    cat tmp2.tsv | tr "\\t" "," > tmp.csv
+                    rm tmp2.tsv
+                    echo "Base_position,Shannons_Entropy" >> ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY
+                    cat tmp.csv >> ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY
+                    rm tmp.csv
+
+                    """
+
+            }
+        } else {
+            asv_entropy = Channel.empty()
+            asv_entro_csv = Channel.empty()
+        }
+
+        if (params.aminoMED) {
+
+            process AminoType_Shannon_Entropy_Analysis {
+
+                label 'norm_cpus'
+
+                publishDir "${params.workingdir}/${params.outdir}/DataCheck/ClusteringTest/Aminoacid/ShannonEntropy", mode: "copy", overwrite: true, pattern: '*{.csv}'
+
+                input:
+                    file(aminos) from amino_med
+
+                output:
+                    file("*AminoType_entropy_breakdown.csv") into amino_entro_csv
+                    file ("*Aligned_informativeonly.fasta-ENTROPY") into amino_entropy
+                    file("*AminoTypes*") into aminos
+
+                script:
+                    """
+                    #alignment
+                    if [[ \$(grep -c ">" ${aminos}) -gt 499 ]]; then algo="super5"; else algo="mpc"; fi
+                    ${tools}/muscle5.0.1278_linux64 -\${algo} ${aminos} -out ${params.projtag}_AminoTypes_muscleAlign.fasta -threads ${task.cpus} -quiet
+                    #trimming
+                    trimal -in ${params.projtag}_AminoTypes_muscleAlign.fasta -out ${params.projtag}_AminoTypes_muscleAligned.fasta -keepheader -fasta -automated1
+                    rm ${params.projtag}_AminoTypes_muscleAlign.fasta
+                    o-trim-uninformative-columns-from-alignment ${params.projtag}_AminoTypes_muscleAligned.fasta
+                    mv ${params.projtag}_AminoTypes_muscleAligned.fasta-TRIMMED ./${params.projtag}_AminoTypes_Aligned_informativeonly.fasta
+                    #entropy analysis
+                    entropy-analysis ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta --amino-acid-sequences
+                    #summarize entropy peaks
+                    awk '{print \$2}' ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY >> tmp_value.list
+                    for x in \$(cat tmp_value.list)
+                    do echo "\$x"
+                        if [[ \$(echo "\$x > 0.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.0-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.1"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.1-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.2"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.2-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.3"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.3-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.4"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.4-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.5"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.5-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.6"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.6-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.7"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.7-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.8"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.8-.list
+                        fi
+                        if [[ \$(echo "\$x > 0.9"|bc -l) -eq 1 ]];
+                        then echo dope >> above-0.9-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.0-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.1"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.1-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.2"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.2-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.3"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.3-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.4"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.4-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.5"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.5-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.6"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.6-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.7"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.7-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.8"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.8-.list
+                        fi
+                        if [[ \$(echo "\$x > 1.9"|bc -l) -eq 1 ]];
+                        then echo dope >> above-1.9-.list
+                        fi
+                        if [[ \$(echo "\$x > 2.0"|bc -l) -eq 1 ]];
+                        then echo dope >> above-2.0-.list
+                        fi
+                    done
+                    echo "Entropy,Peaks_above" >> ${params.projtag}_AminoType_entropy_breakdown.csv
+                    for z in above*.list;
+                    do entrop=\$(echo \$z | awk -F "-" '{print \$2}')
+                        echo ""\$entrop", "\$(wc -l \$z | awk '{print \$1}')"" >> ${params.projtag}_AminoType_entropy_breakdown.csv
+                    done
+                    rm above*
+                    mv ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY ./tmp2.tsv
+                    cat tmp2.tsv | tr "\t" "," > tmp.csv
+                    rm tmp2.tsv
+                    echo "Base_position,Shannons_Entropy" >> ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY
+                    cat tmp.csv >> ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY
+                    rm tmp.csv
+                    """
+            }

-            process Nucleotide_Taxonomy_Inference {
+        } else {
+            amino_entro_csv = Channel.empty()
+            amino_entropy = Channel.empty()
+        }

-                label 'high_cpus'
+        if (!params.skipReadProcessing || !params.skipMerging ) {

-                publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy", mode: "copy", overwrite: true, pattern: '*_ASV*.{fasta,csv,tsv}'
-                publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy", mode: "copy", overwrite: true, pattern: '*ncASV*.{fasta,csv,tsv}'
-                publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*_ASV*dmd.out'
-                publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*ncASV*dmd.out'
+            process combine_csv_DC {

                input:
-                    tuple file(notus), file(asvs) from nuclFastas_forDiamond_ch
+                    file(csv) from fastp_csv_in1
+                        .collect()

                output:
-                    file("*.fasta") into tax_labeled_fasta
-                    tuple file("*_phyloseqObject.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond
-                    file("*ncASV*summary_for_plot.csv") into taxplot1a
-                    file("*_ASV*_summary_for_plot.csv") into taxplot1
+                    file("final_reads_stats.csv") into fastp_csv_dc

                script:
                    """
-                    cp ${params.vampdir}/bin/rename_seq.py .
-                    virdb=${params.dbdir}/${params.dbname}
-                    grep ">" \${virdb} > headers.list
-                    headers="headers.list"
-                    for filename in ${notus};do
-                        name=\$(ls \${filename} | awk -F ".fasta" '{print \$1}')
-                        diamond blastx -q \${filename} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1
-                        echo "Preparing lists to generate summary .csv's"
-                        echo "[Best hit accession number]" > access.list
-                        echo "[e-value]" > evalue.list
-                        echo "[Bitscore]" > bit.list
-                        echo "[Percent ID (aa)]" > pid.list
-                        echo "[Organism ID]" > "\$name"_virus.list
-                        echo "[Gene]" > "\$name"_genes.list
-                        grep ">" \${filename} | awk -F ">" '{print \$2}' > seqids.lst
-                        echo "extracting genes and names"
-                        touch new_"\$name"_asvnames.txt
-                        j=1
-                        if [ `echo \${filename} | grep -c "ncASV"` -eq 1 ];then
-                            echo "[ncASV#]" > otu.list
-                            echo "[ncASV sequence length]" > length.list
-                            for s in \$(cat seqids.lst);do
-                                echo "Checking for \$s hit in diamond output"
-                                if [[ ${params.refseq} == "T" ]];then
-                                    echo "RefSeq headers specified"
-                                    if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then
-                                        echo "Yep, there was a hit for \$s"
-                                        echo "Extracting the information now:"
-                                        acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}')
-                                        echo "\$s" >> otu.list
-                                        echo "\$acc" >> access.list
-                                        line="\$(grep -w "\$s" "\$name"_dmd.out)"
-                                        echo "\$line" | awk '{print \$10}' >> evalue.list
-                                        echo "\$line" | awk '{print \$11}' >> bit.list
-                                        echo "\$line" | awk '{print \$12}' >> pid.list
-                                        echo "\$line" | awk '{print \$2}' >> length.list
-                                        echo "Extracting virus and gene ID for \$s now"
-                                        gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') &&
-                                        echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list
-                                        virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g')
-                                        echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list
-                                        echo ">ncASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                                        j=\$((\$j+1))
-                                        echo "\$s done."
-                                    else
-                                        echo "Ugh, there was no hit for \$s .."
-                                        echo "We still love \$s though and we will add it to the final fasta file"
-                                        echo "\$s" >> otu.list
-                                        echo "NO_HIT" >> access.list
-                                        echo "NO_HIT" >> "\$name"_genes.list
-                                        echo "NO_HIT" >> "\$name"_virus.list
-                                        echo "NO_HIT" >> evalue.list
-                                        echo "NO_HIT" >> bit.list
-                                        echo "NO_HIT" >> pid.list
-                                        echo "NO_HIT" >> length.list
-                                        virus="NO"
-                                        gene="HIT"
-                                        echo ">ncASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                                        j=\$((\$j+1))
-                                        echo "\$s done."
-                                    fi
-                                else
-                                    echo "Using RVDB headers."
-                                    if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then
+                    cat ${csv} >all_reads_stats.csv
+                    head -n1 all_reads_stats.csv >tmp.names.csv
+                    cat all_reads_stats.csv | grep -v ""Sample,Total_"" >tmp.reads.stats.csv
+                    cat tmp.names.csv tmp.reads.stats.csv >final_reads_stats.csv
+                    rm tmp.names.csv tmp.reads.stats.csv
+                    """

+            }
+        } else {
+
+            process skip_combine_csv_DC {
+                output:
+                    file("filter_reads.txt") into fastp_csv_dc
+
+                script:
+                    """
+                    echo "Read processing steps skipped." >filter_reads.txt
+                    """
+            }
+        }
+
+        report_dc_in = Channel.create()
+        fastp_csv_dc.mix( reads_per_sample_preFilt, reads_per_sample_postFilt, prefilt_basefreq, postFilt_basefreq, prefilt_qualityscore, postFilt_qualityscore, prefilt_gccontent, postFilt_gccontent, prefilt_averagequality, postFilt_averagequality, prefilt_length, postFilt_length, number_per_percent_nucl_plot, number_per_percent_prot_plot, amino_entro_csv, amino_entropy, asv_entro_csv, asv_entropy).into(report_dc_in)
+
+        process Report_DataCheck {
+
+            label 'norm_cpus'
+
+            publishDir "${params.workingdir}/${params.outdir}/DataCheck/Report", mode: "copy", overwrite: true, pattern: '*.{html}'
+
+            input:
+                file(files) from report_dc_in
+                    .collect()
+
+            output:
+                file("*.html") into datacheckreport
+
+            script:
+                """
+                cp ${params.vampdir}/bin/vAMPirus_DC_Report.Rmd .
+                cp ${params.vampdir}/example_data/conf/vamplogo.png .
+                Rscript -e "rmarkdown::render('vAMPirus_DC_Report.Rmd',output_file='${params.projtag}_DataCheck_Report.html')" ${params.projtag} \
+                ${params.skipReadProcessing} \
+                ${params.skipMerging} \
+                ${params.skipAdapterRemoval} \
+                ${params.asvMED} \
+                ${params.aminoMED}
+                """
+        }

+} else if (params.Analyze) {

+    if (params.ncASV) {

+        reads_vsearch5_ch
+            .into{ asv_file_for_ncasvs; nuclFastas_forDiamond_asv_ch; nuclFastas_forCounts_asv_ch; nuclFastas_forphylogeny_asv; nuclFastas_forMatrix_asv_ch; asv_for_med }

+        process NucleotideBased_ASV_clustering {

+            label 'norm_cpus'

+            tag "${mtag}"

+            publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/ncASV", mode: "copy", overwrite: true, pattern: '*ncASV*.fasta'

+            input:
+                each x from 1..nnuc
+                file(fasta) from asv_file_for_ncasvs

+            output:
+                tuple nid, file("*_ncASV*.fasta") into ( nuclFastas_forphylogeny_ncasv, nuclFastas_forDiamond_ncasv_ch, nuclFastas_forCounts_ncasv_ch, nuclFastas_forMatrix_ncasv_ch )

+            script:
+                nid=slist.get(x-1)
+                mtag="ID=" + slist.get(x-1)
+                """
+                vsearch --cluster_fast ${fasta} --centroids ${params.projtag}_ncASV${nid}.fasta --threads ${task.cpus} --relabel ncASV --id .${nid}
+                """
+        }

+        if (!params.skipTaxonomy) {

+            if (params.dbtype == "NCBI") {

+                process ncASV_Taxonomy_Inference_NCBI { /////// editttt

+                    label 'high_cpus'

+                    tag "${mtag}"

+                    publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy", mode: "copy", overwrite: true, pattern: '*ncASV*.{fasta,csv,tsv}'
+                    publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*ncASV*dmd.out'

+                    input:
+                        tuple nid, file(asvs) from nuclFastas_forDiamond_ncasv_ch

+                    output:
+                        file("*.fasta") into tax_labeled_fasta_ncasv
+                        tuple file("*_phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond_ncasv
+                        tuple nid, file("*ncASV*summary_for_plot.csv") into taxplot_ncasv
+                        tuple nid, file("*_quick_Taxbreakdown.csv") into tax_table_ncasv
+                        tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_ncasv

+                    script:
+                        mtag="ID=" + nid
+                        """
+                        cp ${params.vampdir}/bin/rename_seq.py .
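+                        # Overview: pick the DIAMOND cutoff flag from params.measurement, blastx the ncASVs
+                        # against the viral database (adding NCBI taxonomy columns when params.ncbitax is
+                        # true), then parse each best hit to relabel sequences and build the summary tables.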
+ virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + if [[ ${params.ncbitax} == "true" ]] + then diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop staxids sskingdoms skingdoms sphylums --max-target-seqs 1 --max-hsps 1 + else diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + fi + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[ncASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "[NCBI Taxonomy ID],[Taxonomic classification from NCBI]" > ncbi_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Checking for \$s hit in diamond output" + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then echo "Yep, there was a hit for \$s" echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') echo "\$s" >> otu.list echo "\$acc" >> access.list line="\$(grep -w "\$s" "\$name"_dmd.out)" @@ -945,14 +1606,24 @@ if (params.Analyze) { echo "\$line" | awk '{print \$12}' >> pid.list echo "\$line" | awk '{print \$2}' >> length.list echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + gene=\$(grep -w "\$acc" "\$headers" | awk -F "." 
'{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">ncASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) + echo ">"\${s}"_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "\$line" | awk -F "\t" '{print \$14","\$16"::"\$18"::"\$17}' >> ncbi_classification.list + fi echo "\$s done." - else + else echo "Ugh, there was no hit for \$s .." echo "We still love \$s though and we will add it to the final fasta file" echo "\$s" >> otu.list @@ -965,17 +1636,19 @@ if (params.Analyze) { echo "NO_HIT" >> length.list virus="NO" gene="HIT" - echo ">ncASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + if [[ "${params.ncbitax}" == "true" ]] + then echo "N/A" >> ncbi_classification.list + fi echo "\$s done." - fi - fi - echo "Done with \$s" - done - fi + fi + done echo "Now editing "\$name" fasta headers" ###### rename_seq.py - ./rename_seq.py \${filename} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta echo "[Sequence header]" > newnames.list cat new_"\$name"_asvnames.txt >> newnames.list @@ -983,9 +1656,27 @@ if (params.Analyze) { echo " " > sequence.list grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list rm "\$name"_tmpssasv.fasta - paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloseqObject.csv - paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv - for x in *phyloseqObject.csv;do + if [[ "${params.lca}" == "T" && "${params.ncbitax}" == "true" ]] + then + paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv + elif [[ "${params.lca}" == "T" && "${params.ncbitax}" != "true" ]] + then + paste -d "," sequence.list 
"\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv + elif [[ "${params.ncbitax}" == "true" && "${params.lca}" != "T"]] + then + paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv + else + paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + echo "skipped" >> \${name}_quick_Taxbreakdown.csv + fi + for x in *phyloformat.csv;do echo "\$x" lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; @@ -993,17 +1684,292 @@ if (params.Analyze) { cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; rm tmpcol.list tmp2col.list done - rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst otu.list ; - rm *asvnames.txt - rm "\$name"_virus.list - rm "\$name"_genes.list - rm newnames.list - rm access.list - echo "Taxonomy inferred for: \${filename} " - done - for filename in ${asvs};do - name=\$(ls \${filename} | awk -F ".fasta" '{print \$1}') - diamond blastx -q \${filename} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } else if (params.dbtype== "RVDB") { + + process ncASV_Taxonomy_Inference_RVDB { /////// editttt + + label 'high_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy", mode: "copy", overwrite: true, pattern: '*ncASV*.{fasta,csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*ncASV*dmd.out' + + input: + tuple nid, file(asvs) from nuclFastas_forDiamond_ncasv_ch + + output: + file("*.fasta") into tax_labeled_fasta_ncasv + tuple file("*_phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond_ncasv + tuple nid, file("*ncASV*summary_for_plot.csv") into taxplot_ncasv + tuple nid, 
file("*_quick_Taxbreakdown.csv") into tax_table_ncasv + tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_ncasv + + script: + mtag="ID=" + nid + """ + cp ${params.vampdir}/bin/rename_seq.py . + virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[ncASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Using RVDB headers." + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." 
+ echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + echo "\$s done." + fi + echo "Done with \$s" + done + echo "Now editing "\$name" fasta headers" + ###### rename_seq.py + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta + echo "[Sequence header]" > newnames.list + cat new_"\$name"_asvnames.txt >> newnames.list + touch sequence.list + echo " " > sequence.list + grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list + rm "\$name"_tmpssasv.fasta + if [[ "${params.lca}" == "T" ]] + then paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv + else paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } + } + + + process Generate_ncASV_Counts_Table { + + label 'norm_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Counts", mode: "copy", overwrite: true, pattern: '*_ASV*.{biome,csv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Counts", mode: "copy", overwrite: true, pattern: '*ncASV*.{biome,csv}' + + input: + tuple nid, file(notus) from nuclFastas_forCounts_ncasv_ch + file(merged) from nuclCounts_mergedreads_ncasv_ch + + output: + tuple file("*_counts.csv"), file("*_counts.biome") into counts_vsearch_ncasv + tuple nid, file("*ncASV*counts.csv") into notu_counts_plots + + script: + mtag="ID=" + nid + """ + name=\$( echo ${notus} | awk -F ".fasta" '{print \$1}') + vsearch --usearch_global ${merged} --db ${notus} --id .${nid} --threads ${task.cpus} --otutabout \${name}_counts.txt --biomout \${name}_counts.biome + cat \${name}_counts.txt | tr "\t" "," 
>\${name}_count.csv + sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv + rm \${name}_count.csv + """ + } + + process Generate_ncASV_Matrices { + label 'low_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Matrices", mode: "copy", overwrite: true, pattern: '*ncASV*PercentID.matrix' + + input: + tuple nid, file(asvs) from nuclFastas_forMatrix_ncasv_ch + + output: + file("*.matrix") into clustmatrices_ncasv + tuple nid, file("*ncASV*PercentID.matrix") into notu_heatmap + + script: + mtag="ID=" + nid + """ + name=\$( echo ${asvs}| awk -F ".fasta" '{print \$1}') + clustalo -i ${asvs} --distmat-out=\${name}_PairwiseDistance.matrix --full --force --threads=${task.cpus} + clustalo -i ${asvs} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} + cat \${name}_PercentIDq.matrix | tr " " "," | grep "," >\${name}_PercentID.matrix + rm \${name}_PercentIDq.matrix + """ + } + + if (!params.skipPhylogeny) { + + process ncASV_Phylogeny { + + label 'norm_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*ncASV*aln.*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*ncASV*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*ncASV*iq*' + + input: + tuple nid, file(asvs) from nuclFastas_forphylogeny_ncasv + + output: + tuple nid, file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into align_results_ncasv + tuple nid, file("*iq.treefile") into nucl_phyl_plot_ncasv + + script: + mtag="ID=" + nid + """ + pre=\$(echo ${asvs} | awk -F ".fasta" '{print \$1}' ) + ${tools}/muscle5.0.1278_linux64 -in ${asvs} -out \${pre}_ALN.fasta -threads ${task.cpus} -quiet + trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html + o-trim-uninformative-columns-from-alignment \${pre}_aln.fasta + mv \${pre}_aln.fasta-TRIMMED ./\${pre}_Aligned_informativeonly.fasta + # Nucleotide_ModelTest + modeltest-ng -i \${pre}_Aligned_informativeonly.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint + # Nucleotide_Phylogeny + if [ "${params.iqCustomnt}" != "" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq --redo -t \${pre}_mt.tree -T auto ${params.iqCustomnt} + elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} + elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} + elif [ "${params.parametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -t 
\${pre}_mt.tree -nt auto -bb ${params.boots} -bnni
+                    else
+                        iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni
+                    fi
+                    """
+            }
+
+        }
+
+    } else {
+        reads_vsearch5_ch
+            .into{ nuclFastas_forDiamond_asv_ch; nuclFastas_forCounts_asv_ch; nuclFastas_forphylogeny_asv; nuclFastas_forMatrix_asv_ch; asv_for_med }
+    }
+
+    if (!params.skipTaxonomy) {
+
+      if (params.dbtype == "NCBI") {
+
+        process ASV_Taxonomy_Inference_NCBI {
+
+            label 'high_cpus'
+
+            publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy", mode: "copy", overwrite: true, pattern: '*_ASV*.{fasta,csv,tsv}'
+            publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*_ASV*dmd.out'
+
+            input:
+                file(asvs) from nuclFastas_forDiamond_asv_ch
+
+            output:
+                file("*.fasta") into tax_labeled_fasta_asv
+                tuple file("*_phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond_asv
+                file("*_ASV*_summary_for_plot.csv") into taxplot_asv
+                file("*_quick_Taxbreakdown.csv") into tax_table_asv
+                file ("*_quicker_taxbreakdown.csv") into tax_nodCol_asv
+
+            script:
+                """
+                cp ${params.vampdir}/bin/rename_seq.py .
+                virdb=${params.dbdir}/${params.dbname}
+                if [[ ${params.measurement} == "bitscore" ]]
+                then    measure="--min-score ${params.bitscore}"
+                elif    [[ ${params.measurement} == "evalue" ]]
+                then    measure="-e ${params.evalue}"
+                else    measure="--min-score ${params.bitscore}"
+                fi
+                grep ">" \${virdb} > headers.list
+                headers="headers.list"
+                name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}')
+                if [[ ${params.ncbitax} == "true" ]]
+                then    diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop staxids sskingdoms skingdoms sphylums --max-target-seqs 1 --max-hsps 1
+                else    diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1
+                fi
                 echo "Preparing lists to generate summary .csv's"
                 echo "[Best hit accession number]" > access.list
                 echo "[e-value]" > evalue.list
@@ -1011,18 +1977,23 @@ if (params.Analyze) {
                 echo "[Bitscore]" > bit.list
                 echo "[Percent ID (aa)]" > pid.list
                 echo "[Organism ID]" > "\$name"_virus.list
                 echo "[Gene]" > "\$name"_genes.list
-                grep ">" \${filename} | awk -F ">" '{print \$2}' > seqids.lst
+                echo "[ASV#]" > otu.list
+                echo "[Sequence length]" > length.list
+                grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst
+                if [[ ${params.lca} == "T" ]]
+                then    grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list
+                        echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list
+                else
+                        echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list
+                fi
+                if [[ ${params.ncbitax} == "true" ]]
+                then    echo "[NCBI Taxonomy ID],[Taxonomic classification from NCBI]" > ncbi_classification.list
+                fi
                 echo "extracting genes and names"
                 touch new_"\$name"_asvnames.txt
-                j=1
-                if [ `echo \${filename} | grep -c "ASV"` -eq 1 ];then
-                    for s in \$(cat seqids.lst);do
-                        echo "[ASV#]" > otu.list
-                        echo "[ASV sequence length]" > length.list
-                        echo "Checking for \$s hit in diamond output"
-                        if [[ ${params.refseq} == "T" ]];then
-                            echo "RefSeq headers specified"
-                            if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then
+                for s in \$(cat seqids.lst);do
+                    echo "Checking for \$s hit in diamond output"
+                    if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then
                        echo "Yep, there was a hit for \$s"
                        echo "Extracting the information now:"
                        acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}')
@@ -1038,48 +2009,20 @@ if (params.Analyze) {
                        echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list
                        virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g')
                        echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list
-                       echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                       j=\$((\$j+1))
-                       echo "\$s done."
-                   else
-                       echo "Ugh, there was no hit for \$s .."
-                       echo "We still love \$s though and we will add it to the final fasta file"
-                       echo "\$s" >> otu.list
-                       echo "NO_HIT" >> access.list
-                       echo "NO_HIT" >> "\$name"_genes.list
-                       echo "NO_HIT" >> "\$name"_virus.list
-                       echo "NO_HIT" >> evalue.list
-                       echo "NO_HIT" >> bit.list
-                       echo "NO_HIT" >> pid.list
-                       echo "NO_HIT" >> length.list
-                       virus="NO"
-                       gene="HIT"
-                       echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                       j=\$((\$j+1))
-                       echo "\$s done."
-                   fi
-                   else
-                       echo "Using RVDB headers."
-                       if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then
-                           echo "Yep, there was a hit for \$s"
-                           echo "Extracting the information now:"
-                           acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}')
-                           echo "\$s" >> otu.list
-                           echo "\$acc" >> access.list
-                           line="\$(grep -w "\$s" "\$name"_dmd.out)"
-                           echo "\$line" | awk '{print \$10}' >> evalue.list
-                           echo "\$line" | awk '{print \$11}' >> bit.list
-                           echo "\$line" | awk '{print \$12}' >> pid.list
-                           echo "\$line" | awk '{print \$2}' >> length.list
-                           echo "Extracting virus and gene ID for \$s now"
-                           gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') &&
-                           echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list
-                           virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') &&
-                           echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list
-                           echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                           j=\$((\$j+1))
+                       echo ">"\${s}"_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+                       if [[ "${params.lca}" == "T" ]]
+                       then    if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]]
+                               then    group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}')
+                                       lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}')
+                                       echo "\$lcla" >> lca_classification.list
+                               else    echo "Viruses" >> lca_classification.list
+                               fi
+                       fi
+                       if [[ ${params.ncbitax} == "true" ]]
+                       then    echo "\$line" | awk -F "\t" '{print \$14","\$16"::"\$18"::"\$17}' >> ncbi_classification.list
+                       fi
                        echo "\$s done."
-                       else
+                   else
                        echo "Ugh, there was no hit for \$s .."
                        echo "We still love \$s though and we will add it to the final fasta file"
                        echo "\$s" >> otu.list
@@ -1092,17 +2035,19 @@ if (params.Analyze) {
                        echo "NO_HIT" >> length.list
                        virus="NO"
                        gene="HIT"
-                       echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
-                       j=\$((\$j+1))
+                       echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+                       if [[ "${params.lca}" == "T" ]]
+                       then    echo "N/A" >> lca_classification.list
+                       fi
+                       if [[ "${params.ncbitax}" == "true" ]]
+                       then    echo "N/A" >> ncbi_classification.list
+                       fi
                        echo "\$s done."
-                   fi
-                   fi
-                   echo "Done with \$s"
-                   done
-               fi
+                   fi
+                done
                echo "Now editing "\$name" fasta headers"
                ###### rename_seq.py
-              ./rename_seq.py \${filename} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta
+              ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta
               awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta
               echo "[Sequence header]" > newnames.list
               cat new_"\$name"_asvnames.txt >> newnames.list
@@ -1110,9 +2055,27 @@ if (params.Analyze) {
               touch sequence.list
               echo " " > sequence.list
               grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list
               rm "\$name"_tmpssasv.fasta
-              paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloseqObject.csv
-              paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
-              for x in *phyloseqObject.csv;do
+              if [[ "${params.lca}" == "T" && "${params.ncbitax}" == "true" ]]
+              then
+                  paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+                  paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+                  paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+              elif [[ "${params.lca}" == "T" && "${params.ncbitax}" != "true" ]]
+              then
+                  paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+                  paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+                  paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv
+              elif [[ "${params.ncbitax}" == "true" && "${params.lca}" != "T" ]]
+              then
+                  paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+                  paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+                  paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+              else
+                  paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+                  paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+                  echo "skipped" >> \${name}_quick_Taxbreakdown.csv
+              fi
+              for x in *phyloformat.csv;do
                  echo "\$x"
                  lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1))
                  tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list;
@@ -1120,345 +2083,183 @@ if (params.Analyze) {
                  sed 's/ /_/g' tmpcol.list > tmp2col.list;
                  cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv;
                  rm tmpcol.list tmp2col.list
              done
-             rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst otu.list ;
-             rm *asvnames.txt
-             rm "\$name"_virus.list
-             rm "\$name"_genes.list
-             rm newnames.list
-             rm access.list
-             echo "Taxonomy inferred for: \${filename} "
-             done
-             rm headers.list
-             """
-       }
+             awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv
+             rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list
+             """
+        }
+      } else if (params.dbtype == "RVDB") {
-      } else {
+        process ASV_Taxonomy_Inference_RVDB {
-          process ASV_Taxonomy_Inference {
+            label 'high_cpus'
-              label 'high_cpus'
+            publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy", mode: "copy", overwrite: true, pattern: '*_ASV*.{fasta,csv,tsv}'
+            publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*_ASV*dmd.out'
-              publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy", mode: "copy", overwrite: true, pattern: '*_ASV*.{fasta,csv,tsv}'
-              publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*_ASV*dmd.out'
+            input:
+                file(asvs) from nuclFastas_forDiamond_asv_ch
-              input:
-                  file(reads) from nuclFastas_forDiamond_ch
+            output:
+                file("*.fasta") into tax_labeled_fasta_asv
+                tuple file("*_phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond_asv
+                file("*_ASV*_summary_for_plot.csv") into taxplot_asv
+                file("*_quick_Taxbreakdown.csv") into tax_table_asv
+                file ("*_quicker_taxbreakdown.csv") into tax_nodCol_asv
-              output:
-                  file("*.fasta") into tax_labeled_fasta
-                  tuple file("*_phyloseqObject.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_diamond
-                  file("*_ASV*_summary_for_plot.csv") into taxplot1
-              script:
-                  """
-                  cp ${params.vampdir}/bin/rename_seq.py .
- virdb=${params.dbdir}/${params.dbname} - grep ">" \${virdb} > headers.list - headers="headers.list" - for filename in ${reads};do - name=\$(ls \${filename} | awk -F ".fasta" '{print \$1}') - diamond blastx -q \${filename} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 - echo "Preparing lists to generate summary .csv's" - echo "[Best hit accession number]" > access.list - echo "[e-value]" > evalue.list - echo "[Bitscore]" > bit.list - echo "[Percent ID (aa)]" > pid.list - echo "[Organism ID]" > "\$name"_virus.list - echo "[Gene]" > "\$name"_genes.list - grep ">" \${filename} | awk -F ">" '{print \$2}' > seqids.lst - echo "extracting genes and names" - touch new_"\$name"_asvnames.txt - j=1 - if [ `echo \${filename} | grep -c "ncASV"` -eq 1 ];then - echo "[ASV#]" > otu.list - echo "[ASV sequence length]" > length.list - for s in \$(cat seqids.lst);do - echo "Checking for \$s hit in diamond output" - if [[ ${params.refseq} == "T" ]];then - echo "RefSeq headers specified" - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >> evalue.list - echo "\$line" | awk '{print \$11}' >> bit.list - echo "\$line" | awk '{print \$12}' >> pid.list - echo "\$line" | awk '{print \$2}' >> length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >> access.list - echo "NO_HIT" >> "\$name"_genes.list - echo "NO_HIT" >> "\$name"_virus.list - echo "NO_HIT" >> evalue.list - echo "NO_HIT" >> bit.list - echo "NO_HIT" >> pid.list - echo "NO_HIT" >> length.list - virus="NO" - gene="HIT" - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - else - echo "Using RVDB headers." 
- if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >> evalue.list - echo "\$line" | awk '{print \$11}' >> bit.list - echo "\$line" | awk '{print \$12}' >> pid.list - echo "\$line" | awk '{print \$2}' >> length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >> access.list - echo "NO_HIT" >> "\$name"_genes.list - echo "NO_HIT" >> "\$name"_virus.list - echo "NO_HIT" >> evalue.list - echo "NO_HIT" >> bit.list - echo "NO_HIT" >> pid.list - echo "NO_HIT" >> length.list - virus="NO" - gene="HIT" - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - fi - echo "Done with \$s" - done - else - for s in \$(cat seqids.lst);do - echo "[ASV#]" > otu.list - echo "[ASV sequence length]" > length.list - echo "Checking for \$s hit in diamond output" - if [[ ${params.refseq} == "T" ]];then - echo "RefSeq headers specified" - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >> evalue.list - echo "\$line" | awk '{print \$11}' >> bit.list - echo "\$line" | awk '{print \$12}' >> pid.list - echo "\$line" | awk '{print \$2}' >> length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >> access.list - echo "NO_HIT" >> "\$name"_genes.list - echo "NO_HIT" >> "\$name"_virus.list - echo "NO_HIT" >> evalue.list - echo "NO_HIT" >> bit.list - echo "NO_HIT" >> pid.list - echo "NO_HIT" >> length.list - virus="NO" - gene="HIT" - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - else - echo "Using RVDB headers." 
- if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >> evalue.list - echo "\$line" | awk '{print \$11}' >> bit.list - echo "\$line" | awk '{print \$12}' >> pid.list - echo "\$line" | awk '{print \$2}' >> length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >> access.list - echo "NO_HIT" >> "\$name"_genes.list - echo "NO_HIT" >> "\$name"_virus.list - echo "NO_HIT" >> evalue.list - echo "NO_HIT" >> bit.list - echo "NO_HIT" >> pid.list - echo "NO_HIT" >> length.list - virus="NO" - gene="HIT" - echo ">ASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - fi - echo "Done with \$s" - done - fi - echo "Now editing "\$name" fasta headers" - ###### rename_seq.py - ./rename_seq.py \${filename} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta - awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta - echo "[Sequence header]" > newnames.list - cat new_"\$name"_asvnames.txt >> newnames.list - touch sequence.list - echo " " > sequence.list - grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list - rm "\$name"_tmpssasv.fasta - paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloseqObject.csv - paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv - for x in *phyloseqObject.csv;do - echo "\$x" - lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) - tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; - sed 's/ /_/g' tmpcol.list > tmp2col.list; - cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; - rm tmpcol.list tmp2col.list - done - rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst otu.list ; - rm *asvnames.txt - rm "\$name"_virus.list - rm "\$name"_genes.list - rm newnames.list - rm access.list - echo "Taxonomy inferred for: \${filename} " - done - rm headers.list - """ + script: + """ + cp ${params.vampdir}/bin/rename_seq.py . 
+ virdb=${params.dbdir}/${params.dbname} + grep ">" \${virdb} > headers.list + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[ASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Using RVDB headers." + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." + echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + echo "\$s done." 
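+                        # [annotation added in review, not original pipeline code] NO_HIT placeholder
+                        # rows are appended deliberately: every per-sequence list file must keep one
+                        # row per ASV so the paste commands below can assemble the summary tables
+                        # column by column.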
+ fi + echo "Done with \$s" + done + echo "Now editing "\$name" fasta headers" + ###### rename_seq.py + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta + echo "[Sequence header]" > newnames.list + cat new_"\$name"_asvnames.txt >> newnames.list + touch sequence.list + echo " " > sequence.list + grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list + rm "\$name"_tmpssasv.fasta + if [[ "${params.lca}" == "T" ]] + then paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv + else paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ } } } - if (params.ncASV) { - - process Generate_Counts_Tables_Nucleotide { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Counts", mode: "copy", overwrite: true, pattern: '*_ASV*.{biome,csv}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Counts", mode: "copy", overwrite: true, pattern: '*ncASV*.{biome,csv}' - - input: - tuple file(notus), file(asvs) from nuclFastas_forCounts_ch - file(merged) from nuclCounts_mergedreads_ch - - output: - tuple file("*_counts.csv"), file("*_counts.biome") into counts_vsearch - file("*ncASV*counts.csv") into notu_counts_plots - file("*_ASV*counts.csv") into asv_counts_plots - - script: - """ - for filename in ${notus};do - if [ `echo \${filename} | grep -c "ncASV"` -eq 1 ];then - ident=\$( echo \${filename} | awk -F "ncASV" '{print \$2}' | awk -F ".fasta" '{print \$1}') - name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') - vsearch --usearch_global ${merged} --db \${filename} --id \${ident} --threads ${task.cpus} --otutabout \${name}_counts.txt --biomout \${name}_counts.biome - cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv - sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv - rm \${name}_count.csv - fi - done - if [ `echo ${asvs} | grep -c "ASV"` -eq 1 ];then - name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') - vsearch --usearch_global ${merged} --db ${asvs} --id ${params.asvcountID} --threads ${task.cpus} 
--otutabout "\$name"_counts.txt --biomout "\$name"_counts.biome - cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv - sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv - rm \${name}_count.csv - fi - """ - } - } else { - process Generate_ASV_Counts_Tables { - - label 'norm_cpus' + process Generate_ASV_Counts_Tables { - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Counts", mode: "copy", overwrite: true, pattern: '*ASV*.{biome,csv}' + label 'norm_cpus' - input: - file(asvs) from nuclFastas_forCounts_ch - file(merged) from nuclCounts_mergedreads_ch + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Counts", mode: "copy", overwrite: true, pattern: '*ASV*.{biome,csv}' - output: - tuple file("*_counts.csv"), file("*_counts.biome") into counts_vsearch - file("*_ASV*counts.csv") into asv_counts_plots + input: + file(asvs) from nuclFastas_forCounts_asv_ch + file(merged) from nuclCounts_mergedreads_asv_ch - script: - """ - if [ `echo ${asvs} | grep -c "ASV"` -eq 1 ];then - name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') - vsearch --usearch_global ${merged} --db ${asvs} --id ${params.asvcountID} --threads ${task.cpus} --otutabout "\$name"_counts.txt --biomout "\$name"_counts.biome - cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv - sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv - rm \${name}_count.csv - fi - """ - } - } + output: + tuple file("*_counts.csv"), file("*_counts.biome") into counts_vsearch_asv + file("*_ASV*counts.csv") into (asv_counts_plots, asvcount_med) - if (params.ncASV) { + script: + """ + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}' | sed 's/ASVs/ASV/g') + vsearch --usearch_global ${merged} --db ${asvs} --id .${params.asvcountID} --threads ${task.cpus} --otutabout "\$name"_counts.txt --biomout "\$name"_counts.biome + cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv + sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv + rm \${name}_count.csv + """ + } - process Generate_Nucleotide_Matrix { + process Generate_ASV_Matrices { label 'low_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Matrix", mode: "copy", overwrite: true, pattern: '*_ASV*PercentID.matrix' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Matrix", mode: "copy", overwrite: true, pattern: '*ncASV*PercentID.matrix' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Matrices", mode: "copy", overwrite: true, pattern: '*ASV*PercentID.matrix' input: - tuple file(notus), file(asvs) from nuclFastas_forMatrix_ch + file(reads) from nuclFastas_forMatrix_asv_ch output: - file("*.matrix") into clustmatrices - file("*ncASV*PercentID.matrix") into notu_heatmap + file("*.matrix") into clustmatrices_asv file("*_ASV*PercentID.matrix") into asv_heatmap script: // remove if statement later (no fin) """ - for filename in ${notus};do + for filename in ${reads};do if [ `echo \${filename} | grep -c "ncASV"` -eq 1 ];then ident=\$( echo \${filename} | awk -F "ncASV" '{print \$2}' | awk -F ".fasta" '{print \$1}') name=\$( echo \${filename}| awk -F ".fasta" '{print \$1}') @@ -1473,200 +2274,216 @@ if (params.Analyze) { cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix rm \${pre}z.matrix done + else + name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') + clustalo -i \${filename} --distmat-out=\${name}_PairwiseDistance.matrix --full --force --threads=${task.cpus} + clustalo -i \${filename} 
--distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} + for x in *q.matrix;do + pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') + ya=\$(wc -l \$x | awk '{print \$1}') + echo "\$((\$ya-1))" + tail -"\$((\$ya-1))" \$x > \${pre}z.matrix + rm \$x + cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix + rm \${pre}z.matrix + done fi done - if [ `echo ${asvs} | grep -c "_ASV"` -eq 1 ];then - name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') - clustalo -i ${asvs} --distmat-out=\${name}_PairwiseDistance.matrix --full --force --threads=${task.cpus} - clustalo -i ${asvs} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} - for x in *q.matrix;do - pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') - ya=\$(wc -l \$x | awk '{print \$1}') - echo "\$((\$ya-1))" - tail -"\$((\$ya-1))" \$x > \${pre}z.matrix - rm \$x - cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix - rm \${pre}z.matrix - done - fi """ } - } else { - process Generate_ASV_Matrix { + if (!params.skipPhylogeny) { // need to edit paths + + process ASV_Phylogeny { + + label 'norm_cpus' + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*ASV*aln.*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*ASV*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*ASV*iq*' + + input: + file(asvs) from nuclFastas_forphylogeny_asv + + output: + tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into align_results_asv + file("*iq.treefile") into (nucl_phyl_plot_asv, asvphy_med) + + script: + """ + pre=\$(echo ${asvs} | awk -F ".fasta" '{print \$1}' ) + ${tools}/muscle5.0.1278_linux64 -in ${asvs} -out \${pre}_ALN.fasta -threads ${task.cpus} -quiet + trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html + o-trim-uninformative-columns-from-alignment \${pre}_aln.fasta + mv \${pre}_aln.fasta-TRIMMED ./\${pre}_Aligned_informativeonly.fasta + # Nucleotide_ModelTest + modeltest-ng -i \${pre}_Aligned_informativeonly.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint + # Nucleotide_Phylogeny + if [ "${params.iqCustomnt}" != "" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomnt} + elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} + elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} + elif [ "${params.parametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} 
-bnni
+                else
+                    iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni
+                fi
+                """
+        }
+    }
+
+    if (params.asvMED) {
+
+        process ASV_Minimum_Entropy_Decomposition {

            label 'low_cpus'

+           publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/ASVs/MED", mode: "copy", overwrite: true

            input:
+               file(asvs) from asv_for_med

            output:
+               file("*_ASV_Grouping.csv") into asvgroupscsv
+               file("${params.projtag}_ASV_group_reps_aligned.fasta") into groupreps
+               file("${params.projtag}_asvMED_${params.asvC}")

            script:
                """
+               #alignment
+               ${tools}/muscle5.0.1278_linux64 -in ${asvs} -out ${params.projtag}_ASVs_muscleAlign.fasta -threads ${task.cpus} -quiet
+               #trimming
+               trimal -in ${params.projtag}_ASVs_muscleAlign.fasta -out ${params.projtag}_ASVs_muscleAligned.fasta -keepheader -fasta -automated1
+               rm ${params.projtag}_ASVs_muscleAlign.fasta
+               o-trim-uninformative-columns-from-alignment ${params.projtag}_ASVs_muscleAligned.fasta
+               mv ${params.projtag}_ASVs_muscleAligned.fasta-TRIMMED ./${params.projtag}_ASVs_Aligned_informativeonly.fasta
+               #entropy analysis
+               entropy-analysis ${params.projtag}_ASVs_Aligned_informativeonly.fasta
+               #Decomposition
+               if [[ \$(echo ${params.asvC} | grep -c ",") -eq 1 ]]
+               then
+                    tag=\$(echo ${params.asvC} | sed 's/,/_/g')
+                    oligotype ${params.projtag}_ASVs_Aligned_informativeonly.fasta ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_asvMED_"\$tag" -M 1 -C ${params.asvC} -N ${task.cpus} --skip-check-input --no-figures --skip-gen-html
+               elif [[ "${params.asvSingle}" == "true" ]]
+               then
+                    tag="${params.asvC}"
+                    oligotype ${params.projtag}_ASVs_Aligned_informativeonly.fasta ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_asvMED_"\$tag" -M 1 -C ${params.asvC} -N
${task.cpus} --skip-check-input --no-figures --skip-gen-html + else + oligotype ${params.projtag}_ASVs_Aligned_informativeonly.fasta ${params.projtag}_ASVs_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_asvMED_${params.asvC} -M 1 -c ${params.asvC} -N ${task.cpus} --skip-check-input --no-figures --skip-gen-html + fi + #generatemaps + cd ./${params.projtag}_asvMED_${params.asvC}/OLIGO-REPRESENTATIVES/ + echo "ASV,GroupID,IDPattern" + j=1 + for x in *_unique; + do gid=\$(echo \$x | awk -F "_" '{print \$1}') + uni=\$(echo \$x | awk -F ""\${gid}"_" '{print \$2}' | awk -F "_uni" '{print \$1}') + grep ">" "\$gid"_"\$uni" | awk -F ">" '{print \$2}' > asv.list + seqtk subseq ../../${asvs} asv.list > Group"\${j}"_sequences.fasta + for z in \$( cat asv.list) + do echo ""\$z",Group"\$j","\$uni"" >> ${params.projtag}_ASV_Grouping.csv + done - fi + rm asv.list + echo ">Group\${j}" >> ${params.projtag}_ASV_group_reps_aligned.fasta + echo "\$uni" > group.list + seqtk subseq ../OLIGO-REPRESENTATIVES.fasta group.list > group.fasta + tail -1 group.fasta >> ${params.projtag}_ASV_group_reps_aligned.fasta + mv "\$gid"_"\$uni" ./Group"\$j"_"\$uni"_aligned.fasta + mv "\$gid"_"\$uni"_unique ./Group"\$j"_"\$uni"_unqiues_aligned.fasta + rm "\$gid"*.cPickle + j=\$((\$j+1)) done + mv ${params.projtag}_ASV_Grouping.csv ../../ + mv ${params.projtag}_ASV_group_reps_aligned.fasta ../../ + cd .. """ } - } - if (!params.skipPhylogeny) { // need to edit paths + if (!params.skipPhylogeny) { - if (params.ncASV) { + process ASV_MED_Reps_phylogeny { - process Nucleotide_Phylogeny { + label 'low_cpus' - label 'norm_cpus' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/MED/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*ASV*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/MED/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*ASV*iq*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*ncASV*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*ncASV*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ncASV/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*ncASV*iq*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*_ASV*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*_ASV*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*_ASV*iq*' + input: + file(reps) from groupreps - input: - tuple file(notus), file(asvs) from nuclFastas_forphylogeny + output: + file("*_ASV_Group_Reps*") into align_results_asvmed + file("*iq.treefile") into asv_group_rep_tree - output: - tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into align_results - file("*iq.treefile") into nucl_phyl_plot + script: + """ + # Protein_ModelTest + modeltest-ng -i ${reps} -p ${task.cpus} -o ${params.projtag}_ASV_Group_Reps_mt -d aa -s 203 --disable-checkpoint - script: - """ - for filename in ${notus};do - pre=\$(echo \${filename} | awk -F ".fasta" '{print \$1}' ) - mafft --thread ${task.cpus} --maxiterate 15000 --auto \${filename} 
>\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html - # Nucleotide_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint - # Nucleotide_Phylogeny - if [ "${params.iqCustomnt}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -t \${pre}_mt.tree -T auto ${params.iqCustomnt} - elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} - elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni - elif [ "${params.nonparametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} - elif [ "${params.parametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni - else - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni - fi - done - for filename in ${asvs};do - pre=\$(echo \${filename} | awk -F ".fasta" '{print \$1}' ) - mafft --thread ${task.cpus} --maxiterate 15000 --auto \${filename} >\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html - # Nucleotide_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint - # Nucleotide_Phylogeny - if [ "${params.iqCustomnt}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomnt} - elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} - elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - elif [ "${params.nonparametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} - elif [ "${params.parametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - else - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - fi - done - """ - } - } else { + # Protein_Phylogeny + if [ "${params.iqCustomaa}" != "" ];then + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq --redo -T auto ${params.iqCustomaa} - process ASV_Phylogeny { + elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 ${reps}.log | head -1 | awk '{print \$6}') + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq -m \${mod} --redo -nt auto -b ${params.boots} - label 'norm_cpus' + elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 
${reps}.log | head -1 | awk '{print \$6}') + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*ASV*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*ASV*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*ASV*iq*' + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq -m MFP --redo -nt auto -b ${params.boots} - input: - file(asvs) from nuclFastas_forphylogeny + elif [ "${params.parametric}" != "false" ];then + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - output: - tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into align_results - file("*iq.treefile") into nucl_phyl_plot + else + iqtree -s ${reps} --prefix ${params.projtag}_ASV_Group_Reps_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + fi + """ + } + } + process Adding_ASV_MED_Info { - script: - """ - for filename in ${asvs};do - pre=\$(echo \${filename} | awk -F ".fasta" '{print \$1}' ) - mafft --thread ${task.cpus} --maxiterate 15000 --auto \${filename} >\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html - # Nucleotide_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint - # Nucleotide_Phylogeny - if [ "${params.iqCustomnt}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomnt} - elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} - elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - elif [ "${params.nonparametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} - elif [ "${params.parametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - else - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - fi - done - """ - } - } - } + label 'low_cpus' + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/ASVs/MED/", mode: "copy", overwrite: true + + input: + file(counts) from asvcount_med + file(map) from asvgroupscsv + + output: + file("${params.projtag}_ASV_Groupingcounts.csv") into asvgroupcounts - if (!params.skipAminoTyping) { + script: + """ + awk -F "," '{print \$1}' ${counts} | sed '1d' > asv.list + echo "GroupID" >> group.list + for x in \$(cat asv.list); + do group=\$(grep -w \$x ${map} | awk -F "," '{print \$2}') + echo "\$group" >> group.list + done + paste -d',' group.list ${counts} > ${params.projtag}_ASV_Groupingcounts.csv + """ + } + } else { + asvgroupscsv = Channel.empty() + 
asv_group_rep_tree = Channel.empty() + asvgroupcounts = Channel.empty() + } - if (params.sing) { + if (!params.skipAminoTyping) { - process Translating_For_Aminotypes { + process Translate_For_AminoTyping { label 'low_cpus' @@ -1681,2084 +2498,1862 @@ if (params.Analyze) { script: """ - conda init && source activate virtualribosome - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_all_translations.fasta --report ${params.projtag}_translation_report """ + } - } - - } else { - - process Translate_For_AminoTyping { + process Generate_AminoTypes { - label 'low_cpus' + label 'norm_cpus' - conda 'python=2.7' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{clstr,csv,gc}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/Problematic", mode: "copy", overwrite: true, pattern: '*problematic*.{fasta}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes", mode: "copy", overwrite: true, pattern: '*AminoTypes_noTaxonomy.{fasta}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/Translation", mode: "copy", overwrite: true + input: + file(prot) from amintypegen + file(asvs) from asvaminocheck - input: - file(fasta) from asvsforAminotyping + output: + tuple file("*.fasta"), file("${params.projtag}_AminoTypes.clstr"), file("${params.projtag}_clustered.gc") into ( supplementalfiles ) + file("${params.projtag}_AminoTypes_noTaxonomy.fasta") into ( aminotypesCounts, aminotypesMafft, aminotypesClustal, aminotypesBlast, aminotypesEmboss, aminos_for_med ) + file("${params.projtag}_AminoType_summary_map.csv") into aminomapmed - output: - file("${params.projtag}_all_translations.fasta") into amintypegen - file("${params.projtag}_translation_report") into proteinstage_vap_report + script: + """ + set +e + cp ${params.vampdir}/bin/rename_seq.py . + awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${prot} >${params.projtag}_filtered_translations.fasta + awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${prot} >${params.projtag}_problematic_translations.fasta + if [ `wc -l ${params.projtag}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then + grep ">" ${params.projtag}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list + seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_problematic_nucleotides.fasta + else + rm ${params.projtag}_problematic_translations.fasta + fi + cd-hit -i ${params.projtag}_filtered_translations.fasta -c 1.0 -o ${params.projtag}_unlabeled_types.fasta + sed 's/>Cluster />Cluster_/g' ${params.projtag}_unlabeled_types.fasta.clstr >${params.projtag}_AminoTypes.clstr + grep ">Cluster_" ${params.projtag}_AminoTypes.clstr >tmpclusters.list + grep -w "*" ${params.projtag}_AminoTypes.clstr | awk '{print \$3}' | awk -F "." 
'{print \$1}' >tmphead.list + grep -w "*" ${params.projtag}_AminoTypes.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list + paste -d"," tmpclusters.list tmphead.list >tmp.info.csv + grep ">" ${params.projtag}_unlabeled_types.fasta >lala.list + j=1 + for x in \$(cat lala.list);do + echo ">${params.projtag}_AminoType\${j}" >>${params.projtag}_aminoheaders.list + echo "\${x},>${params.projtag}_AminoType\${j}" >>tmpaminotype.info.csv + j=\$(( \${j}+1 )) + done + rm lala.list + awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list + for x in \$(cat tmporder.list);do + grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list + done + paste -d "," tmpclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_AminoType_summary_map.csv + rm tmp* + ./rename_seq.py ${params.projtag}_unlabeled_types.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_AminoTypes_noTaxonomy.fasta + stats.sh in=${params.projtag}_AminoTypes_noTaxonomy.fasta gc=${params.projtag}_clustered.gc gcformat=4 + """ + } - script: - """ - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_all_translations.fasta --report ${params.projtag}_translation_report - """ - } + process Generate_AminoType_Matrices { - } + label 'low_cpus' - process Generate_AminoTypes { + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Matrices", mode: "copy", overwrite: true - label 'norm_cpus' + input: + file(prot) from aminotypesClustal - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{clstr,csv,gc}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/Problematic", mode: "copy", overwrite: true, pattern: '*problematic*.{fasta}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes", mode: "copy", overwrite: true, pattern: '*AminoTypes_noTaxonomy.{fasta}' + output: + file("*.matrix") into proclustmatrices + file("*PercentID.matrix") into aminotype_heatmap - input: - file(prot) from amintypegen - file(asvs) from asvaminocheck + script: + """ + name=\$( echo ${prot} | awk -F "_noTax" '{print \$1}') + clustalo -i ${prot} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus} + clustalo -i ${prot} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} + for x in *q.matrix;do + pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') + ya=\$(wc -l \$x | awk '{print \$1}') + echo "\$((\$ya-1))" + tail -"\$(( \$ya-1))" \$x > \${pre}z.matrix + rm \$x + cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix + rm \${pre}z.matrix + done + """ + } - output: - tuple file("*.fasta"), file("${params.projtag}_AminoTypes.clstr"), file("${params.projtag}_AminoType_summary_map.csv"), file("${params.projtag}_clustered.gc") into ( supplementalfiles ) - file("${params.projtag}_AminoTypes_noTaxonomy.fasta") into ( aminotypesCounts, aminotypesMafft, aminotypesClustal, aminotypesBlast, aminotypesEmboss ) + if (!params.skipEMBOSS) { - script: - """ - set +e - cp ${params.vampdir}/bin/rename_seq.py . 
- awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${prot} >${params.projtag}_filtered_translations.fasta - awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${prot} >${params.projtag}_problematic_translations.fasta - if [ `wc -l ${params.projtag}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then - grep ">" ${params.projtag}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list - seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_problematic_nucleotides.fasta - else - rm ${params.projtag}_problematic_translations.fasta - fi - cd-hit -i ${params.projtag}_filtered_translations.fasta -c 1.0 -o ${params.projtag}_unlabeled_types.fasta - sed 's/>Cluster />Cluster_/g' ${params.projtag}_unlabeled_types.fasta.clstr >${params.projtag}_AminoTypes.clstr - grep ">Cluster_" ${params.projtag}_AminoTypes.clstr >tmpclusters.list - grep -w "*" ${params.projtag}_AminoTypes.clstr | awk '{print \$3}' | awk -F "." '{print \$1}' >tmphead.list - grep -w "*" ${params.projtag}_AminoTypes.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list - paste -d"," tmpclusters.list tmphead.list >tmp.info.csv - grep ">" ${params.projtag}_unlabeled_types.fasta >lala.list - j=1 - for x in \$(cat lala.list);do - echo ">${params.projtag}_AminoType\${j}" >>${params.projtag}_aminoheaders.list - echo "\${x},>${params.projtag}_AminoType\${j}" >>tmpaminotype.info.csv - j=\$(( \${j}+1 )) - done - rm lala.list - awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list - for x in \$(cat tmporder.list);do - grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list - done - paste -d "," tmpclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_AminoType_summary_map.csv - rm tmp* - ./rename_seq.py ${params.projtag}_unlabeled_types.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_AminoTypes_noTaxonomy.fasta - stats.sh in=${params.projtag}_AminoTypes_noTaxonomy.fasta gc=${params.projtag}_clustered.gc gcformat=4 - """ - } + process AminoType_EMBOSS_Analyses { - process Generate_AminoType_Matrix { + label 'low_cpus' - label 'low_cpus' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/2dStructure", mode: "copy", overwrite: true, pattern: '*.{garnier}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/HydrophobicMoment", mode: "copy", overwrite: true, pattern: '*HydrophobicMoments.{svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/IsoelectricPoint", mode: "copy", overwrite: true, pattern: '*IsoelectricPoint.{iep,svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/ProteinProperties", mode: "copy", overwrite: true, pattern: '*.{pepstats,pepinfo}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/ProteinProperties/Plots", mode: "copy", overwrite: true, pattern: '*PropertiesPlot.{svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/2dStructure/Plots", mode: "copy", overwrite: true, pattern: '*Helical*.{svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Matrix", mode: "copy", overwrite: true + input: + file(prot) from aminotypesEmboss - input: - file(prot) from aminotypesClustal + output: + tuple file("*.garnier"), file("*HydrophobicMoments.svg"), file("*IsoelectricPoint*"), file("*.pepstats"), file("*PropertiesPlot*"), file("*Helical*") into 
amino_emboss - output: - file("*.matrix") into proclustmatrices - file("*PercentID.matrix") into aminotype_heatmap + script: + """ + name=\$( echo ${prot} | awk -F ".fasta" '{print \$1}') + garnier -sequence ${prot} -outfile \${name}_2dStructures.garnier + hmoment -seqall ${prot} -graph svg -plot + mv hmoment.svg ./"\${name}"_HydrophobicMoments.svg + iep -sequence ${prot} -graph svg -plot -outfile "\${name}"_IsoelectricPoint.iep + mv iep.svg ./"\${name}"_IsoelectricPoint.svg + pepstats -sequence ${prot} -outfile \${name}_ProteinProperties.pepstats + grep ">" ${prot} | awk -F ">" '{print \$2}' > tmpsequence.list + for x in \$(cat tmpsequence.list);do + echo \$x > tmp1.list + seqtk subseq ${prot} tmp1.list > tmp2.fasta + len=\$(tail -1 tmp2.fasta | awk '{print length}') + pepinfo -sequence tmp2.fasta -graph svg -outfile "\$x"_PropertiesPlot.pepinfo + mv pepinfo.svg ./"\$x"_PropertiesPlot.svg + cat "\$x"_PropertiesPlot.pepinfo >> "\${name}"_PropertiesPlot.pepinfo + rm "\$x"_PropertiesPlot.pepinfo + pepnet -sask -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len + mv pepnet.svg ./"\$x"_HelicalNet.svg + pepwheel -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len + mv pepwheel.svg ./"\$x"_HelicalWheel.svg + rm tmp1.list tmp2.fasta + done + rm tmpsequence.list + """ + } + } - script: - """ - name=\$( echo ${prot} | awk -F ".fasta" '{print \$1}') - clustalo -i ${prot} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus} - clustalo -i ${prot} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} - for x in *q.matrix;do - pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') - ya=\$(wc -l \$x | awk '{print \$1}') - echo "\$((\$ya-1))" - tail -"\$(( \$ya-1))" \$x > \${pre}z.matrix - rm \$x - cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix - rm \${pre}z.matrix - done - """ - } + if (!params.skipTaxonomy) { - if (!params.skipEMBOSS) { + if (params.dbtype == "NCBI") { - process AminoType_EMBOSS_Analyses { + process AminoType_Taxonomy_Inference_NCBI { - label 'low_cpus' + label 'high_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/2dStructure", mode: "copy", overwrite: true, pattern: '*.{garnier}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/HydrophobicMoment", mode: "copy", overwrite: true, pattern: '*HydrophobicMoments.{svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/IsoelectricPoint", mode: "copy", overwrite: true, pattern: '*IsoelectricPoint.{iep,svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/ProteinProperties", mode: "copy", overwrite: true, pattern: '*.{pepstats,pepinfo}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/ProteinProperties/Plots", mode: "copy", overwrite: true, pattern: '*PropertiesPlot.{svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/EMBOSS/2dStructure/Plots", mode: "copy", overwrite: true, pattern: '*Helical*.{svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*TaxonomyLabels.fasta' - input: - file(prot) from aminotypesEmboss + input: + file(asvs) from aminotypesBlast - output: - tuple 
file("*.garnier"), file("*HydrophobicMoments.svg"), file("*IsoelectricPoint*"), file("*.pepstats"), file("*PropertiesPlot*"), file("*Helical*") into amino_emboss + output: + tuple file("*_phyloformat.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AA_diamond + file("*_summary_for_plot.csv") into taxplot2 + file("*TaxonomyLabels.fasta") into tax_labeled_fasta2 + file("*_quick_Taxbreakdown.csv") into tax_table_amino + file ("*_quicker_taxbreakdown.csv") into tax_nodCol_amino - script: - """ - name=\$( echo ${prot} | awk -F ".fasta" '{print \$1}') - garnier -sequence ${prot} -outfile \${name}_2dStructures.garnier - hmoment -seqall ${prot} -graph svg -plot - mv hmoment.svg ./"\${name}"_HydrophobicMoments.svg - iep -sequence ${prot} -graph svg -plot -outfile "\${name}"_IsoelectricPoint.iep - mv iep.svg ./"\${name}"_IsoelectricPoint.svg - pepstats -sequence ${prot} -outfile \${name}_ProteinProperties.pepstats - grep ">" ${prot} | awk -F ">" '{print \$2}' > tmpsequence.list - for x in \$(cat tmpsequence.list);do - echo \$x > tmp1.list - seqtk subseq ${prot} tmp1.list > tmp2.fasta - len=\$(tail -1 tmp2.fasta | awk '{print length}') - pepinfo -sequence tmp2.fasta -graph svg -outfile "\$x"_PropertiesPlot.pepinfo - mv pepinfo.svg ./"\$x"_PropertiesPlot.svg - cat "\$x"_PropertiesPlot.pepinfo >> "\${name}"_PropertiesPlot.pepinfo - rm "\$x"_PropertiesPlot.pepinfo - pepnet -sask -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len - mv pepnet.svg ./"\$x"_HelicalNet.svg - pepwheel -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len - mv pepwheel.svg ./"\$x"_HelicalWheel.svg - rm tmp1.list tmp2.fasta - done - rm tmpsequence.list - """ + script: + """ + cp ${params.vampdir}/bin/rename_seq.py . + virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + if [[ ${params.ncbitax} == "true" ]] + then diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop staxids sskingdoms skingdoms sphylums --max-target-seqs 1 --max-hsps 1 + else diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + fi + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[AminoType#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "[NCBI Taxonomy ID],[Taxonomic classification 
from NCBI]" > ncbi_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Checking for \$s hit in diamond output" + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">"\${s}"_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "\$line" | awk -F "\t" '{print \$14","\$16"::"\$18"::"\$17}' >> ncbi_classification.list + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." + echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + if [[ "${params.ncbitax}" == "true" ]] + then echo "N/A" >> ncbi_classification.list + fi + echo "\$s done." 
+ fi
+ done
+ echo "Now editing "\$name" fasta headers"
+ ###### rename_seq.py
+ ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta
+ awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta
+ echo "[Sequence header]" > newnames.list
+ cat new_"\$name"_asvnames.txt >> newnames.list
+ touch sequence.list
+ echo " " > sequence.list
+ grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list
+ rm "\$name"_tmpssasv.fasta
+ if [[ "${params.lca}" == "T" && "${params.ncbitax}" == "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.lca}" == "T" && "${params.ncbitax}" != "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.ncbitax}" == "true" && "${params.lca}" != "T" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+ else
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ echo "skipped" >> \${name}_quick_Taxbreakdown.csv
+ fi
+ for x in *phyloformat.csv;do
+ echo "\$x"
+ lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1))
+ tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list;
+ sed 's/ /_/g' tmpcol.list > tmp2col.list;
+ cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv;
+ rm tmpcol.list tmp2col.list
+ done
+ awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv
+ rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list
+ """
+ }
+ } else if (params.dbtype == "RVDB") {
+
+ process AminoType_Taxonomy_Inference_RVDB {
+
+ label 'high_cpus'
+
+ publishDir 
"${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*TaxonomyLabels.fasta' + + input: + file(asvs) from aminotypesBlast + + output: + tuple file("*_phyloformat.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AA_diamond + file("*_summary_for_plot.csv") into taxplot2 + file("*TaxonomyLabels.fasta") into tax_labeled_fasta2 + file("*_quick_Taxbreakdown.csv") into tax_table_amino + file ("*_quicker_taxbreakdown.csv") into tax_nodCol_amino + + script: + """ + cp ${params.vampdir}/bin/rename_seq.py . + virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[AminoType#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Using RVDB headers." 
+ if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." + echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + echo "\$s done." 
+ fi + echo "Done with \$s" + done + echo "Now editing "\$name" fasta headers" + ###### rename_seq.py + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta + echo "[Sequence header]" > newnames.list + cat new_"\$name"_asvnames.txt >> newnames.list + touch sequence.list + echo " " > sequence.list + grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list + rm "\$name"_tmpssasv.fasta + if [[ "${params.lca}" == "T" ]] + then paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv + else paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } } - } + } - if (!params.skipTaxonomy) { + if (!params.skipPhylogeny) { - process AminoType_Taxonomy_Inference { + process AminoType_Phylogeny { - label 'high_cpus' + label 'norm_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Taxonomy", mode: "copy", overwrite: true, pattern: '*TaxonomyLabels.fasta' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/Modeltest", mode: "copy", overwrite: true, pattern: '*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' - input: - file(reads) from aminotypesBlast + input: + file(prot) from aminotypesMafft - output: - tuple file("*_phyloseqObject.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AA_diamond - file("*_summary_for_plot.csv") into taxplot2 - file("*TaxonomyLabels.fasta") into tax_labeled_fasta2 + output: + tuple file("*_aln.fasta"), file("*_aln.html"), file("*.log"), file("*iq*"), file("*mt*") into alignprot_results + file("*iq.treefile") into (amino_rax_plot, amino_repphy) - script: - """ - cp ${params.vampdir}/bin/rename_seq.py 
. - virdb=${params.dbdir}/${params.dbname} - grep ">" \${virdb} >> headers.list - headers="headers.list" - name=\$(ls ${reads} | awk -F "_noTaxonomy" '{print \$1}') - diamond blastp -q ${reads} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 - echo "Preparing lists to generate summary .csv's" - echo "[Best hit accession number]" >access.list - echo "[pcASV sequence length]" >length.list - echo "[e-value]" >evalue.list - echo "[Bitscore]" >bit.list - echo "[Percent ID (aa)]" >pid.list - echo "[AminoType#]" >otu.list - echo "[Virus ID]" >"\$name"_virus.list - echo "[Gene]" >"\$name"_genes.list - grep ">" ${reads} | awk -F ">" '{print \$2}' > seqids.lst - echo "extracting genes and names" - touch new_"\$name"_asvnames.txt - j=1 - for s in \$(cat seqids.lst);do - echo "Checking for \$s hit in diamond output" - if [[ ${params.refseq} == "T" ]];then - echo "RefSeq headers specified" - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">AminoType\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">AminoType\${j}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - else - echo "Using RVDB headers." 
- if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') - echo "\$s" >>otu.list - echo "\$acc" >>access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') - echo "\$gene" | sed 's/_/ /g' >>"\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >>"\$name"_virus.list - echo ">AminoType\${j}_"\$virus"_"\$gene"" >>new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >>otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">AminoType\${j}_"\$virus"_"\$gene"" >>new_"\$name"_asvnames.txt - j=\$((\$j+1)) - echo "\$s done." - fi - fi - echo "Done with \$s" - done - echo "Now editing "\$name" fasta headers" - ###### rename_seq.py - ./rename_seq.py ${reads} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta - awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta > "\$name"_tmpssasv.fasta - echo "[Sequence header]" > newnames.list - cat new_"\$name"_asvnames.txt >> newnames.list - touch sequence.list - echo " " > sequence.list - grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list - rm "\$name"_tmpssasv.fasta - paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloseqObject.csv - paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv - for x in *phyloseqObject.csv;do - echo "\$x" - lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) - tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; - sed 's/ /_/g' tmpcol.list > tmp2col.list; - cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; - rm tmpcol.list tmp2col.list - done - rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst headers.list otu.list ; - rm *asvnames.txt - rm "\$name"_virus.list - rm "\$name"_genes.list - rm newnames.list - rm access.list - echo "Taxonomy inferred for: ${reads} " - """ - } - } + script: + """ + # Protein_Alignment + pre=\$(echo ${prot} | awk -F "_noTax" '{print \$1}' ) + if [[ \$(grep -c ">" ${prot}) -gt 499 ]]; then algo="super5"; else algo="mpc"; fi + ${tools}/muscle5.0.1278_linux64 -"\${algo}" ${prot} -out \${pre}_ALN.fasta -threads ${task.cpus} -quiet + trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html + o-trim-uninformative-columns-from-alignment \${pre}_aln.fasta + mv \${pre}_aln.fasta-TRIMMED 
./\${pre}_Aligned_informativeonly.fasta + # Protein_ModelTest + modeltest-ng -i \${pre}_Aligned_informativeonly.fasta -p ${task.cpus} -o \${pre}_mt -d aa -s 203 --disable-checkpoint - if (!params.skipPhylogeny) { + # Protein_Phylogeny + if [ "${params.iqCustomaa}" != "" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomaa} - process AminoType_Phylogeny { + elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} - label 'norm_cpus' + elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/Modeltest", mode: "copy", overwrite: true, pattern: '*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} - input: - file(prot) from aminotypesMafft + elif [ "${params.parametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + else + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + fi + """ + } + } - output: - tuple file("*_aln.fasta"), file("*_aln.html"), file("*.log"), file("*iq*"), file("*mt*") into alignprot_results - file("*iq.treefile") into amino_rax_plot + process Generate_AminoTypes_Counts_Table { - script: - """ - # Protein_Alignment - pre=\$(echo ${prot} | awk -F ".fasta" '{print \$1}' ) - mafft --thread ${task.cpus} --maxiterate 15000 --auto ${prot} >\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html + label 'high_cpus' - # Protein_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d aa -s 203 --disable-checkpoint + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Counts", mode: "copy", overwrite: true - # Protein_Phylogeny - if [ "${params.iqCustomaa}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomaa} + input: + file(fasta) from aminotypesCounts + file(merged) from mergeforprotcounts + file(samplist) from samplelist - elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} + output: + tuple file("*_AminoType_counts.csv"), file("*dmd.out") into counts_summary + file("*_AminoType_counts.csv") into (aminocounts_plot, aminocountmed) - elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 
\${pre}_aln.fasta.log | head -1 | awk '{print \$6}')
- iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni
+ script:
+ """
+ set +e
+ diamond makedb --in ${fasta} --db ${fasta}
+ diamond blastx -q ${merged} -d ${fasta} -p ${task.cpus} --min-score ${params.ProtCountsBit} --id ${params.ProtCountID} -l ${params.ProtCountsLength} --${params.sensitivity} -o ${params.projtag}_protCounts_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1
+ echo "OTU_ID" >tmp.col1.txt
+ echo "Generating sample id list"
+ grep ">" ${fasta} | awk -F ">" '{print \$2}' | sort | uniq > otuid.list
+ cat otuid.list >> tmp.col1.txt
+ echo "Beginning them counts tho my g"
+ for y in \$( cat ${samplist} );do
+ echo "Starting with \$y now ..."
+ grep "\$y" ${params.projtag}_protCounts_dmd.out > tmp."\$y".out
+ echo "Isolated hits"
+ echo "Created uniq subject id list"
+ echo "\$y" > "\$y"_col.txt
+ echo "Starting my counts"
+ for z in \$(cat otuid.list);do
+ echo "Counting \$z hits"
+ echo "grep -wc "\$z" >> "\$y"_col.txt"
+ grep -wc "\$z" tmp."\$y".out >> "\$y"_col.txt
+ echo "\$z counted"
+ done
+ done
+ paste -d "," tmp.col1.txt *col.txt > ${params.projtag}_AminoType_counts.csv
+ rm tmp*
+ rm *col.txt
+ """
+ }
+ }
- elif [ "${params.nonparametric}" != "false" ];then
- iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots}
+ if (params.aminoMED) {
- elif [ "${params.parametric}" != "false" ];then
+ process AminoType_Minimum_Entropy_Decomposition {
- else
- iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni
- fi
- """
- }
- }
+ label 'low_cpus'
- process Generate_AminoTypes_Counts_Table {
+ publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/AminoTypes/MED", mode: "copy", overwrite: true
- label 'high_cpus'
+ input:
+ file(aminos) from aminos_for_med
- publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/Counts", mode: "copy", overwrite: true
+ output:
+ file("*_AminoType_Grouping.csv") into atygroupscsv
+ file("${params.projtag}_AminoType_group_reps_aligned.fasta") into atygroupreps
- input:
- file(fasta) from aminotypesCounts
- file(merged) from mergeforprotcounts
- file(samplist) from samplelist
-
- output:
- tuple file("*_protcounts.csv"), file("*dmd.out") into counts_summary
- file("*_protcounts.csv") into aminocounts_plot
-
- script:
- """
- set +e
- diamond makedb --in ${fasta} --db ${fasta}
- diamond blastx -q ${merged} -d ${fasta} -p ${task.cpus} --min-score ${params.ProtCountsBit} --id ${params.ProtCountID} -l ${params.ProtCountsLength} --more-sensitive -o ${params.projtag}_protCounts_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 --max-hsps 1
- echo "OTU_ID" >tmp.col1.txt
- echo "Generating sample id list"
- grep ">" ${fasta} | awk -F ">" '{print \$2}' | sort | uniq > otuid.list
- cat otuid.list >> tmp.col1.txt
- echo "Beginning them counts tho my g"
- for y in \$( cat ${samplist} );do
- echo "Starting with \$y now ..."
- grep "\$y" ${params.projtag}_protCounts_dmd.out > tmp."\$y".out - echo "Isolated hits" - echo "Created uniq subject id list" - echo "\$y" > "\$y"_col.txt - echo "Starting my counts" - for z in \$(cat otuid.list);do - echo "Counting \$z hits" - echo "grep -wc "\$z" >> "\$y"_col.txt" - grep -wc "\$z" tmp."\$y".out >> "\$y"_col.txt - echo "\$z counted" - done - done - paste -d "," tmp.col1.txt *col.txt > ${params.projtag}_protcounts.csv - rm tmp* - rm *col.txt - """ - } - } - - if (params.pcASV) { // ASV_nucl -> ASV_aa -> clusteraa by %id with ch-hit -> extract representative nucl sequences to generate new OTU file - - if (params.sing) { - - process Translating_For_pcASV_Generation { - - label 'low_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/Translation", mode: "copy", overwrite: true, pattern: '*_ASV_translations*' - - input: - file(fasta) from nucl2aa - - output: - file("*ASV*translations.fasta") into clustering_aa - file("*_ASV_translations_report") into reportaa_VR - file("*_ASV_nucleotide.fasta") into asvfastaforaaclust - - script: - """ - conda init && source activate virtualribosome - - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASV_translations.fasta --report ${params.projtag}_ASV_translations_report - cp ${fasta} ${params.projtag}_ASV_nucleotide.fasta - """ - - } - - } else { - - process Translation_For_pcASV_Generation { - - label 'low_cpus' - - conda 'python=2.7' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/Translation", mode: "copy", overwrite: true, pattern: '*_ASV_translations*' - - input: - file(fasta) from nucl2aa - - output: - file("*ASV*translations.fasta") into clustering_aa - file("*_ASV_translations_report") into reportaa_VR - file("*_ASV_nucleotide.fasta") into asvfastaforaaclust - - script: - """ - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASV_translations.fasta --report ${params.projtag}_ASV_translations_report - cp ${fasta} ${params.projtag}_ASV_nucleotide.fasta - """ - } - } - - process Generate_pcASVs { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV", mode: "copy", overwrite: true, pattern: '*pcASV*.{fasta}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{clstr,csv,gc}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/Problematic", mode: "copy", overwrite: true, pattern: '*problem*.{fasta}' - - input: - file(fasta) from clustering_aa - file(asvs) from asvfastaforaaclust - - output: - file("${params.projtag}_nucleotide_pcASV*.fasta") into ( pcASV_ntDiamond_ch, pcASV_nt_counts_ch, pcASV_ntmatrix_ch, pcASV_ntmafft_ch ) - file("*_aminoacid_pcASV*_noTaxonomy.fasta") into ( pcASV_aaMatrix_ch, pcASV_aaDiamond_ch, pcASV_aaMafft_ch, pcASV_aaCounts_ch, pcASVEMBOSS ) - tuple file("*.fasta"), file("*.clstr"), file("*.csv"), file("*.gc") into ( pcASVsupplementalfiles ) - - script: - // add awk script to count seqs - if (params.clusterAAIDlist) { - """ - set +e - cp ${params.vampdir}/bin/rename_seq.py . 
- for id in `echo ${params.clusterAAIDlist} | tr "," "\\n"`;do - awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${fasta} > ${params.projtag}_filtered_proteins.fasta - cd-hit -i ${params.projtag}_filtered_proteins.fasta -c \${id} -o ${params.projtag}_pcASV\${id}.fasta - sed 's/>Cluster />Cluster_/g' ${params.projtag}_pcASV\${id}.fasta.clstr >${params.projtag}_pcASV\${id}.clstr - grep ">Cluster_" ${params.projtag}_pcASV\${id}.clstr >temporaryclusters.list - y=\$(grep -c ">Cluster_" ${params.projtag}_pcASV\${id}.clstr) - echo ">Cluster_"\${y}"" >> ${params.projtag}_pcASV\${id}.clstr - t=1 - b=1 - for x in \$(cat temporaryclusters.list);do - echo "Extracting \$x" - name="\$( echo \$x | awk -F ">" '{print \$2}')" - clust="pcASV"\${t}"" - echo "\${name}" - awk '/^>'\${name}'\$/,/^>Cluster_'\${b}'\$/' ${params.projtag}_pcASV\${id}.clstr > "\${name}"_"\${clust}"_tmp.list - t=\$(( \${t}+1 )) - b=\$(( \${b}+1 )) - done - ls *_tmp.list - u=1 - for x in *_tmp.list;do - name="\$(echo \$x | awk -F "_p" '{print \$1}')" - echo "\${name}" - cluster="\$(echo \$x | awk -F "_" '{print \$3}')" - echo "\${cluster}" - grep "ASV" \$x | awk -F ", " '{print \$2}' | awk -F "_" '{print \$1}' | awk -F ">" '{print \$2}' > \${name}_\${cluster}_seqs_tmps.list - seqtk subseq ${asvs} \${name}_\${cluster}_seqs_tmps.list > \${name}_\${cluster}_nucleotide_sequences.fasta - vsearch --cluster_fast \${name}_\${cluster}_nucleotide_sequences.fasta --id 0.2 --centroids \${name}_\${cluster}_centroids.fasta - grep ">" \${name}_\${cluster}_centroids.fasta >> \${name}_\${cluster}_tmp_centroids.list - for y in \$( cat \${name}_\${cluster}_tmp_centroids.list );do - echo ">\${cluster}_type"\$u"" >> \${name}_\${cluster}_tmp_centroid.newheaders - u=\$(( \${u}+1 )) - done - u=1 - ./rename_seq.py \${name}_\${cluster}_centroids.fasta \${name}_\${cluster}_tmp_centroid.newheaders \${cluster}_types_labeled.fasta - done - cat *_types_labeled.fasta >> ${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$3}' | awk -F "." 
'{print \$1}' >tmphead.list - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list - paste -d"," temporaryclusters.list tmphead.list >tmp.info.csv - grep ">" ${params.projtag}_pcASV\${id}.fasta >lala.list - j=1 - for x in \$(cat lala.list);do - echo ">${params.projtag}_pcASV\${j}" >>${params.projtag}_aminoheaders.list - echo "\${x},>${params.projtag}_pcASV\${j}" >>tmpaminotype.info.csv - j=\$(( \${j}+1 )) - done - rm lala.list - awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list - for x in \$(cat tmporder.list);do - grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list - done - paste -d "," temporaryclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_pcASVCluster\${id}_summary.csv - ./rename_seq.py ${params.projtag}_pcASV\${id}.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta - stats.sh in=${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_aminoacid_clustered.gc gcformat=4 overwrite=true - stats.sh in=${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_nucleotide_clustered.gc gcformat=4 overwrite=true - awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${fasta} >${params.projtag}_pcASV\${id}_problematic_translations.fasta - if [ `wc -l ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then - grep ">" ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list - seqtk subseq ${asvs} > ${params.projtag}_pcASV\${id}_problematic_nucleotides.fasta - else - rm ${params.projtag}_pcASV\${id}_problematic_translations.fasta - fi - rm *.list - rm Cluster* - rm *types* - rm *tmp* - rm ${params.projtag}_pcASV\${id}.fast* - done - """ - } else if (params.clusterAAID) { - """ - set +e - cp ${params.vampdir}/bin/rename_seq.py . 
- id=${params.clusterAAID} - awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${fasta} > ${params.projtag}_filtered_proteins.fasta - cd-hit -i ${params.projtag}_filtered_proteins.fasta -c ${params.clusterAAID} -o ${params.projtag}_pcASV\${id}.fasta - sed 's/>Cluster />Cluster_/g' ${params.projtag}_pcASV\${id}.fasta.clstr >${params.projtag}_pcASV\${id}.clstr - grep ">Cluster_" ${params.projtag}_pcASV\${id}.clstr >temporaryclusters.list - y=\$(grep -c ">Cluster_" ${params.projtag}_pcASV\${id}.clstr) - echo ">Cluster_"\${y}"" >> ${params.projtag}_pcASV\${id}.clstr - t=1 - b=1 - for x in \$(cat temporaryclusters.list);do - echo "Extracting \$x" - name="\$( echo \$x | awk -F ">" '{print \$2}')" - clust="pcASV"\${t}"" - echo "\${name}" - awk '/^>'\${name}'\$/,/^>Cluster_'\${b}'\$/' ${params.projtag}_pcASV\${id}.clstr > "\${name}"_"\${clust}"_tmp.list - t=\$(( \${t}+1 )) - b=\$(( \${b}+1 )) - done - - ls *_tmp.list - u=1 - for x in *_tmp.list;do - name="\$(echo \$x | awk -F "_p" '{print \$1}')" - echo "\${name}" - cluster="\$(echo \$x | awk -F "_" '{print \$3}')" - echo "\${cluster}" - grep "ASV" \$x | awk -F ", " '{print \$2}' | awk -F "_" '{print \$1}' | awk -F ">" '{print \$2}' > \${name}_\${cluster}_seqs_tmps.list - seqtk subseq ${asvs} \${name}_\${cluster}_seqs_tmps.list > \${name}_\${cluster}_nucleotide_sequences.fasta - vsearch --cluster_fast \${name}_\${cluster}_nucleotide_sequences.fasta --id 0.2 --centroids \${name}_\${cluster}_centroids.fasta - grep ">" \${name}_\${cluster}_centroids.fasta >> \${name}_\${cluster}_tmp_centroids.list - for y in \$( cat \${name}_\${cluster}_tmp_centroids.list );do - echo ">\${cluster}_type"\$u"" >> \${name}_\${cluster}_tmp_centroid.newheaders - u=\$(( \${u}+1 )) - done - u=1 - ./rename_seq.py \${name}_\${cluster}_centroids.fasta \${name}_\${cluster}_tmp_centroid.newheaders \${cluster}_types_labeled.fasta - done - cat *_types_labeled.fasta >> ${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$3}' | awk -F "." 
'{print \$1}' >tmphead.list - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list - paste -d"," temporaryclusters.list tmphead.list >tmp.info.csv - grep ">" ${params.projtag}_pcASV\${id}.fasta >lala.list - j=1 - for x in \$(cat lala.list);do - echo ">${params.projtag}_pcASV\${j}" >>${params.projtag}_aminoheaders.list - echo "\${x},>${params.projtag}_pcASV\${j}" >>tmpaminotype.info.csv - j=\$(( \${j}+1 )) - done - rm lala.list - awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list - for x in \$(cat tmporder.list);do - grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list - done - paste -d "," temporaryclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_pcASVCluster\${id}_summary.csv - ./rename_seq.py ${params.projtag}_pcASV\${id}.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta - stats.sh in=${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_aminoacid_clustered.gc gcformat=4 - stats.sh in=${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_nucleotide_clustered.gc gcformat=4 - awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${fasta} >${params.projtag}_pcASV\${id}_problematic_translations.fasta - if [ `wc -l ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then - grep ">" ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list - seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_pcASV\${id}_problematic_nucleotides.fasta - else - rm ${params.projtag}_pcASV\${id}_problematic_translations.fasta - fi - rm *.list - rm Cluster* - rm *types* - rm *tmp* - rm ${params.projtag}_pcASV\${id}.fast* - """ - } - } - - if (!params.skipTaxonomy) { - - process pcASV_Nucleotide_Taxonomy_Inference { - - label 'high_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' - - input: - file(reads) from pcASV_ntDiamond_ch - - output: - file("*.fasta") into ( pcASV_labeled ) - tuple file("*_phyloseqObject.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AAdiamond - file("*_summary_for_plot.csv") into taxplot3 - - script: - """ - set +e - cp ${params.vampdir}/bin/rename_seq.py . 
- virdb=${params.dbdir}/${params.dbname} - grep ">" \${virdb} >> headers.list - headers="headers.list" - for filename in ${reads};do - name=\$(ls \${filename} | awk -F "_noTax" '{print \$1}') - diamond blastx -q \${filename} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 - echo "Preparing lists to generate summary .csv's" - echo "[Best hit accession number]" >access.list - echo "[pcASV sequence length]" >length.list - echo "[e-value]" >evalue.list - echo "[Bitscore]" >bit.list - echo "[Percent ID (aa)]" >pid.list - echo "[pcASV#]" >otu.list - echo "[Virus ID]" >"\$name"_virus.list - echo "[Gene]" >"\$name"_genes.list - grep ">" \${filename} | awk -F ">" '{print \$2}' > seqids.lst - echo "extracting genes and names" - touch new_"\$name"_headers.txt - j=1 - for s in \$(cat seqids.lst);do - echo "Checking for \$s hit in diamond output" - if [[ ${params.refseq} == "T" ]];then - echo "RefSeq headers specified" - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "." '{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">pcASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">pcASV\${j}_"\$virus"_"\$gene"" >> new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - fi - else - echo "Using RVDB headers." 
- if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') - echo "\$s" >>otu.list - echo "\$acc" >>access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >>"\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && - echo "\$virus" | sed 's/_/ /g' >>"\$name"_virus.list - echo ">pcASV\${j}_"\$virus"_"\$gene"" >>new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >>otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">pcASV\${j}_"\$virus"_"\$gene"" >>new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - fi - fi - echo "Done with \$s" - done - echo "Now editing "\$name" fasta headers" - ###### rename_seq.py - ./rename_seq.py \${filename} new_"\$name"_headers.txt "\$name"_TaxonomyLabels.fasta - awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta > "\$name"_tmpssasv.fasta - echo "[Sequence header]" > newnames.list - cat new_"\$name"_headers.txt >> newnames.list - touch sequence.list - echo " " > sequence.list - grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list - rm "\$name"_tmpssasv.fasta - paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloseqObject.csv - paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv - for x in *phyloseqObject.csv;do - echo "\$x" - lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) - tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; - sed 's/ /_/g' tmpcol.list > tmp2col.list; - cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; - rm tmpcol.list tmp2col.list - done - rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst otu.list ; - rm "\$name"_virus.list - rm "\$name"_genes.list - rm newnames.list - rm access.list - echo "Taxonomy inferred for: \${filename} " - done - rm *headers.list - """ - } - } - - process Generate_Nucleotide_pcASV_Counts { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Counts", mode: "copy", overwrite: true, pattern: '*.{biome,csv,txt}' - - input: - file(potus) from pcASV_nt_counts_ch - file(merged) from pcASV_mergedreads_ch - - output: - tuple file("*_counts.txt"), file("*_counts.biome") into pcASVcounts_vsearch - file("*.csv") into potu_Ncounts_for_report - - script: - """ - for filename in ${potus};do - 
ident=\$( echo \${filename} | awk -F "pcASV" '{print \$2}' | awk -F "_noTaxonomy.fasta" '{print \$1}') - name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') - vsearch --usearch_global ${merged} --db \${filename} --id \${ident} --threads ${task.cpus} --otutabout \${name}_counts.txt --biomout \${name}_counts.biome - cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv - sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv - rm \${name}_count.csv - done - """ - } - - process Generate_pcASV_Nucleotide_Matrix { - - label 'low_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Matrix", mode: "copy", overwrite: true - - input: - file(potus) from pcASV_ntmatrix_ch - - output: - file("*.matrix") into pcASVclustmatrices - file("*PercentID.matrix") into potu_nucl_heatmap - - script: - """ - for filename in ${potus};do - ident=\$( echo \${filename} | awk -F "pcASV" '{print \$2}' | awk -F ".fasta" '{print \$1}') - name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') - clustalo -i \${filename} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus} - clustalo -i \${filename} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} - for x in *q.matrix;do - pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') - ya=\$(wc -l \$x | awk '{print \$1}') - echo "\$((\$ya-1))" - tail -"\$((\$ya-1))" \$x > \${pre}z.matrix - rm \$x - cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix - rm \${pre}z.matrix - done - done - """ - } - - if (!params.skipPhylogeny) { - - process pcASV_Nucleotide_Phylogeny { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' - - input: - file(reads) from pcASV_ntmafft_ch - - output: - tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into pcASV_nucleotide_phylogeny_results - file("*iq.treefile") into potu_Ntree_plot - - script: - """ - for filename in ${reads};do - pre=\$( echo \${filename} | awk -F "_noTax" '{print \$1}' ) - mafft --maxiterate 5000 --auto \${filename} >\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html - - # pcASV_Nucleotide_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d nt -s 203 --disable-checkpoint - - # pcASV_Nucleotide_Phylogeny - if [ "${params.iqCustomnt}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -t \${pre}_mt.tree -T auto ${params.iqCustomnt} - - elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} - - elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -t \${pre}_mt.tree -nt auto -bb 
${params.boots} -bnni - - elif [ "${params.nonparametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -b ${params.boots} - - elif [ "${params.parametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni - - else - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -t \${pre}_mt.tree -nt auto -bb ${params.boots} -bnni - fi - done - """ - } - } - - process pcASV_AminoAcid_Matrix { - - label 'low_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Matrix", mode: "copy", overwrite: true - - input: - file(prot) from pcASV_aaMatrix_ch - - output: - file("*.matrix") into pcASVaaMatrix - file("*PercentID.matrix") into potu_aa_heatmap - - script: - """ - for filename in ${prot};do - name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') - clustalo -i \${filename} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus} - clustalo -i \${filename} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} - for x in *q.matrix;do - pre=\$(echo "\$x" | awk -F "q.matrix" '{print \$1}') - ya=\$(wc -l \$x | awk '{print \$1}') - echo "\$((\$ya-1))" - tail -"\$((\$ya-1))" \$x > \${pre}z.matrix - rm \$x - cat \${pre}z.matrix | sed 's/ /,/g' | sed -E 's/(,*),/,/g' >\${pre}.matrix - rm \${pre}z.matrix - done - done - """ - } - - if (!params.skipEMBOSS) { - - process pcASV_EMBOSS_Analyses { - - label 'low_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/2dStructure", mode: "copy", overwrite: true, pattern: '*.{garnier}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/HydrophobicMoment", mode: "copy", overwrite: true, pattern: '*HydrophobicMoments.{svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/IsoelectricPoint", mode: "copy", overwrite: true, pattern: '*IsoelectricPoint.{iep,svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/ProteinProperties", mode: "copy", overwrite: true, pattern: '*.{pepstats,pepinfo}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/ProteinProperties/Plots", mode: "copy", overwrite: true, pattern: '*PropertiesPlot.{svg}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/2dStructure/Plots", mode: "copy", overwrite: true, pattern: '*Helical*.{svg}' - - input: - file(prot) from pcASVEMBOSS - - output: - tuple file("*.garnier"), file("*HydrophobicMoments.svg"), file("*IsoelectricPoint*"), file("*.pepstats"), file("*PropertiesPlot*"), file("*Helical*") into pcASV_emboss - - script: - """ - for filename in ${prot};do - name=\$( echo \${filename} | awk -F ".fasta" '{print \$1}') - garnier -sequence \${filename} -outfile \${name}_2dStructures.garnier - hmoment -seqall \${filename} -graph svg -plot - mv hmoment.svg ./"\${name}"_HydrophobicMoments.svg - iep -sequence \${filename} -graph svg -plot -outfile "\${name}"_IsoelectricPoint.iep - mv iep.svg ./"\${name}"_IsoelectricPoint.svg - pepstats -sequence \${filename} -outfile \${name}_ProteinProperties.pepstats - grep ">" \${filename} | awk -F ">" '{print \$2}' > tmpsequence.list - for x in \$(cat tmpsequence.list);do - echo \$x > tmp1.list - seqtk subseq \${filename} tmp1.list > tmp2.fasta - len=\$(tail -1 
tmp2.fasta | awk '{print length}') - pepinfo -sequence tmp2.fasta -graph svg -outfile "\$x"_PropertiesPlot.pepinfo - mv pepinfo.svg ./"\$x"_PropertiesPlot.svg - cat "\$x"_PropertiesPlot.pepinfo >> "\${name}"_PropertiesPlot.pepinfo - rm "\$x"_PropertiesPlot.pepinfo - pepnet -sask -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len - mv pepnet.svg ./"\$x"_HelicalNet.svg - pepwheel -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len - mv pepwheel.svg ./"\$x"_HelicalWheel.svg - rm tmp1.list tmp2.fasta - done - rm tmpsequence.list - done - """ - } - } - - if (!params.skipTaxonomy) { - - process pcASV_AminoAcid_Taxonomy_Inference { - - label 'high_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' - - input: - file(reads) from pcASV_aaDiamond_ch - - output: - file("*.fasta") into ( pcASV_labeledAA ) - tuple file("*phyloseqObject.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_potuaadiamond - file("*_summary_for_plot.csv") into taxplot4 - - script: - """ - cp ${params.vampdir}/bin/rename_seq.py . - virdb=${params.dbdir}/${params.dbname} - grep ">" \${virdb} >> headers.list - headers="headers.list" - for filename in ${reads};do - name=\$(ls \${filename} | awk -F ".fasta" '{print \$1}') - diamond blastp -q \${filename} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} --min-score ${params.bitscore} --more-sensitive -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 - echo "Preparing lists to generate summary .csv's" - echo "[Best hit accession number]" >access.list - echo "[pcASV sequence length]" >length.list - echo "[e-value]" >evalue.list - echo "[Bitscore]" >bit.list - echo "[Percent ID (aa)]" >pid.list - echo "[pcASVaa#]" >otu.list - echo "[Virus ID]" >"\$name"_virus.list - echo "[Gene]" >"\$name"_genes.list - grep ">" \${filename} | awk -F ">" '{print \$2}' > seqids.lst - echo "extracting genes and names" - touch new_"\$name"_headers.txt - j=1 - for s in \$(cat seqids.lst);do - echo "Checking for \$s hit in diamond output" - if [[ ${params.refseq} == "T" ]];then - echo "RefSeq headers specified" - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') - echo "\$s" >> otu.list - echo "\$acc" >> access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "." 
'{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " print substr(\$0, index(\$0,\$2)) | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g') - echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list - echo ">pcASVaa\${j}_"\$virus"_"\$gene"" >> new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >> otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">pcASVaa\${j}_"\$virus"_"\$gene"" >> new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - fi - else - echo "Using RVDB headers." - if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then - echo "Yep, there was a hit for \$s" - echo "Extracting the information now:" - acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') - echo "\$s" >>otu.list - echo "\$acc" >>access.list - line="\$(grep -w "\$s" "\$name"_dmd.out)" - echo "\$line" | awk '{print \$10}' >>evalue.list - echo "\$line" | awk '{print \$11}' >>bit.list - echo "\$line" | awk '{print \$12}' >>pid.list - echo "\$line" | awk '{print \$2}' >>length.list - echo "Extracting virus and gene ID for \$s now" - gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && - echo "\$gene" | sed 's/_/ /g' >>"\$name"_genes.list - virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && - echo "\$virus" | sed 's/_/ /g' >>"\$name"_virus.list - echo ">pcASVaa\${j}_"\$virus"_"\$gene"" >>new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." - else - echo "Ugh, there was no hit for \$s .." - echo "We still love \$s though and we will add it to the final fasta file" - echo "\$s" >>otu.list - echo "NO_HIT" >>access.list - echo "NO_HIT" >>"\$name"_genes.list - echo "NO_HIT" >>"\$name"_virus.list - echo "NO_HIT" >>evalue.list - echo "NO_HIT" >>bit.list - echo "NO_HIT" >>pid.list - echo "NO_HIT" >>length.list - virus="NO" - gene="HIT" - echo ">pcASVaa\${j}_\${virus}_\${gene}" >>new_"\$name"_headers.txt - j=\$((\$j+1)) - echo "\$s done." 
- fi - fi - echo "Done with \$s" - done - echo "Now editing "\$name" fasta headers" - ###### rename_seq.py - ./rename_seq.py \${filename} new_"\$name"_headers.txt "\$name"_wTax.fasta - awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_wTax.fasta > "\$name"_tmpssasv.fasta - echo "[Sequence header]" > newnames.list - cat new_"\$name"_headers.txt >> newnames.list - touch sequence.list - awk 'BEGIN{RS=">";ORS=""}{print \$2"\\n"}' \${name}_tmpssasv.fasta >>sequence.list - rm "\$name"_tmpssasv.fasta - paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_summary_phyloseqObject.csv - paste -d"\\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv - for x in *phyloseqObject.csv;do - echo "\$x" - lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) - tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; - sed 's/ /_/g' tmpcol.list > tmp2col.list; - cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; - rm tmpcol.list tmp2col.list - done - rm evalue.list ; rm sequence.list ; rm bit.list ; rm pid.list ; rm length.list seqids.lst otu.list ; - rm "\$name"_virus.list - rm "\$name"_genes.list - rm newnames.list - rm access.list - echo "Taxonomy inferred for: \${filename} " - done - rm *headers.list - """ - } - } - - if (!params.skipPhylogeny) { - - process pcASV_Protein_Phylogeny { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/Modeltest", mode: "copy", overwrite: true, pattern: '*mt*' - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' - - input: - file(prot) from pcASV_aaMafft_ch - - output: - tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into pcASV_protein_phylogeny_results - file("*iq.treefile") into potu_Atree_plot - - script: + script: """ - for filename in ${prot};do - pre=\$( echo \${filename} | awk -F ".fasta" '{print \$1}' ) - mafft --maxiterate 5000 --auto \${filename} >\${pre}_ALN.fasta - trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html - - # pcASV_Protein_ModelTest - modeltest-ng -i \${pre}_aln.fasta -p ${task.cpus} -o \${pre}_mt -d aa -s 203 --disable-checkpoint - - # pcASV_Protein_Phylogeny - if [ "${params.iqCustomaa}" != "" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomaa} - - elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} - - elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then - mod=\$(tail -12 \${pre}_aln.fasta.log | head -1 | awk '{print \$6}') - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - - elif [ "${params.nonparametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} - - 
elif [ "${params.parametric}" != "false" ];then - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + #alignment + if [[ \$(grep -c ">" ${aminos}) -gt 499 ]]; then algo="super5"; else algo="mpc"; fi + ${tools}/muscle5.0.1278_linux64 -"\${algo}" ${aminos} -out ${params.projtag}_AminoTypes_muscleAlign.fasta -threads ${task.cpus} -quiet + #trimming + trimal -in ${params.projtag}_AminoTypes_muscleAlign.fasta -out ${params.projtag}_AminoTypes_muscleAligned.fasta -keepheader -fasta -automated1 + rm ${params.projtag}_AminoTypes_muscleAlign.fasta + o-trim-uninformative-columns-from-alignment ${params.projtag}_AminoTypes_muscleAligned.fasta + mv ${params.projtag}_AminoTypes_muscleAligned.fasta-TRIMMED ./${params.projtag}_AminoTypes_Aligned_informativeonly.fasta + #entopy analysis + entropy-analysis ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta + #Decomposition + if [[ \$(echo ${params.aminoC} | grep -c ",") -gt 0 ]] + then + tag=\$(echo ${params.aminoC} | sed 's/,/_/g') + oligotype ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_AminoTypeMED_"\$tag" -M 1 -C ${params.aminoC} -N ${task.cpus} --skip-check-input --no-figures --skip-gen-html + elif [[ "${params.aminoSingle}" == "true" ]] + then + tag="${params.aminoC}" + oligotype ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_AminoTypeMED_"\$tag" -M 1 -C ${params.aminoC} -N ${task.cpus} --skip-check-input --no-figures --skip-gen-html + else + oligotype ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta ${params.projtag}_AminoTypes_Aligned_informativeonly.fasta-ENTROPY -o ${params.projtag}_AminoTypeMED_${params.aminoC} -M 1 -c ${params.aminoC} -N ${task.cpus} --skip-check-input --no-figures --skip-gen-html + fi + #generatemaps + cd ./${params.projtag}_AminoTypeMED_${params.aminoC}/OLIGO-REPRESENTATIVES/ + echo "AminoType,Group,IDPattern" + j=1 + for x in *_unique; + do gid=\$(echo \$x | awk -F "_" '{print \$1}') + uni=\$(echo \$x | awk -F ""\${gid}"_" '{print \$2}' | awk -F "_uni" '{print \$1}') + grep ">" "\$gid"_"\$uni" | awk -F ">" '{print \$2}' > asv.list + seqtk subseq ../../${aminos} asv.list > Group"\${j}"_sequences.fasta + for z in \$( cat asv.list) + do echo ""\$z",Group"\$j","\$uni"" >> ${params.projtag}_AminoType_Grouping.csv - else - iqtree -s \${pre}_aln.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - fi + done + rm asv.list + echo ">Group\${j}" >> ${params.projtag}_AminoType_group_reps_aligned.fasta + echo "\$uni" > group.list + seqtk subseq ../OLIGO-REPRESENTATIVES.fasta group.list > group.fasta + tail -1 group.fasta >> ${params.projtag}_AminoType_group_reps_aligned.fasta + mv "\$gid"_"\$uni" ./Group"\$j"_"\$uni"_aligned.fasta + mv "\$gid"_"\$uni"_unique ./Group"\$j"_"\$uni"_unqiues_aligned.fasta + rm "\$gid"*.cPickle + j=\$((\$j+1)) done - """ - } - } - - process Generate_pcASV_Protein_Counts { - - label 'high_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Counts", mode: "copy", overwrite: true - - input: - file(fasta) from pcASV_aaCounts_ch - file(merged) from mergeforpcASVaacounts - file(samplist) from samplistpotu - - output: - tuple file("*_counts.csv"), file("*dmd.out") into potuaacounts_summary - file("*counts.csv") into potu_Acounts - - script: - """ - set +e - for filename in ${fasta};do - 
potu="\$( echo \${filename} | awk -F "_" '{print \$3}')" - diamond makedb --in \${filename} --db \${filename} - diamond blastx -q ${merged} -d \${filename} -p ${task.cpus} --min-score ${params.ProtCountsBit} --id ${params.ProtCountID} -l ${params.ProtCountsLength} --more-sensitive -o ${params.projtag}_\${potu}_Counts_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 --max-hsps 1 - echo "OTU_ID" >tmp.col1.txt - echo "Generating sample id list" - grep ">" \${filename} | awk -F ">" '{print \$2}' | sort | uniq > otuid.list - cat otuid.list >> tmp.col1.txt - echo "Beginning them counts tho my g" - for y in \$( cat ${samplist} );do - echo "Starting with \$y now ..." - grep "\$y" ${params.projtag}_\${potu}_Counts_dmd.out > tmp."\$y".out - echo "Isolated hits" - echo "Created uniq subject id list" - echo "\$y" > "\$y"_col.txt - echo "Starting my counts" - for z in \$(cat otuid.list);do - echo "Counting \$z hits" - echo "grep -wc "\$z" >> "\$y"_col.txt" - grep -wc "\$z" tmp."\$y".out >> "\$y"_col.txt - echo "\$z counted" - done - done - paste -d "," tmp.col1.txt *col.txt > ${params.projtag}_\${potu}_counts.csv - rm tmp* - rm *col.txt - done - """ - } - } + mv ${params.projtag}_AminoType_Grouping.csv ../../ + mv ${params.projtag}_AminoType_group_reps_aligned.fasta ../../ + cd .. - if (!params.skipReport) { - - if (!params.skipAdapterRemoval) { - - process combine_csv { - - input: - file(csv) from fastp_csv - .collect() - - output: - file("final_reads_stats.csv") into ( fastp_csv1, fastp_csv2, fastp_csv3, fastp_csv4, fastp_csv5 ) - - script: """ - cat ${csv} >all_reads_stats.csv - head -n1 all_reads_stats.csv >tmp.names.csv - cat all_reads_stats.csv | grep -v ""Sample,Total_"" >tmp.reads.stats.csv - cat tmp.names.csv tmp.reads.stats.csv >final_reads_stats.csv - rm tmp.names.csv tmp.reads.stats.csv - """ - - } - } - - if (params.ncASV) { - - - process Report_ASV { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport", mode: "copy", overwrite: true - - input: - file(counts) from asv_counts_plots - file(taxonomy) from taxplot1 - file(matrix) from asv_heatmap - file(readsstats) from fastp_csv1 - - output: - file("*.html") into report_summaryA - - script: - """ - name=\$( echo ${taxonomy} | awk -F "_summary_for_plot.csv" '{print \$1}') - cp ${params.vampdir}/bin/vAMPirus_ReportA.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - Rscript -e "rmarkdown::render('vAMPirus_ReportA.Rmd',output_file='\${name}_ASV_Report.html')" \${name} \ - ${readsstats} \ - ${counts} \ - ${params.metadata} \ - ${params.minimumCounts} \ - ${matrix} \ - ${taxonomy} \ - ${params.trymax} \ - ${params.stats} - """ - } - - process Report_ncASV { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport/ncASV", mode: "copy", overwrite: true - - input: - file(counts) from notu_counts_plots - file(taxonomy) from taxplot1a - file(matrix) from notu_heatmap - file(phylogeny) from nucl_phyl_plot - file(readsstats) from fastp_csv2 - - output: - file("*.html") into report_summaryB - - script: - """ - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - for x in *_summary_for_plot.csv;do - name=\$( echo \${x} | awk -F "_summary_for_plot.csv" '{print \$1}') - id=\$( echo \${x} | awk -F "_summary_for_plot.csv" '{print \$1}' | cut -f 2 -d "." 
) - Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_ncASV\${id}_Report.html')" \${name} \ - ${readsstats} \ - \$( echo ${counts} | tr " " "\\n" | grep "\${id}" ) \ - ${params.metadata} \ - ${params.minimumCounts} \ - \$( echo ${matrix} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${taxonomy} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${phylogeny} | tr " " "\\n" | grep "\${id}" ) \ - ${params.trymax} \ - ${params.stats} - done - """ - } - - if (!params.skipAminoTyping) { - - process Report_AminoTypes { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport", mode: "copy", overwrite: true - - input: - file(counts) from aminocounts_plot - file(taxonomy) from taxplot2 - file(matrix) from aminotype_heatmap - file(phylogeny) from amino_rax_plot - file(readsstats) from fastp_csv5 - - output: - file("*.html") into report_summaryE - - script: - """ - name=\$( echo ${taxonomy} | awk -F "_summary_for_plot.csv" '{print \$1}') - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_AminoType_Report.html')" \${name} \ - ${readsstats} \ - ${counts} \ - ${params.metadata} \ - ${params.minimumCounts} ${matrix} \ - ${taxonomy} \ - ${phylogeny} \ - ${params.trymax} \ - ${params.stats} - """ } - } - } else { - process Report_ASVs { + if (!params.skipPhylogeny) { - label 'norm_cpus' + process AminoType_MED_Reps_phylogeny { + + label 'low_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport", mode: "copy", overwrite: true + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/MED/Phylogeny/Modeltest", mode: "copy", overwrite: true, pattern: '*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/MED/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' input: - file(counts) from asv_counts_plots - file(taxonomy) from taxplot1 - file(matrix) from asv_heatmap - file(phylogeny) from nucl_phyl_plot - file(readsstats) from fastp_csv1 + file(reps) from atygroupreps output: - file("*.html") into report_summaryA + file("*_AminoType_Group_Reps*") into align_results_aminmed + file("*iq.treefile") into amino_group_rep_tree script: """ - name=\$( echo ${taxonomy} | awk -F "_summary_for_plot.csv" '{print \$1}') - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_ASV_Report.html')" \${name} \ - ${readsstats} \ - ${counts} \ - ${params.metadata} \ - ${params.minimumCounts} \ - ${matrix} \ - ${taxonomy} \ - ${phylogeny} \ - ${params.trymax} \ - ${params.stats} - """ - } - - if (!params.skipAminoTyping) { - - process Report_AminoType { - - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport", mode: "copy", overwrite: true - - input: - file(counts) from aminocounts_plot - file(taxonomy) from taxplot2 - file(matrix) from aminotype_heatmap - file(phylogeny) from amino_rax_plot - file(readsstats) from fastp_csv5 - - output: - file("*.html") into report_summaryE - - script: - """ - name=\$( echo ${taxonomy} | awk -F "_summary_for_plot.csv" '{print \$1}') - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . 
- Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_AminoType_Report.html')" \${name} \ - ${readsstats} \ - ${counts} \ - ${params.metadata} \ - ${params.minimumCounts} ${matrix} \ - ${taxonomy} \ - ${phylogeny} \ - ${params.trymax} \ - ${params.stats} - """ - } - } - } + # Protein_ModelTest + modeltest-ng -i ${reps} -p ${task.cpus} -o ${params.projtag}_AminoType_Group_Reps_mt -d aa -s 203 --disable-checkpoint - if (params.pcASV) { + # Protein_Phylogeny + if [ "${params.iqCustomaa}" != "" ];then + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq --redo -T auto ${params.iqCustomaa} - process Report_pcASV_AminoAcid { + elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 ${reps}.log | head -1 | awk '{print \$6}') + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq -m \${mod} --redo -nt auto -b ${params.boots} - label 'norm_cpus' - - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport/pcASV/Aminoacid", mode: "copy", overwrite: true + elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 ${reps}.log | head -1 | awk '{print \$6}') + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - input: - file(counts) from potu_Acounts - file(taxonomy) from taxplot4 - file(matrix) from potu_aa_heatmap - file(phylogeny) from potu_Atree_plot - file(readsstats) from fastp_csv3 + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq -m MFP --redo -nt auto -b ${params.boots} - output: - file("*.html") into report_summaryC + elif [ "${params.parametric}" != "false" ];then + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - script: - """ - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - for x in *_summary_for_plot.csv;do - name=\$( echo \${x} | awk -F "_noTaxonomy_summary_for_plot.csv" '{print \$1}') - id=\$( echo \${x} | awk -F "_noTaxonomy_summary_for_plot.csv" '{print \$1}' | cut -f 2 -d "." 
) - Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_pcASVaa\${id}_Report.html')" \${name} \ - ${readsstats} \ - \$( echo ${counts} | tr " " "\\n" | grep "\${id}" ) \ - ${params.metadata} \ - ${params.minimumCounts} \ - \$( echo ${matrix} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${taxonomy} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${phylogeny} | tr " " "\\n" | grep "\${id}" ) \ - ${params.trymax} \ - ${params.stats} - done - """ - } + else + iqtree -s ${reps} --prefix ${params.projtag}_AminoType_Group_Reps_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + fi + """ + } - process Report_pcASV_Nucleotide { + process Adding_AminoType_MED_Info { - label 'norm_cpus' + label 'low_cpus' - publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReport/pcASV/Nucleotide", mode: "copy", overwrite: true + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/AminoTypes/MED/", mode: "copy", overwrite: true - input: - file(counts) from potu_Ncounts_for_report - file(taxonomy) from taxplot3 - file(matrix) from potu_nucl_heatmap - file(phylogeny) from potu_Ntree_plot - file(readsstats) from fastp_csv4 + input: + file(counts) from aminocountmed + file(tree) from amino_repphy + file(map) from atygroupscsv - output: - file("*.html") into report_summaryD + output: + file("${params.projtag}_AminoType_Groupingcounts.csv") into amino_groupcounts - script: - """ - cp ${params.vampdir}/bin/vAMPirus_ReportB.Rmd . - cp ${params.vampdir}/example_data/conf/vamplogo.png . - for x in *_summary_for_plot.csv;do - name=\$( echo \${x} | awk -F "_summary_for_plot.csv" '{print \$1}') - id=\$( echo \${x} | awk -F "_summary_for_plot.csv" '{print \$1}' | cut -f 2 -d "." ) - Rscript -e "rmarkdown::render('vAMPirus_ReportB.Rmd',output_file='\${name}_pcASVnt\${id}_Report.html')" \${name} \ - ${readsstats} \ - \$( echo ${counts} | tr " " "\\n" | grep "\${id}" ) \ - ${params.metadata} \ - ${params.minimumCounts} \ - \$( echo ${matrix} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${taxonomy} | tr " " "\\n" | grep "\${id}" ) \ - \$( echo ${phylogeny} | tr " " "\\n" | grep "\${id}" ) \ - ${params.trymax} \ - ${params.stats} - done - """ + script: + """ + awk -F "," '{print \$1}' ${counts} | sed '1d' > amino.list + echo "GroupID" >> group.list + for x in \$(cat amino.list); + do group=\$(grep -w \$x ${map} | awk -F "," '{print \$2}') + echo "\$group" >> group.list + done + paste -d',' group.list ${counts} > ${params.projtag}_AminoType_Groupingcounts.csv + """ } } - } + } else { + atygroupscsv = Channel.empty() + amino_group_rep_tree = Channel.empty() + amino_groupcounts = Channel.empty() + } -} else if (params.DataCheck) { + if (params.pcASV) { // ASV_nucl -> ASV_aa -> clusteraa by %id with ch-hit -> extract representative nucl sequences to generate new OTU file - println("\n\tRunning vAMPirus DataCheck \n") + process Translation_For_pcASV_Generation { - process QualityCheck_1DC { + label 'low_cpus' - label 'low_cpus' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/Translation", mode: "copy", overwrite: true, pattern: '*_ASV_translations*' - tag "${sample_id}" + input: + file(fasta) from nucl2aa - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/FastQC/PreClean", mode: "copy", overwrite: true + output: + file("*ASV*translations.fasta") into clustering_aa + file("*_ASV_translations_report") into reportaa_VR + file("*_ASV_nucleotide.fasta") into asvfastaforaaclust - input: - tuple sample_id, file(reads) from reads_qc_ch + 
script: + """ + ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASV_translations.fasta --report ${params.projtag}_ASV_translations_report + cp ${fasta} ${params.projtag}_ASV_nucleotide.fasta + """ + } - output: - tuple sample_id, file("*_fastqc.{zip,html}") into fastqc_results_OAS + process Generate_pcASVs { - script: - """ - fastqc --quiet --threads ${task.cpus} ${reads} - """ - } + label 'norm_cpus' - process Adapter_Removal_DC { + tag "${mtag}" - label 'norm_cpus' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV", mode: "copy", overwrite: true, pattern: '*pcASV*.{fasta}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{clstr,csv,gc}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Clustering/pcASV/Problematic", mode: "copy", overwrite: true, pattern: '*problem*.{fasta}' - tag "${sample_id}" + input: + each x from 1..naa + file(fasta) from clustering_aa + file(asvs) from asvfastaforaaclust - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/AdapterRemoval", mode: "copy", overwrite: true, pattern: "*.filter.fq" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/AdapterRemoval/fastpOut", mode: "copy", overwrite: true, pattern: "*.fastp.{json,html}" + output: + tuple nid, file("${params.projtag}_nucleotide_pcASV*.fasta") into ( pcASV_ntDiamond_ch, pcASV_nt_counts_ch, pcASV_ntmatrix_ch, pcASV_ntmuscle_ch ) + tuple nid, file("*_aminoacid_pcASV*_noTaxonomy.fasta") into ( pcASV_aaMatrix_ch, pcASV_aaDiamond_ch, pcASV_aaMafft_ch, pcASV_aaCounts_ch, pcASVEMBOSS ) + tuple nid, file("*.fasta"), file("*.clstr"), file("*.csv"), file("*.gc") into ( pcASVsupplementalfiles ) - input: - tuple sample_id, file(reads) from reads_ch + script: + // add awk script to count seqs + nid=slist2.get(x-1) + mtag="ID=" + slist2.get(x-1) + """ + set +e + cp ${params.vampdir}/bin/rename_seq.py . 
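+ # How the minAA filter just below works: with RS=">" awk reads one whole FASTA
+ # record at a time, so \$1 is the header token and \$2 is the sequence (assuming
+ # single-line records with space-free headers, as dna2pep writes them here).
+ # params.minAA is interpolated by Nextflow before the script runs, giving awk a
+ # plain numeric cutoff; for example, with minAA=70 a 65 aa translation is dropped
+ # here and only the longer translations move on to cd-hit clustering.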
+ awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${fasta} > ${params.projtag}_filtered_proteins.fasta + cd-hit -i ${params.projtag}_filtered_proteins.fasta -c .${nid} -o ${params.projtag}_pcASV${nid}.fasta + sed 's/>Cluster />Cluster_/g' ${params.projtag}_pcASV${nid}.fasta.clstr >${params.projtag}_pcASV${nid}.clstr + grep ">Cluster_" ${params.projtag}_pcASV${nid}.clstr >temporaryclusters.list + y=\$(grep -c ">Cluster_" ${params.projtag}_pcASV${nid}.clstr) + echo ">Cluster_"\${y}"" >> ${params.projtag}_pcASV${nid}.clstr + t=1 + b=1 + for x in \$(cat temporaryclusters.list);do + echo "Extracting \$x" + name="\$( echo \$x | awk -F ">" '{print \$2}')" + clust="pcASV"\${t}"" + echo "\${name}" + awk '/^>'\${name}'\$/,/^>Cluster_'\${b}'\$/' ${params.projtag}_pcASV${nid}.clstr > "\${name}"_"\${clust}"_tmp.list + t=\$(( \${t}+1 )) + b=\$(( \${b}+1 )) + done - output: - tuple sample_id, file("*.fastp.{json,html}") into fastp_results - tuple sample_id, file("*.filter.fq") into reads_fastp_ch - file("*.csv") into fastp_csv + ls *_tmp.list + u=1 + for x in *_tmp.list;do + name="\$(echo \$x | awk -F "_p" '{print \$1}')" + echo "\${name}" + cluster="\$(echo \$x | awk -F "_" '{print \$3}')" + echo "\${cluster}" + grep "ASV" \$x | awk -F ", " '{print \$2}' | awk -F "_" '{print \$1}' | awk -F ">" '{print \$2}' > \${name}_\${cluster}_seqs_tmps.list + seqtk subseq ${asvs} \${name}_\${cluster}_seqs_tmps.list > \${name}_\${cluster}_nucleotide_sequences.fasta + vsearch --cluster_fast \${name}_\${cluster}_nucleotide_sequences.fasta --id 0.2 --centroids \${name}_\${cluster}_centroids.fasta + grep ">" \${name}_\${cluster}_centroids.fasta >> \${name}_\${cluster}_tmp_centroids.list + for y in \$( cat \${name}_\${cluster}_tmp_centroids.list );do + echo ">\${cluster}_type"\$u"" >> \${name}_\${cluster}_tmp_centroid.newheaders + u=\$(( \${u}+1 )) + done + u=1 + ./rename_seq.py \${name}_\${cluster}_centroids.fasta \${name}_\${cluster}_tmp_centroid.newheaders \${cluster}_types_labeled.fasta + done + cat *_types_labeled.fasta >> ${params.projtag}_nucleotide_pcASV${nid}_noTaxonomy.fasta + grep -w "*" ${params.projtag}_pcASV${nid}.clstr | awk '{print \$3}' | awk -F "." 
'{print \$1}' >tmphead.list + grep -w "*" ${params.projtag}_pcASV${nid}.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list + paste -d"," temporaryclusters.list tmphead.list >tmp.info.csv + grep ">" ${params.projtag}_pcASV${nid}.fasta >lala.list + j=1 + for x in \$(cat lala.list);do + echo ">${params.projtag}_pcASV\${j}" >>${params.projtag}_aminoheaders.list + echo "\${x},>${params.projtag}_pcASV\${j}" >>tmpaminotype.info.csv + j=\$(( \${j}+1 )) + done + rm lala.list + awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list + for x in \$(cat tmporder.list);do + grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list + done + paste -d "," temporaryclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_pcASVCluster${nid}_summary.csv + ./rename_seq.py ${params.projtag}_pcASV${nid}.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_aminoacid_pcASV${nid}_noTaxonomy.fasta + stats.sh in=${params.projtag}_aminoacid_pcASV${nid}_noTaxonomy.fasta gc=${params.projtag}_pcASV${nid}_aminoacid_clustered.gc gcformat=4 + stats.sh in=${params.projtag}_nucleotide_pcASV${nid}_noTaxonomy.fasta gc=${params.projtag}_pcASV${nid}_nucleotide_clustered.gc gcformat=4 + awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${fasta} >${params.projtag}_pcASV${nid}_problematic_translations.fasta + if [ `wc -l ${params.projtag}_pcASV${nid}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then + grep ">" ${params.projtag}_pcASV${nid}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list + seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_pcASV${nid}_problematic_nucleotides.fasta + else + rm ${params.projtag}_pcASV${nid}_problematic_translations.fasta + fi + rm *.list + rm Cluster* + rm *types* + rm *tmp* + rm ${params.projtag}_pcASV${nid}.fast* + """ + } - script: - """ - echo ${sample_id} + if (!params.skipTaxonomy) { - fastp -i ${reads[0]} -I ${reads[1]} -o left-${sample_id}.filter.fq -O right-${sample_id}.filter.fq --detect_adapter_for_pe \ - --average_qual 25 -c --overrepresentation_analysis --html ${sample_id}.fastp.html --json ${sample_id}.fastp.json --thread ${task.cpus} \ - --report_title ${sample_id} + if (params.dbtype == "NCBI") { - bash get_readstats.sh ${sample_id}.fastp.json - """ - } + process pcASV_Nucleotide_Taxonomy_Inference_NCBI { - process Primer_Removal_DC { + label 'high_cpus' - label 'norm_cpus' + tag "${mtag}" - tag "${sample_id}" + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/PrimerRemoval", mode: "copy", overwrite: true + input: + tuple nid, file(asvs) from pcASV_ntDiamond_ch - input: - tuple sample_id, file(reads) from reads_fastp_ch + output: + file("*.fasta") into ( pcASV_labeled ) + tuple file("*_phyloformat.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AAdiamond + tuple nid, file("*_summary_for_plot.csv") into taxplot3 + tuple nid, file("*_quick_Taxbreakdown.csv") into tax_table_pcasvnt + tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_pcasvnt - 
output: - tuple sample_id, file("*bbduk*.fastq.gz") into ( reads_bbduk_ch, readsforqc2 ) + script: + mtag="ID=" + nid + """ + set +e + cp ${params.vampdir}/bin/rename_seq.py . + virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + if [[ ${params.ncbitax} == "true" ]] + then diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop staxids sskingdoms skingdoms sphylums --max-target-seqs 1 --max-hsps 1 + else diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + fi + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[pcASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "[NCBI Taxonomy ID],[Taxonomic classification from NCBI]" > ncbi_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Checking for \$s hit in diamond output" + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "." 
'{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " '{ print substr(\$0, index(\$0,\$2)) }' | sed 's/ /_/g') &&
+ echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list
+ virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g')
+ echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list
+ echo ">"\${s}"_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+ if [[ "${params.lca}" == "T" ]]
+ then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]]
+ then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}')
+ lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}')
+ echo "\$lcla" >> lca_classification.list
+ else echo "Viruses" >> lca_classification.list
+ fi
+ fi
+ if [[ ${params.ncbitax} == "true" ]]
+ then echo "\$line" | awk -F "\t" '{print \$14","\$16"::"\$18"::"\$17}' >> ncbi_classification.list
+ fi
+ echo "\$s done."
+ else
+ echo "Ugh, there was no hit for \$s .."
+ echo "We still love \$s though and we will add it to the final fasta file"
+ echo "\$s" >> otu.list
+ echo "NO_HIT" >> access.list
+ echo "NO_HIT" >> "\$name"_genes.list
+ echo "NO_HIT" >> "\$name"_virus.list
+ echo "NO_HIT" >> evalue.list
+ echo "NO_HIT" >> bit.list
+ echo "NO_HIT" >> pid.list
+ echo "NO_HIT" >> length.list
+ virus="NO"
+ gene="HIT"
+ echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+ if [[ "${params.lca}" == "T" ]]
+ then echo "N/A" >> lca_classification.list
+ fi
+ if [[ "${params.ncbitax}" == "true" ]]
+ then echo "N/A" >> ncbi_classification.list
+ fi
+ echo "\$s done."
+ fi
+ done
+ echo "Now editing "\$name" fasta headers"
+ ###### rename_seq.py
+ ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta
+ awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta
+ echo "[Sequence header]" > newnames.list
+ cat new_"\$name"_asvnames.txt >> newnames.list
+ touch sequence.list
+ echo " " > sequence.list
+ grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list
+ rm "\$name"_tmpssasv.fasta
+ if [[ "${params.lca}" == "T" && "${params.ncbitax}" == "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.lca}" == "T" && "${params.ncbitax}" != "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.ncbitax}" == "true" && "${params.lca}" != "T" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list
ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv + else + paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + echo "skipped" >> \${name}_quick_Taxbreakdown.csv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } else if (params.dbtype== "RVDB") { + + process pcASV_Nucleotide_Taxonomy_Inference_RVDB { + + label 'high_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' + + input: + tuple nid, file(asvs) from pcASV_ntDiamond_ch + + output: + file("*.fasta") into ( pcASV_labeled ) + tuple file("*_phyloformat.csv"), file("*_summaryTable.tsv"), file("*dmd.out") into summary_AAdiamond + tuple nid, file("*_summary_for_plot.csv") into taxplot3 + tuple nid, file("*_quick_Taxbreakdown.csv") into tax_table_pcasvnt + tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_pcasvnt + + script: + mtag="ID=" + nid + """ + set +e + cp ${params.vampdir}/bin/rename_seq.py . 
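+ # The next few lines pick the DIAMOND cutoff from params.measurement: "bitscore"
+ # maps to --min-score, "evalue" maps to -e, and anything else falls back to the
+ # bitscore cutoff. A rough standalone equivalent (illustrative values, assuming a
+ # prebuilt RVDB DIAMOND database):
+ #   diamond blastx -q pcASVs.fasta -d RVDB.dmnd --min-score 50 --max-target-seqs 1 --max-hsps 1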
+ virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + diamond blastx -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[pcASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Using RVDB headers." + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." + echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + echo "\$s done." 
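+ # Keeping no-hit sequences is intentional: every summary list gets a NO_HIT
+ # placeholder and the header is rebuilt as, e.g., >pcASV4_NO_HIT, so the final
+ # fasta, tables, and trees all stay in register with the full pcASV set.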
+ fi + echo "Done with \$s" + done + echo "Now editing "\$name" fasta headers" + ###### rename_seq.py + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta + echo "[Sequence header]" > newnames.list + cat new_"\$name"_asvnames.txt >> newnames.list + touch sequence.list + echo " " > sequence.list + grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list + rm "\$name"_tmpssasv.fasta + if [[ "${params.lca}" == "T" ]] + then paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv + else paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } + } - script: - // add cocktail primer removal, easy, set a list, line by line seperated by , in a for loop - if ( params.fwd == "" && params.rev == "" && !params.multi) { - """ - bbduk.sh in1=${reads[0]} out=${sample_id}_bb_R1.fastq.gz ftl=${params.defaultFwdTrim} t=${task.cpus} - bbduk.sh in=${reads[1]} out=${sample_id}_bb_R2.fastq.gz ftl=${params.defaultRevTrim} t=${task.cpus} - repair.sh in1=${sample_id}_bb_R1.fastq.gz in2=${sample_id}_bb_R2.fastq.gz out1=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz minlength=${params.minilen} outs=sing.fq repair - """ - } else if ( params.GlobTrim && !params.GlobTrim == "" ) { - """ - FTRIM=\$( echo ${GlobTrim} | cut -f 1 -d "," ) - RTRIM=\$( echo ${GlobTrim} | cut -f 2 -d "," ) - bbduk.sh in=${reads[0]} out=${sample_id}_bb_R1.fastq.gz ftl=\${FTRIM} t=${task.cpus} - bbduk.sh in=${reads[1]} out=${sample_id}_bb_R2.fastq.gz ftl=\${RTRIM} t=${task.cpus} - repair.sh in1=${sample_id}_bb_R1.fastq.gz in2=${sample_id}_bb_R2.fastq.gz out1=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz minlength=${params.minilen} outs=sing.fq repair - """ - } else if ( params.multi && params.primers ) { - """ - bbduk.sh in=${reads[0]} in2=${reads[1]} out=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz ref=${params.primers} copyundefined=t t=${task.cpus} restrictleft=${params.primerLength} k=${params.maxkmer} ordered=t mink=${params.minkmer} ktrim=l ecco=t rcomp=t 
minlength=${params.minilen} tbo tpe - """ - } else { - """ - bbduk.sh in=${reads[0]} in2=${reads[1]} out=${sample_id}_bbduk_R1.fastq.gz out2=${sample_id}_bbduk_R2.fastq.gz literal=${params.fwd},${params.rev} copyundefined=t t=${task.cpus} restrictleft=${params.primerLength} k=${params.maxkmer} ordered=t mink=${params.minkmer} ktrim=l ecco=t rcomp=t minlength=${params.minilen} tbo tpe - """ - } - } - process QualityCheck_2_DC { + process Generate_Nucleotide_pcASV_Counts { - label 'low_cpus' + label 'norm_cpus' - tag "${sample_id}" + tag "${mtag}" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/FastQC/PostClean", mode: "copy", overwrite: true + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Counts", mode: "copy", overwrite: true, pattern: '*.{biome,csv,txt}' - input: - tuple sample_id, file(reads) from readsforqc2 + input: + tuple nid, file(potus) from pcASV_nt_counts_ch + file(merged) from pcASV_mergedreads_ch - output: - tuple sample_id, file("*_fastqc.{zip,html}") into fastqc2_results_OAS + output: + tuple file("*_counts.txt"), file("*_counts.biome") into pcASVcounts_vsearch + tuple nid, file("*.csv") into potu_Ncounts_for_report - script: - """ - fastqc --quiet --threads ${task.cpus} ${reads} - """ - } + script: + mtag="ID=" + nid + """ + name=\$( echo ${potus} | awk -F ".fasta" '{print \$1}') + vsearch --usearch_global ${merged} --db ${potus} --id .${nid} --threads ${task.cpus} --otutabout \${name}_counts.txt --biomout \${name}_counts.biome + cat \${name}_counts.txt | tr "\t" "," >\${name}_count.csv + sed 's/#OTU ID/OTU_ID/g' \${name}_count.csv >\${name}_counts.csv + rm \${name}_count.csv + """ + } - process Read_Merging_DC { + process Generate_pcASV_Nucleotide_Matrix { - label 'norm_cpus' + label 'low_cpus' - tag "${sample_id}" + tag "${mtag}" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/Individual", mode: "copy", overwrite: true, pattern: "*mergedclean.fastq" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/Individual/notmerged", mode: "copy", overwrite: true, pattern: "*notmerged*.fastq" + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Matrix", mode: "copy", overwrite: true - input: - tuple sample_id, file(reads) from reads_bbduk_ch + input: + tuple nid, file(potus) from pcASV_ntmatrix_ch - output: - file("*_mergedclean.fastq") into reads_vsearch1_ch - file("*.name") into names - file("*notmerged*.fastq") into notmerged + output: + file("*.matrix") into pcASVclustmatrices + tuple nid, file("*PercentID.matrix") into potu_nucl_heatmap - script: - """ - vsearch --fastq_mergepairs ${reads[0]} --reverse ${reads[1]} --threads ${task.cpus} --fastqout ${sample_id}_mergedclean.fastq --fastqout_notmerged_fwd ${sample_id}_notmerged_fwd.fastq --fastqout_notmerged_rev ${sample_id}_notmerged_rev.fastq --fastq_maxee ${params.maxEE} --relabel ${sample_id}. 
- echo ${sample_id} > ${sample_id}.name
- """
+ script:
+ //check --percent-id second clustalo
+ mtag="ID=" + nid
+ """
+ name=\$( echo ${potus} | awk -F ".fasta" '{print \$1}')
+ clustalo -i ${potus} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus}
+ clustalo -i ${potus} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus}
+ cat \${name}_PercentIDq.matrix | tr " " "," | grep "," >\${name}_PercentID.matrix
+ rm \${name}_PercentIDq.matrix
+ """
+ }
- }
+ if (!params.skipPhylogeny) {
- process Compile_Reads_DC {
+ process pcASV_Nucleotide_Phylogeny {
- label 'low_cpus'
+ label 'norm_cpus'
- publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true
+ tag "${mtag}"
- input:
- file(reads) from reads_vsearch1_ch
- .collect()
+ publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*'
+ publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/ModelTest", mode: "copy", overwrite: true, pattern: '*mt*'
+ publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Nucleotide/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*'
- output:
- file("*_all_merged_preFilt_preClean.fastq") into collect_samples_ch
+ input:
+ tuple nid, file(prot) from pcASV_ntmuscle_ch
- script:
- """
- cat ${reads} >>${params.projtag}_all_merged_preFilt_preClean.fastq
- """
- }
+ output:
+ tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into pcASV_nucleotide_phylogeny_results
+ tuple nid, file("*iq.treefile") into potu_Ntree_plot
+
+ script:
+ mtag="ID=" + nid
+ """
+ pre=\$( echo ${prot} | awk -F "_noTax" '{print \$1}' )
+ if [[ \$(grep -c ">" ${prot}) -gt 499 ]]; then algo="super5"; else algo="mpc"; fi
+ ${tools}/muscle5.0.1278_linux64 -"\${algo}" ${prot} -out \${pre}_ALN.fasta -threads ${task.cpus} -quiet
+ trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html
+ o-trim-uninformative-columns-from-alignment \${pre}_aln.fasta
+ mv \${pre}_aln.fasta-TRIMMED ./\${pre}_Aligned_informativeonly.fasta
+ # pcASV_Nucleotide_ModelTest
+ modeltest-ng -i \${pre}_Aligned_informativeonly.fasta -p ${task.cpus} -o \${pre}_noTaxonomy_mt -d nt -s 203 --disable-checkpoint
- process Compile_Names_DC {
+ # pcASV_Nucleotide_Phylogeny
+ if [ "${params.iqCustomnt}" != "" ];then
+ iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq --redo -T auto ${params.iqCustomnt}
- label 'low_cpus'
+ elif [[ "${params.ModelTnt}" != "false" && "${params.nonparametric}" != "false" ]];then
+ mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}')
+ iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq -m \${mod} --redo -nt auto -b ${params.boots}
- publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging", mode: "copy", overwrite: true
+ elif [[ "${params.ModelTnt}" != "false" && "${params.parametric}" != "false" ]];then
+ mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}')
+ iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni
- input:
- file(names) from names
- .collect()
+ elif [ "${params.nonparametric}" != "false"
];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq -m MFP --redo -nt auto -b ${params.boots} - output: - file("*sample_ids.list") into ( samplelist, samplistpotu ) + elif [ "${params.parametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - script: - """ - cat ${names} >>${params.projtag}_sample_ids.list - """ - } + else + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_noTaxonomy_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + fi + """ + } + } - process Length_Filtering_DC { + process pcASV_AminoAcid_Matrix { - label 'norm_cpus' + label 'low_cpus' - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/LengthFiltering", mode: "copy", overwrite: true, pattern: "*_merged_preFilt*.fasta" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging", mode: "copy", overwrite: true, pattern: "*Lengthfiltered.fastq" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/Histograms/pre_length_filtering", mode: "copy", overwrite: true, pattern: "*preFilt_*st.txt" - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/Histograms/post_length_filtering", mode: "copy", overwrite: true, pattern: "*postFilt_*st.txt" + tag "${mtag}" - input: - file(reads) from collect_samples_ch + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Matrix", mode: "copy", overwrite: true - output: - file("*_merged_preFilt_clean.fasta") into ( nuclCounts_mergedreads_ch, pcASV_mergedreads_ch ) - file("*_merged_clean_Lengthfiltered.fastq") into reads_vsearch2_ch - - file("*preFilt_preClean_baseFrequency_hist.csv") into prefilt_basefreq - file("*preFilt_preClean_qualityScore_hist.csv") into prefilt_qualityscore - file("*preFilt_preClean_gcContent_hist.csv") into prefilt_gccontent - file("*preFilt_preClean_averageQuality_hist.csv") into prefilt_averagequality - file("*preFilt_preClean_length_hist.csv") into prefilt_length - - file("*postFilt_baseFrequency_hist.csv") into postFilt_basefreq - file("*postFilt_qualityScore_hist.csv") into postFilt_qualityscore - file("*postFilt_gcContent_hist.csv") into postFilt_gccontent - file("*postFilt_averageQuaulity_hist.csv") into postFilt_averagequality - file("*postFilt_length_hist.csv") into postFilt_length - - file("reads_per_sample_preFilt_preClean.csv") into reads_per_sample_preFilt - file("read_per_sample_postFilt_postClean.csv") into reads_per_sample_postFilt - script: - """ - bbduk.sh in=${reads} bhist=${params.projtag}_all_merged_preFilt_preClean_baseFrequency_hist.txt qhist=${params.projtag}_all_merged_preFilt_preClean_qualityScore_hist.txt gchist=${params.projtag}_all_merged_preFilt_preClean_gcContent_hist.txt aqhist=${params.projtag}_all_merged_preFilt_preClean_averageQuality_hist.txt lhist=${params.projtag}_all_merged_preFilt_preClean_length_hist.txt gcbins=auto - for x in *preFilt*hist.txt;do - pre=\$(echo \$x | awk -F ".txt" '{print \$1}') - cat \$x | tr "\t" "," > \${pre}.csv - rm \$x - done - reformat.sh in=${reads} out=${params.projtag}_preFilt_preclean.fasta t=${task.cpus} - echo "sample,reads" >> reads_per_sample_preFilt_preClean.csv - grep ">" ${params.projtag}_preFilt_preclean.fasta | awk -F ">" '{print \$2}' | awk -F "." 
'{print \$1}' | sort --parallel=${task.cpus} | uniq -c | sort -brg --parallel=${task.cpus} | awk '{print \$2","\$1}' >> reads_per_sample_preFilt_preClean.csv - rm ${params.projtag}_preFilt_preclean.fasta - fastp -i ${reads} -o ${params.projtag}_merged_preFilt_clean.fastq -b ${params.maxLen} -l ${params.minLen} --thread ${task.cpus} -n 1 - reformat.sh in=${params.projtag}_merged_preFilt_clean.fastq out=${params.projtag}_merged_preFilt_clean.fasta t=${task.cpus} - bbduk.sh in=${params.projtag}_merged_preFilt_clean.fastq out=${params.projtag}_merged_clean_Lengthfiltered.fastq minlength=${params.maxLen} maxlength=${params.maxLen} t=${task.cpus} - bbduk.sh in=${params.projtag}_merged_clean_Lengthfiltered.fastq bhist=${params.projtag}_all_merged_postFilt_baseFrequency_hist.txt qhist=${params.projtag}_all_merged_postFilt_qualityScore_hist.txt gchist=${params.projtag}_all_merged_postFilt_gcContent_hist.txt aqhist=${params.projtag}_all_merged_postFilt_averageQuaulity_hist.txt lhist=${params.projtag}_all_merged_postFilt_length_hist.txt gcbins=auto - for x in *postFilt*hist.txt;do - pre=\$(echo \$x | awk -F ".txt" '{print \$1}') - cat \$x | tr "\t" "," > \${pre}.csv - rm \$x - done - reformat.sh in=${params.projtag}_merged_clean_Lengthfiltered.fastq out=${params.projtag}_merged_clean_Lengthfiltered.fasta t=${task.cpus} - echo "sample,reads" >> read_per_sample_postFilt_postClean.csv - grep ">" ${params.projtag}_merged_clean_Lengthfiltered.fasta | awk -F ">" '{print \$2}' | awk -F "." '{print \$1}' | sort --parallel=${task.cpus} | uniq -c | sort -brg --parallel=${task.cpus} | awk '{print \$2","\$1}' >> read_per_sample_postFilt_postClean.csv - """ - } + input: + tuple nid, file(prot) from pcASV_aaMatrix_ch - process Extract_Uniques_DC { + output: + file("*.matrix") into pcASVaaMatrix + tuple nid, file("*PercentID.matrix") into potu_aa_heatmap - label 'norm_cpus' + script: + mtag="ID=" + nid + """ + name=\$( echo ${prot} | awk -F ".fasta" '{print \$1}') + clustalo -i ${prot} --distmat-out=\${name}_PairwiseDistanceq.matrix --full --force --threads=${task.cpus} + clustalo -i ${prot} --distmat-out=\${name}_PercentIDq.matrix --percent-id --full --force --threads=${task.cpus} + cat \${name}_PercentIDq.matrix | tr " " "," | grep "," >\${name}_PercentID.matrix + rm \${name}_PercentIDq.matrix + """ + } - publishDir "${params.workingdir}/${params.outdir}/DataCheck/ReadProcessing/ReadMerging/Uniques", mode: "copy", overwrite: true + if (!params.skipEMBOSS) { - input: - file(reads) from reads_vsearch2_ch + process pcASV_EMBOSS_Analyses { - output: - file("*unique_sequences.fasta") into reads_vsearch3_ch + label 'low_cpus' - script: - """ - vsearch --derep_fulllength ${reads} --sizeout --relabel_keep --output ${params.projtag}_unique_sequences.fasta - """ - } + tag "${mtag}" - process Identify_ASVs_DC { + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/2dStructure", mode: "copy", overwrite: true, pattern: '*.{garnier}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/HydrophobicMoment", mode: "copy", overwrite: true, pattern: '*HydrophobicMoments.{svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/IsoelectricPoint", mode: "copy", overwrite: true, pattern: '*IsoelectricPoint.{iep,svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/ProteinProperties", mode: "copy", overwrite: true, pattern: '*.{pepstats,pepinfo}' + publishDir 
"${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/ProteinProperties/Plots", mode: "copy", overwrite: true, pattern: '*PropertiesPlot.{svg}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/EMBOSS/2dStructure/Plots", mode: "copy", overwrite: true, pattern: '*Helical*.{svg}' - label 'norm_cpus' + input: + tuple nid, file(prot) from pcASVEMBOSS - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/ASVs/ChimeraCheck", mode: "copy", overwrite: true + output: + tuple file("*.garnier"), file("*HydrophobicMoments.svg"), file("*IsoelectricPoint*"), file("*.pepstats"), file("*PropertiesPlot*"), file("*Helical*") into pcASV_emboss - input: - file(reads) from reads_vsearch3_ch + script: + // check do I need for loop + mtag="ID=" + nid + """ + name=\$( echo ${prot} | awk -F ".fasta" '{print \$1}') + garnier -sequence ${prot} -outfile \${name}_2dStructures.garnier + hmoment -seqall ${prot} -graph svg -plot + mv hmoment.svg ./"\${name}"_HydrophobicMoments.svg + iep -sequence ${prot} -graph svg -plot -outfile "\${name}"_IsoelectricPoint.iep + mv iep.svg ./"\${name}"_IsoelectricPoint.svg + pepstats -sequence ${prot} -outfile \${name}_ProteinProperties.pepstats + grep ">" ${prot} | awk -F ">" '{print \$2}' > tmpsequence.list + for x in \$(cat tmpsequence.list);do + echo \$x > tmp1.list + seqtk subseq ${prot} tmp1.list > tmp2.fasta + len=\$(tail -1 tmp2.fasta | awk '{print length}') + pepinfo -sequence tmp2.fasta -graph svg -outfile "\$x"_PropertiesPlot.pepinfo + mv pepinfo.svg ./"\$x"_PropertiesPlot.svg + cat "\$x"_PropertiesPlot.pepinfo >> "\${name}"_PropertiesPlot.pepinfo + rm "\$x"_PropertiesPlot.pepinfo + pepnet -sask -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len + mv pepnet.svg ./"\$x"_HelicalNet.svg + pepwheel -sequence tmp2.fasta -graph svg -sbegin1 1 -send1 \$len + mv pepwheel.svg ./"\$x"_HelicalWheel.svg + rm tmp1.list tmp2.fasta + done + rm tmpsequence.list + """ + } + } - output: - file("*notChecked.fasta") into reads_vsearch4_ch + if (!params.skipTaxonomy) { - script: - """ - vsearch --cluster_unoise ${reads} --unoise_alpha ${params.alpha} --relabel ASV --centroids ${params.projtag}_notChecked.fasta --minsize ${params.minSize} - """ - } + if (params.dbtype == "NCBI") { - process Chimera_Check_DC { + process pcASV_AminoAcid_Taxonomy_Inference_NCBI { - label 'norm_cpus' + label 'high_cpus' - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/ASVs", mode: "copy", overwrite: true + tag "${mtag}" - input: - file(fasta) from reads_vsearch4_ch + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' - output: - file("*ASVs.fasta") into ( reads_vsearch5_ch, nucl2aa, asvsforAminotyping, asvfastaforcounts, asvaminocheck ) + input: + tuple nid, file(asvs) from pcASV_aaDiamond_ch - script: - """ - vsearch --uchime3_denovo ${fasta} --relabel ASV --nonchimeras ${params.projtag}_ASVs.fasta - """ - } + output: + file("*.fasta") into ( pcASV_labeledAA ) + tuple file("*phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_potuaadiamond + tuple nid, 
file("*_summary_for_plot.csv") into taxplot4 + tuple nid, file("*_quick_Taxbreakdown.csv") into tax_table_pcasvaa + tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_pcasvaa + + script: + mtag="ID=" + nid + """ + cp ${params.vampdir}/bin/rename_seq.py . + virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + if [[ ${params.ncbitax} == "true" ]] + then diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop staxids sskingdoms skingdoms sphylums --max-target-seqs 1 --max-hsps 1 + else diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + fi + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[pcASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + if [[ ${params.ncbitax} == "true" ]] + then echo "[NCBI Taxonomy ID],[Taxonomic classification from NCBI]" > ncbi_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Checking for \$s hit in diamond output" + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "." 
'{ print \$2 }' | awk -F "[" '{ print \$1 }' | awk -F " " '{print substr(\$0, index(\$0,\$2))}' | sed 's/ /_/g') &&
+ echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list
+ virus=\$(grep -w "\$acc" "\$headers" | awk -F "[" '{ print \$2 }' | awk -F "]" '{ print \$1 }'| sed 's/ /_/g')
+ echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list
+ echo ">"\${s}"_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+ if [[ "${params.lca}" == "T" ]]
+ then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]]
+ then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}')
+ lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}')
+ echo "\$lcla" >> lca_classification.list
+ else echo "Viruses" >> lca_classification.list
+ fi
+ fi
+ if [[ ${params.ncbitax} == "true" ]]
+ then echo "\$line" | awk -F "\t" '{print \$14","\$16"::"\$18"::"\$17}' >> ncbi_classification.list
+ fi
+ echo "\$s done."
+ else
+ echo "Ugh, there was no hit for \$s .."
+ echo "We still love \$s though and we will add it to the final fasta file"
+ echo "\$s" >> otu.list
+ echo "NO_HIT" >> access.list
+ echo "NO_HIT" >> "\$name"_genes.list
+ echo "NO_HIT" >> "\$name"_virus.list
+ echo "NO_HIT" >> evalue.list
+ echo "NO_HIT" >> bit.list
+ echo "NO_HIT" >> pid.list
+ echo "NO_HIT" >> length.list
+ virus="NO"
+ gene="HIT"
+ echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt
+ if [[ "${params.lca}" == "T" ]]
+ then echo "N/A" >> lca_classification.list
+ fi
+ if [[ "${params.ncbitax}" == "true" ]]
+ then echo "N/A" >> ncbi_classification.list
+ fi
+ echo "\$s done."
+ fi
+ done
+ echo "Now editing "\$name" fasta headers"
+ ###### rename_seq.py
+ ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta
+ awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta
+ echo "[Sequence header]" > newnames.list
+ cat new_"\$name"_asvnames.txt >> newnames.list
+ touch sequence.list
+ echo " " > sequence.list
+ grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list
+ rm "\$name"_tmpssasv.fasta
+ if [[ "${params.lca}" == "T" && "${params.ncbitax}" == "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.lca}" == "T" && "${params.ncbitax}" != "true" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv
+ paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv
+ paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv
+ elif [[ "${params.ncbitax}" == "true" && "${params.lca}" != "T" ]]
+ then
+ paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list
ncbi_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list ncbi_classification.list >> \${name}_quick_Taxbreakdown.csv + else + paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + echo "skipped" >> \${name}_quick_Taxbreakdown.csv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } else if (params.dbtype== "RVDB") { + + process pcASV_AminoAcid_Taxonomy_Inference_RVDB { + + label 'high_cpus' + + tag "${mtag}" + + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/SummaryFiles", mode: "copy", overwrite: true, pattern: '*.{csv,tsv}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy/DiamondOutput", mode: "copy", overwrite: true, pattern: '*dmd.{out}' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Taxonomy", mode: "copy", overwrite: true, pattern: '*.{fasta}' + + input: + tuple nid, file(asvs) from pcASV_aaDiamond_ch + + output: + file("*.fasta") into ( pcASV_labeledAA ) + tuple file("*phyloformat.csv"), file("*summaryTable.tsv"), file("*dmd.out") into summary_potuaadiamond + tuple nid, file("*_summary_for_plot.csv") into taxplot4 + tuple nid, file("*_quick_Taxbreakdown.csv") into tax_table_pcasvaa + tuple nid, file ("*_quicker_taxbreakdown.csv") into tax_nodCol_pcasvaa + + script: + mtag="ID=" + nid + """ + cp ${params.vampdir}/bin/rename_seq.py . 
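+ # Note: the DIAMOND -f 6 field order used below (qseqid qlen sseqid ... evalue bitscore pident btop) is
+ # parsed positionally by the awk calls in the summary loop (col 2 = qlen, col 3 = sseqid, col 10 = evalue,
+ # col 11 = bitscore, col 12 = pident), so the two must stay in sync.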
+ virdb=${params.dbdir}/${params.dbname} + if [[ ${params.measurement} == "bitscore" ]] + then measure="--min-score ${params.bitscore}" + elif [[ ${params.measurement} == "evalue" ]] + then measure="-e ${params.evalue}" + else measure="--min-score ${params.bitscore}" + fi + grep ">" \${virdb} > headers.list + headers="headers.list" + name=\$( echo ${asvs} | awk -F ".fasta" '{print \$1}') + diamond blastp -q ${asvs} -d \${virdb} -p ${task.cpus} --id ${params.minID} -l ${params.minaln} \${measure} --${params.sensitivity} -o "\$name"_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 + echo "Preparing lists to generate summary .csv's" + echo "[Best hit accession number]" > access.list + echo "[e-value]" > evalue.list + echo "[Bitscore]" > bit.list + echo "[Percent ID (aa)]" > pid.list + echo "[Organism ID]" > "\$name"_virus.list + echo "[Gene]" > "\$name"_genes.list + echo "[pcASV#]" > otu.list + echo "[Sequence length]" > length.list + grep ">" ${asvs} | awk -F ">" '{print \$2}' > seqids.lst + if [[ ${params.lca} == "T" ]] + then grep -w "LCA" ${params.dbanno}/*.txt > lcainfo.list + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + else echo "skipped" >> \${name}_quick_Taxbreakdown.csv + echo "[Taxonomic classification from RVDB annotations]" > lca_classification.list + fi + echo "extracting genes and names" + touch new_"\$name"_asvnames.txt + for s in \$(cat seqids.lst);do + echo "Using RVDB headers." + if [[ "\$(grep -wc "\$s" "\$name"_dmd.out)" -eq 1 ]];then + echo "Yep, there was a hit for \$s" + echo "Extracting the information now:" + acc=\$(grep -w "\$s" "\$name"_dmd.out | awk '{print \$3}' | awk -F "|" '{print \$3}') + echo "\$s" >> otu.list + echo "\$acc" >> access.list + line="\$(grep -w "\$s" "\$name"_dmd.out)" + echo "\$line" | awk '{print \$10}' >> evalue.list + echo "\$line" | awk '{print \$11}' >> bit.list + echo "\$line" | awk '{print \$12}' >> pid.list + echo "\$line" | awk '{print \$2}' >> length.list + echo "Extracting virus and gene ID for \$s now" + gene=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$1 }' | sed 's/ /_/g') && + echo "\$gene" | sed 's/_/ /g' >> "\$name"_genes.list + virus=\$(grep -w "\$acc" "\$headers" | awk -F "|" '{ print \$6 }' | awk -F "[" '{ print \$2 }' | awk -F "]" '{print \$1}' | sed 's/ /_/g') && + echo "\$virus" | sed 's/_/ /g' >> "\$name"_virus.list + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then if [[ \$(grep -w "\$acc" ${params.dbanno}/*.txt | wc -l) -eq 1 ]] + then group=\$(grep -w "\$acc" ${params.dbanno}/*.txt | awk -F ":" '{print \$1}') + lcla=\$(grep -w "\$group" lcainfo.list | awk -F "\t" '{print \$2}') + echo "\$lcla" >> lca_classification.list + else echo "Viruses" >> lca_classification.list + fi + fi + echo "\$s done." + else + echo "Ugh, there was no hit for \$s .." + echo "We still love \$s though and we will add it to the final fasta file" + echo "\$s" >> otu.list + echo "NO_HIT" >> access.list + echo "NO_HIT" >> "\$name"_genes.list + echo "NO_HIT" >> "\$name"_virus.list + echo "NO_HIT" >> evalue.list + echo "NO_HIT" >> bit.list + echo "NO_HIT" >> pid.list + echo "NO_HIT" >> length.list + virus="NO" + gene="HIT" + echo ">\${s}_"\$virus"_"\$gene"" >> new_"\$name"_asvnames.txt + if [[ "${params.lca}" == "T" ]] + then echo "N/A" >> lca_classification.list + fi + echo "\$s done." 
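+ # Sequences with no database hit are kept: every summary column gets "NO_HIT" and the sequence is
+ # renamed <seqID>_NO_HIT so it still appears in the labeled fasta and summary tables.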
+ fi + echo "Done with \$s" + done + echo "Now editing "\$name" fasta headers" + ###### rename_seq.py + ./rename_seq.py ${asvs} new_"\$name"_asvnames.txt "\$name"_TaxonomyLabels.fasta + awk 'BEGIN {RS=">";FS="\\n";OFS=""} NR>1 {print ">"\$1; \$1=""; print}' "\$name"_TaxonomyLabels.fasta >"\$name"_tmpssasv.fasta + echo "[Sequence header]" > newnames.list + cat new_"\$name"_asvnames.txt >> newnames.list + touch sequence.list + echo " " > sequence.list + grep -v ">" "\$name"_tmpssasv.fasta >> sequence.list + rm "\$name"_tmpssasv.fasta + if [[ "${params.lca}" == "T" ]] + then paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list lca_classification.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + paste -d"," otu.list access.list "\$name"_virus.list "\$name"_genes.list lca_classification.list >> \${name}_quick_Taxbreakdown.csv + else paste -d "," sequence.list "\$name"_virus.list "\$name"_genes.list otu.list newnames.list length.list bit.list evalue.list pid.list access.list >> "\$name"_phyloformat.csv + paste -d"\t" otu.list access.list "\$name"_virus.list "\$name"_genes.list sequence.list length.list bit.list evalue.list pid.list >> "\$name"_summaryTable.tsv + fi + for x in *phyloformat.csv;do + echo "\$x" + lin=\$(( \$(wc -l \$x | awk '{print \$1}')-1)) + tail -"\$lin" \$x | awk -F "," '{print \$2}' > tmpcol.list; + sed 's/ /_/g' tmpcol.list > tmp2col.list; + cat tmp2col.list | sort | uniq -c | sort -nr | awk '{print \$2","\$1}' > \${name}_summary_for_plot.csv; + rm tmpcol.list tmp2col.list + done + awk -F "," '{print \$1","\$3"("\$2")"}' \${name}_quick_Taxbreakdown.csv >> \${name}_quicker_taxbreakdown.csv + rm evalue.list sequence.list bit.list pid.list length.list seqids.lst otu.list *asvnames.txt "\$name"_virus.list "\$name"_genes.list newnames.list access.list headers.list + """ + } + } + } - process NucleotideBased_ASV_clustering_DC { + if (!params.skipPhylogeny) { - label 'norm_cpus' + process pcASV_Protein_Phylogeny { - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/Nucleotide", mode: "copy", overwrite: true, pattern: '*{.csv}' + label 'norm_cpus' - input: - file(fasta) from reads_vsearch5_ch + tag "${mtag}" - output: - file("number_per_percentage_nucl.csv") into number_per_percent_nucl_plot + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/Alignment", mode: "copy", overwrite: true, pattern: '*aln.*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/Modeltest", mode: "copy", overwrite: true, pattern: '*mt*' + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Phylogeny/IQ-TREE", mode: "copy", overwrite: true, pattern: '*iq*' - script: - if (params.datacheckntIDlist) { - """ - for id in `echo ${params.datacheckntIDlist} | tr "," "\\n"`;do - vsearch --cluster_fast ${fasta} --centroids ${params.projtag}_ncASV\${id}.fasta --threads ${task.cpus} --relabel OTU --id \${id} - done - for x in *ncASV*.fasta;do - id=\$( echo \$x | awk -F "_ncASV" '{print \$2}' | awk -F ".fasta" '{print \$1}') - numb=\$( grep -c ">" \$x ) - echo "\${id},\${numb}" >> number_per_percentage_nucl.csv - done - yo=\$(grep -c ">" ${fasta}) - echo "1.0,\${yo}" >> number_per_percentage_nucl.csv - """ - } - } + input: + 
tuple nid, file(prot) from pcASV_aaMafft_ch + output: + tuple file("*_aln.fasta"), file("*_aln.html"), file("*.tree"), file("*.log"), file("*iq*"), file("*mt*") into pcASV_protein_phylogeny_results + tuple nid, file("*iq.treefile") into potu_Atree_plot - if (params.sing) { + script: + mtag="ID=" + nid + """ + pre=\$( echo ${prot} | awk -F ".fasta" '{print \$1}' ) + if [[ \$(grep -c ">" ${prot}) -gt 499 ]]; then algo="super5"; else algo="mpc"; fi + ${tools}/muscle5.0.1278_linux64 -"\${algo}" ${prot} -out \${pre}_ALN.fasta -threads ${task.cpus} -quiet + trimal -in \${pre}_ALN.fasta -out \${pre}_aln.fasta -keepheader -fasta -automated1 -htmlout \${pre}_aln.html + o-trim-uninformative-columns-from-alignment \${pre}_aln.fasta + mv \${pre}_aln.fasta-TRIMMED ./\${pre}_Aligned_informativeonly.fasta + # pcASV_Protein_ModelTest + modeltest-ng -i \${pre}_Aligned_informativeonly.fasta -p ${task.cpus} -o \${pre}_mt -d aa -s 203 --disable-checkpoint - process Translating_For_ProteinClustering_DC { + # pcASV_Protein_Phylogeny + if [ "${params.iqCustomaa}" != "" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq --redo -T auto ${params.iqCustomaa} - label 'low_cpus' + elif [[ "${params.ModelTaa}" != "false" && "${params.nonparametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -b ${params.boots} - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/Aminoacid/translation", mode: "copy", overwrite: true + elif [[ "${params.ModelTaa}" != "false" && "${params.parametric}" != "false" ]];then + mod=\$(tail -12 \${pre}_Aligned_informativeonly.fasta.log | head -1 | awk '{print \$6}') + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m \${mod} --redo -nt auto -bb ${params.boots} -bnni - input: - file(fasta) from nucl2aa + elif [ "${params.nonparametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -b ${params.boots} - output: - file("*ASVprotforclust.fasta") into clustering_aa - file("*_translation_report") into reportaa_VR - file("*_ASV_all.fasta") into asvfastaforaaclust + elif [ "${params.parametric}" != "false" ];then + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni - script: - """ - conda init && source activate virtualribosome + else + iqtree -s \${pre}_Aligned_informativeonly.fasta --prefix \${pre}_iq -m MFP --redo -nt auto -bb ${params.boots} -bnni + fi + """ + } + } - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASVprotforclust.fasta --report ${params.projtag}_translation_report - cp ${fasta} ${params.projtag}_ASV_all.fasta + process Generate_pcASV_Protein_Counts { - """ + label 'high_cpus' - } + tag "${mtag}" - } else { + publishDir "${params.workingdir}/${params.outdir}/Analyze/Analyses/pcASV/Aminoacid/Counts", mode: "copy", overwrite: true - process Translation_For_ProteinBased_Clustering_DC { + input: + tuple nid, file(fasta) from pcASV_aaCounts_ch + file(merged) from mergeforpcASVaacounts + file(samplist) from samplistpotu - label 'norm_cpus' + output: + tuple file("*_counts.csv"), file("*dmd.out") into potuaacounts_summary + tuple nid, file("*counts.csv") into potu_Acounts - conda 'python=2.7' + script: + // check do I need for loop + mtag="ID=" + nid + """ + set +e + potu="\$( echo ${fasta} | 
awk -F "_" '{print \$3}')" + diamond makedb --in ${fasta} --db ${fasta} + diamond blastx -q ${merged} -d ${fasta} -p ${task.cpus} --min-score ${params.ProtCountsBit} --id ${params.ProtCountID} -l ${params.ProtCountsLength} --${params.sensitivity} -o ${params.projtag}_\${potu}_Counts_dmd.out -f 6 qseqid qlen sseqid qstart qend qseq sseq length qframe evalue bitscore pident btop --max-target-seqs 1 --max-hsps 1 --max-hsps 1 + echo "OTU_ID" >tmp.col1.txt + echo "Generating sample id list" + grep ">" ${fasta} | awk -F ">" '{print \$2}' | sort | uniq > otuid.list + cat otuid.list >> tmp.col1.txt + echo "Beginning them counts tho my g" + for y in \$( cat ${samplist} );do + echo "Starting with \$y now ..." + grep "\$y" ${params.projtag}_\${potu}_Counts_dmd.out > tmp."\$y".out + echo "Isolated hits" + echo "Created uniq subject id list" + echo "\$y" > "\$y"_col.txt + echo "Starting my counts" + for z in \$(cat otuid.list);do + echo "Counting \$z hits" + echo "grep -wc "\$z" >> "\$y"_col.txt" + grep -wc "\$z" tmp."\$y".out >> "\$y"_col.txt + echo "\$z counted" + done + done + paste -d "," tmp.col1.txt *col.txt > ${params.projtag}_aminoacid_\${potu}_noTaxonomy_counts.csv + rm tmp* + rm *col.txt + """ + } + } - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/Aminoacid/translation", mode: "copy", overwrite: true + if (!params.skipReport) { - input: - file(fasta) from nucl2aa + if (!params.skipAdapterRemoval || !params.skipReadProcessing || !params.skipMerging) { - output: - file("*ASVprotforclust.fasta") into clustering_aa - file("*_translation_report") into reportaa_VR - file("*_ASV_all.fasta") into asvfastaforaaclust + process combine_csv { - script: - """ - ${tools}/virtualribosomev2/dna2pep.py ${fasta} -r all -x -o none --fasta ${params.projtag}_ASVprotforclust.fasta --report ${params.projtag}_translation_report - cp ${fasta} ${params.projtag}_ASV_all.fasta - """ - } - } + input: + file(csv) from fastp_csv_in2 + .collect() - process Protein_clustering_DC { + output: + file("final_reads_stats.csv") into fastp_csv_in - label 'norm_cpus' + script: + """ + cat ${csv} >all_reads_stats.csv + head -n1 all_reads_stats.csv >tmp.names.csv + cat all_reads_stats.csv | grep -v ""Sample,Total_"" >tmp.reads.stats.csv + cat tmp.names.csv tmp.reads.stats.csv >final_reads_stats.csv + rm tmp.names.csv tmp.reads.stats.csv + """ + } + } else { - publishDir "${params.workingdir}/${params.outdir}/DataCheck/Clustering/Aminoacid", mode: "copy", overwrite: true, pattern: '*{.csv}' + process skip_combine_csv { - input: - file(fasta) from clustering_aa - file(asvs) from asvfastaforaaclust + output: + file("filter_reads.txt") into fastp_csv_in - output: - file("number_per_percentage_prot.csv") into number_per_percent_prot_plot + script: + """ + echo "Read processing steps skipped." >filter_reads.txt + """ + } + } - script: - // add awk script to count seqs - """ - set +e - cp ${params.vampdir}/bin/rename_seq.py . 
- for id in `echo ${params.datacheckaaIDlist} | tr "," "\\n"`;do - if [ \${id} == ".55" ];then - word=3 - elif [ \${id} == ".65" ];then - word=4 - else - word=5 - fi - awk 'BEGIN{RS=">";ORS=""}length(\$2)>="${params.minAA}"{print ">"\$0}' ${fasta} > ${params.projtag}_filtered_proteins.fasta - cd-hit -i ${params.projtag}_filtered_proteins.fasta -n \${word} -c \${id} -o ${params.projtag}_pcASV\${id}.fasta - sed 's/>Cluster />Cluster_/g' ${params.projtag}_pcASV\${id}.fasta.clstr >${params.projtag}_pcASV\${id}.clstr - grep ">Cluster_" ${params.projtag}_pcASV\${id}.clstr >temporaryclusters.list - y=\$(grep -c ">Cluster_" ${params.projtag}_pcASV\${id}.clstr) - echo ">Cluster_"\${y}"" >> ${params.projtag}_pcASV\${id}.clstr - t=1 - b=1 - for x in \$(cat temporaryclusters.list);do - echo "Extracting \$x" - name="\$( echo \$x | awk -F ">" '{print \$2}')" - clust="pcASV"\${t}"" - echo "\${name}" - awk '/^>'\${name}'\$/,/^>Cluster_'\${b}'\$/' ${params.projtag}_pcASV\${id}.clstr > "\${name}"_"\${clust}"_tmp.list - t=\$(( \${t}+1 )) - b=\$(( \${b}+1 )) - done - ls *_tmp.list - u=1 - for x in *_tmp.list;do - name="\$(echo \$x | awk -F "_p" '{print \$1}')" - echo "\${name}" - cluster="\$(echo \$x | awk -F "_" '{print \$3}')" - echo "\${cluster}" - grep "ASV" \$x | awk -F ", " '{print \$2}' | awk -F "_" '{print \$1}' | awk -F ">" '{print \$2}' > \${name}_\${cluster}_seqs_tmps.list - seqtk subseq ${asvs} \${name}_\${cluster}_seqs_tmps.list > \${name}_\${cluster}_nucleotide_sequences.fasta - vsearch --cluster_fast \${name}_\${cluster}_nucleotide_sequences.fasta --id 0.2 --centroids \${name}_\${cluster}_centroids.fasta - grep ">" \${name}_\${cluster}_centroids.fasta >> \${name}_\${cluster}_tmp_centroids.list - for y in \$( cat \${name}_\${cluster}_tmp_centroids.list );do - echo ">\${cluster}_type"\$u"" >> \${name}_\${cluster}_tmp_centroid.newheaders - u=\$(( \${u}+1 )) - done - u=1 - ./rename_seq.py \${name}_\${cluster}_centroids.fasta \${name}_\${cluster}_tmp_centroid.newheaders \${cluster}_types_labeled.fasta - done - cat *_types_labeled.fasta >> ${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$3}' | awk -F "." 
'{print \$1}' >tmphead.list - grep -w "*" ${params.projtag}_pcASV\${id}.clstr | awk '{print \$2}' | awk -F "," '{print \$1}' >tmplen.list - paste -d"," temporaryclusters.list tmphead.list >tmp.info.csv - grep ">" ${params.projtag}_pcASV\${id}.fasta >lala.list - j=1 - for x in \$(cat lala.list);do - echo ">${params.projtag}_pcASV\${j}" >>${params.projtag}_aminoheaders.list - echo "\${x},>${params.projtag}_pcASV\${j}" >>tmpaminotype.info.csv - j=\$(( \${j}+1 )) - done - rm lala.list - awk -F "," '{print \$2}' tmp.info.csv >>tmporder.list - for x in \$(cat tmporder.list);do - grep -w "\$x" tmpaminotype.info.csv | awk -F "," '{print \$2}' >>tmpder.list - done - paste -d "," temporaryclusters.list tmplen.list tmphead.list tmpder.list >${params.projtag}_pcASVCluster\${id}_summary.csv - ./rename_seq.py ${params.projtag}_pcASV\${id}.fasta ${params.projtag}_aminoheaders.list ${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta - stats.sh in=${params.projtag}_aminoacid_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_aminoacid_clustered.gc gcformat=4 overwrite=true - stats.sh in=${params.projtag}_nucleotide_pcASV\${id}_noTaxonomy.fasta gc=${params.projtag}_pcASV\${id}_nucleotide_clustered.gc gcformat=4 overwrite=true - awk 'BEGIN{RS=">";ORS=""}length(\$2)<"${params.minAA}"{print ">"\$0}' ${fasta} >${params.projtag}_pcASV\${id}_problematic_translations.fasta - if [ `wc -l ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk '{print \$1}'` -gt 1 ];then - grep ">" ${params.projtag}_pcASV\${id}_problematic_translations.fasta | awk -F ">" '{print \$2}' > problem_tmp.list - seqtk subseq ${asvs} problem_tmp.list > ${params.projtag}_pcASV\${id}_problematic_nucleotides.fasta - else - rm ${params.projtag}_pcASV\${id}_problematic_translations.fasta - fi - rm *.list - rm Cluster* - rm *types* - rm *tmp* - rm ${params.projtag}_pcASV\${id}.fast* - done - for x in *aminoacid*noTaxonomy.fasta;do - id=\$( echo \$x | awk -F "_noTax" '{print \$1}' | awk -F "pcASV" '{print \$2}') - numb=\$( grep -c ">" \$x) - echo "\${id},\${numb}" >> number_per_percentage_protz.csv - done - yesirr=\$( wc -l number_per_percentage_protz.csv | awk '{print \$1}') - tail -\$(( \${yesirr}-1 )) number_per_percentage_protz.csv > number_per_percentage_prot.csv - head -1 number_per_percentage_protz.csv >> number_per_percentage_prot.csv - rm number_per_percentage_protz.csv - """ - } + //NEW REPORT !!!!!!!!!!!!!!!!! 
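+ // Each report_* channel below bundles one complete set of per-type result files (counts table,
+ // taxonomy summary, percent-ID matrix, treefile, grouping/MED and taxonomy breakdown tables)
+ // using mix() with buffer()/groupTuple(), so the Report process receives and renders one
+ // complete file set per ASV/ncASV/pcASV/AminoType analysis.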
+ /*Report_ASV
+ asv_counts_plots -> ${params.projtag}_ASV_counts.csv
+ taxplot_asv -> ${params.projtag}_ASV_summary_for_plot.csv
+ asv_heatmap -> ${params.projtag}_ASV_PercentID.matrix
+ nucl_phyl_plot_asv -> ${params.projtag}_ASV_iq.treefile
+ asvgroupscsv -> *_ASV_Grouping.csv
+ asvgroupcounts -> ${params.projtag}_ASV_Groupingcounts.csv
+ tax_table_asv -> *_quick_Taxbreakdown.csv
+ asv_group_rep_tree -> ${params.projtag}_ASV_Group_Reps_iq.treefile
+ tax_nodCol_asv -> *_quicker_taxbreakdown.csv
+ */
+
+ report_asv = Channel.create()
+ asv_counts_plots.mix(taxplot_asv, asv_heatmap, nucl_phyl_plot_asv, asvgroupscsv, asvgroupcounts, asv_group_rep_tree, tax_table_asv, tax_nodCol_asv).flatten().buffer(size:9).dump(tag:'asv').into(report_asv)
+
+ if (params.ncASV) {
+ report_ncasv = Channel.create()
+ notu_counts_plots.mix(taxplot_ncasv, notu_heatmap, nucl_phyl_plot_ncasv, tax_table_ncasv, tax_nodCol_ncasv).groupTuple(by:0, size:6).dump(tag:'ncasv').into(report_ncasv)
+ /*
+ notu_counts_plots -> ${params.projtag}_ncASV${id}_counts.csv
+ taxplot_ncasv -> ${params.projtag}_ncASV${id}_summary_for_plot.csv
+ notu_heatmap -> ${params.projtag}_ncASV${id}_PercentID.matrix
+ nucl_phyl_plot_ncasv -> ${params.projtag}_ncASV${id}_iq.treefile
+ tax_table_ncasv -> ${params.projtag}_ncASV${id}_quick_Taxbreakdown.csv
+ tax_nodCol_ncasv -> *_quicker_taxbreakdown.csv
+ */
+ } else {
+ report_ncasv = Channel.empty()
+ }
- process combine_csv_DC {
+ if (params.pcASV) {
+ report_pcasv_aa = Channel.create()
+ potu_Acounts.mix(taxplot4, potu_aa_heatmap, potu_Atree_plot, tax_table_pcasvaa, tax_nodCol_pcasvaa).groupTuple(by:0, size:6).dump(tag:'pcasv1').into(report_pcasv_aa)
+ /*Report_pcASV_AminoAcid
+ potu_Acounts -> ${params.projtag}_pcASV${id}_noTaxonomy_counts.csv
+ taxplot4 -> ${params.projtag}_aminoacid_pcASV${id}_noTaxonomy_summary_for_plot.csv
+ potu_aa_heatmap -> ${params.projtag}_aminoacid_pcASV${id}_noTaxonomy_PercentID.matrix
+ potu_Atree_plot -> ${params.projtag}_aminoacid_pcASV${id}_noTaxonomy_iq.treefile
+ tax_table_pcasvaa -> ${params.projtag}_aminoacid_pcASV${id}_quick_Taxbreakdown.csv
+ tax_nodCol_pcasvaa -> *_quicker_taxbreakdown.csv
+ */
+ report_pcasv_nucl = Channel.create()
+ potu_Ncounts_for_report.mix(taxplot3, potu_nucl_heatmap, potu_Ntree_plot, tax_table_pcasvnt, tax_nodCol_pcasvnt).groupTuple(by:0, size:6).dump(tag:'pcasv2').into(report_pcasv_nucl)
+ /*Report_pcASV_Nucleotide
+ potu_Ncounts_for_report -> ${params.projtag}_nucleotide_pcASV${id}_noTaxonomy_counts.csv
+ taxplot3 -> ${params.projtag}_nucleotide_pcASV${id}_noTaxonomy_summary_for_plot.csv
+ potu_nucl_heatmap -> ${params.projtag}_nucleotide_pcASV${id}_noTaxonomy_PercentID.matrix
+ potu_Ntree_plot -> ${params.projtag}_nucleotide_pcASV${id}_noTaxonomy_iq.treefile
+ tax_table_pcasvnt -> ${params.projtag}_nucleotide_pcASV${id}_quick_Taxbreakdown.csv
+ tax_nodCol_pcasvnt -> *_quicker_taxbreakdown.csv
+ */
+ } else {
+ report_pcasv_aa = Channel.empty()
+ report_pcasv_nucl = Channel.empty()
+ }
- input:
- file(csv) from fastp_csv
- .collect()
+ if (!params.skipAminoTyping) {
+ report_aminotypes = Channel.create()
+ aminocounts_plot.mix(taxplot2, aminotype_heatmap, amino_rax_plot, atygroupscsv, amino_group_rep_tree, amino_groupcounts, tax_table_amino, tax_nodCol_amino).flatten().buffer(size:9).dump(tag:'amino').into(report_aminotypes)
+ /*
+ Report_AminoTypes
+ aminocounts_plot -> ${params.projtag}_AminoType_counts.csv
+ taxplot2 -> ${params.projtag}_AminoTypes_summary_for_plot.csv
+ aminotype_heatmap -> ${params.projtag}_AminoTypes_PercentID.matrix
+ amino_rax_plot -> ${params.projtag}_AminoTypes_iq.treefile
+ atygroupscsv -> *_AminoType_Grouping.csv
+ amino_group_rep_tree -> ${params.projtag}_AminoType_Group_Reps_iq.treefile
+ amino_groupcounts -> ${params.projtag}_AminoType_Groupingcounts.csv
+ tax_table_amino -> *_quick_Taxbreakdown.csv
+ tax_nodCol_amino -> *_quicker_taxbreakdown.csv
+ */
+ } else {
+ report_aminotypes = Channel.empty()
+ }
- output:
- file("final_reads_stats.csv") into fastp_csv1
+ report_all_ch = Channel.create()
+ report_asv.mix(report_ncasv, report_pcasv_aa, report_pcasv_nucl, report_aminotypes).map{it.flatten()}.dump(tag:'report').into(report_all_ch)
- script:
- """
- cat ${csv} >all_reads_stats.csv
- head -n1 all_reads_stats.csv >tmp.names.csv
- cat all_reads_stats.csv | grep -v ""Sample,Total_"" >tmp.reads.stats.csv
- cat tmp.names.csv tmp.reads.stats.csv >final_reads_stats.csv
- rm tmp.names.csv tmp.reads.stats.csv
- """
+ process Report {
- }
+ label 'norm_cpus'
- process Report_DataCheck {
+ publishDir "${params.workingdir}/${params.outdir}/Analyze/FinalReports", mode: "copy", overwrite: true
- label 'norm_cpus'
+ input:
+ file(csv) from fastp_csv_in
+ file(files) from report_all_ch
- publishDir "${params.workingdir}/${params.outdir}/DataCheck/Report", mode: "copy", overwrite: true, pattern: '*.{html}'
+ output:
+ file("*.html") into report_all_out
- input:
- file(fastpcsv) from fastp_csv1
- file(reads_per_sample_preFilt) from reads_per_sample_preFilt
- file(read_per_sample_postFilt) from reads_per_sample_postFilt
- file(preFilt_baseFrequency) from prefilt_basefreq
- file(postFilt_baseFrequency) from postFilt_basefreq
- file(preFilt_qualityScore) from prefilt_qualityscore
- file(postFilt_qualityScore) from postFilt_qualityscore
- file(preFilt_gcContent) from prefilt_gccontent
- file(postFilt_gcContent) from postFilt_gccontent
- file(preFilt_averageQuality) from prefilt_averagequality
- file(postFilt_averageQuaulity) from postFilt_averagequality
- file(preFilt_length) from prefilt_length
- file(postFilt_length) from postFilt_length
- file(number_per_percentage_nucl) from number_per_percent_nucl_plot
- file(number_per_percentage_prot) from number_per_percent_prot_plot
+ script:
+ """
+ name=\$( ls *summary_for_plot.csv | awk -F "_summary_for_plot.csv" '{print \$1}')
+ type=\$( ls *_counts.csv | awk -F "${params.projtag}" '{print \$2}' | awk -F "_" '{print \$2}' )
+ cp ${params.vampdir}/bin/vAMPirus_Report.Rmd .
+ cp ${params.vampdir}/example_data/conf/vamplogo.png .
+ Rscript -e "rmarkdown::render('vAMPirus_Report.Rmd',output_file='\${name}_Report.html')" \${name} \
+ ${params.skipReadProcessing} \
+ ${params.skipMerging} \
+ ${params.skipAdapterRemoval} \
+ ${params.skipTaxonomy} \
+ ${params.skipPhylogeny} \
+ ${params.trymax} \
+ ${params.stats} \
+ ${params.metadata} \
+ ${params.minimumCounts} \
+ ${params.asvMED} \
+ ${params.aminoMED} \
+ \${type} \
+ ${params.nodeCol}
+ """
+ }
- output:
- file("*.html") into datacheckreport
+ }
- script:
- """
- cp ${params.vampdir}/bin/vAMPirus_DC_Report.Rmd .
- cp ${params.vampdir}/example_data/conf/vamplogo.png .
- Rscript -e "rmarkdown::render('vAMPirus_DC_Report.Rmd',output_file='${params.projtag}_DataCheck_Report.html')" ${params.projtag} \ - ${fastpcsv} \ - ${reads_per_sample_preFilt} \ - ${read_per_sample_postFilt} \ - ${preFilt_baseFrequency} \ - ${postFilt_baseFrequency} \ - ${preFilt_qualityScore} \ - ${postFilt_qualityScore} \ - ${preFilt_averageQuality} \ - ${postFilt_averageQuaulity} \ - ${preFilt_length} \ - ${postFilt_length} \ - ${number_per_percentage_nucl} \ - ${number_per_percentage_prot} - """ - } + } } else { println("\n\t\033[0;31mMandatory argument not specified. For more info use `nextflow run vAMPirus.nf --help`\n\033[0m") exit 0 } - -workflow.onComplete { - log.info ( workflow.success ? \ - "---------------------------------------------------------------------------------" \ - + "\n\033[0;32mDone! Open the following reports in your browser\033[0m" \ - + "\n\033[0;32mPipeline performance report: ${params.workingdir}/${params.outdir}/${params.tracedir}/vampirus_report.html\033[0m" \ - + "\n\033[0;32mvAMPirus --DataCheck interactive report: ${params.workingdir}/${params.outdir}/DataCheck/*.hmtl\033[0m" \ - + "\n\033[0;32mvAMPirus --Analyze interactive report: ${params.workingdir}/${params.outdir}/Analyze/*.hmtl\033[0m" \ - : \ - "---------------------------------------------------------------------------------" \ - + "\n\033[0;31mSomething went wrong. Check error message below and/or log files.\033[0m" ) +if (params.DataCheck) { + workflow.onComplete { + log.info ( workflow.success ? \ + "---------------------------------------------------------------------------------" \ + + "\n\033[0;32mDone! Open the following reports in your browser\033[0m" \ + + "\n\033[0;32mPipeline performance report: ${params.workingdir}/${params.outdir}/${params.tracedir}/vampirus_report.html\033[0m" \ + + "\n\033[0;32mvAMPirus --DataCheck interactive report: ${params.workingdir}/${params.outdir}/DataCheck/Report/*.hmtl\033[0m" \ + : \ + "---------------------------------------------------------------------------------" \ + + "\n\033[0;31mSomething went wrong. Check error message below and/or log files.\033[0m" ) + } +} else if (params.Analyze) { + workflow.onComplete { + log.info ( workflow.success ? \ + "---------------------------------------------------------------------------------" \ + + "\n\033[0;32mDone! Open the following reports in your browser\033[0m" \ + + "\n\033[0;32mPipeline performance report: ${params.workingdir}/${params.outdir}/${params.tracedir}/vampirus_report.html\033[0m" \ + + "\n\033[0;32mvAMPirus --Analyze interactive report: ${params.workingdir}/${params.outdir}/Analyze/*.hmtl\033[0m" \ + : \ + "---------------------------------------------------------------------------------" \ + + "\n\033[0;31mSomething went wrong. Check error message below and/or log files.\033[0m" ) + } } diff --git a/vampirus.config b/vampirus.config index 33f6731..d580af1 100644 --- a/vampirus.config +++ b/vampirus.config @@ -1,10 +1,10 @@ /* -================================================================================================ - Configuration File vAMPirus -================================================================================================ - vAMPirus - Author: Alex J. 
Veglia and Ramón Rivera-Vicéns ------------------------------------------------------------------------------------------------- +============================================================================================================================================================= + Configuration File vAMPirus +============================================================================================================================================================= + vAMPirus + Author: Alex J. Veglia and Ramón Rivera-Vicéns +------------------------------------------------------------------------------------------------------------------------------------------------------------- */ params { @@ -23,24 +23,15 @@ params { // Name of directory created to store output of vAMPirus analyses (Nextflow will create this directory in the working directory) outdir="results" - // Merged read length filtering parameters - - // Minimum merged read length - reads below the specified maximum read length will be used for counts only - minLen="400" - // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs (safe to set at expected amplicon size to start) - maxLen="420" - // Maximum expected error for vsearch merge command - maxEE="1" - // Primer Removal parameters // If not specifying primer sequences, forward and reverse reads will be trimmed by number of bases specified using "--GlobTrim #basesfromforward,#basesfromreverse" GlobTrim="" - // Specific primer sequence on forward reads to be removed + // Specific primer sequence on forward reads to be removed -- NOTE - bbduk.sh which is used to trim the primers does not recognize Inosine (I) in the primer sequence, replace "I" with "N" in the sequence. It recognizes all other IUPAC degenerate base codes. fwd="" - // Reverse primer sequence + // Reverse primer sequence -- NOTE - bbduk.sh which is used to trim the primers does not recognize Inosine (I) in the primer sequence, replace "I" with "N" in the sequence. It recognizes all other IUPAC degenerate base codes. rev="" - // Path to fasta file with primer sequences to remove (need to specify if using --multi option ) + // Path to fasta file with primer sequences to remove (need to specify if using --multi option ) -- NOTE - bbduk.sh which is used to trim the primers does not recognize Inosine (I) in the primer sequence, replace "I" with "N" in the sequence. It recognizes all other IUPAC degenerate base codes. 
primers="/PATH/TO/PRIMERS.fasta" // Primer length (default 26)- If trimming primers with the --multi option or by specifying primer sequences above, change this to the length of the longer of the two primer sequences primerLength="26" @@ -48,30 +39,76 @@ params { maxkmer="13" // Minimum kmer length for primer removal (default = 3) minkmer="3" - // Minimum read length after adapter and primer removal (default = 200) - minilen="200" + // Minimum non-merged read length after adapter and primer removal (default = 200) + minilen="100" + + + // Merged read length filtering parameters + + // Minimum merged read length - reads with lengths greater than minLen and below the specified maximum read length will be used for counts only + minLen="400" + // Maximum merged read length - reads with length equal to the specified max read length will be used to generate uniques and ASVs (safe to set at expected amplicon size to start) + maxLen="420" + // Maximum expected error for vsearch merge command - vsearch discard sequences with more than the specified number of expected errors + maxEE="3" + // Maximum number of non-matching nucleotides allowed in overlap region + diffs="10" + // Maximum number of "N"'s in a sequence - if above the specified value, sequence will be discarded + maxn="20" + // Minimum length of overlap for sequence merging to occur for a pair + minoverlap="10" // ASV generation and clustering parameters // Alpha value for denoising - the higher the alpha the higher the chance of false positives in ASV generation (1 or 2) alpha="1" - // Minimum size or representation for sequence to be considered in ASV generation + // Minimum size or representation in dataset for sequence to be considered in ASV generation (ex. If set to 4, any unique sequence that is not seen in the data more 3 times is removed) minSize="8" - // Percent similarity to cluster nucleotide ASV sequences - clusterNuclID=".85" - // List of percent similarities to cluster nucleotide ASV sequences - must be separated by a comma (ex. ".95,.96") + // Percent similarity to cluster nucleotide ASV sequences (used when --ncASV is set) + clusterNuclID="85" + // List of percent similarities to cluster nucleotide ASV sequences - must be separated by a comma (ex. "95,96") clusterNuclIDlist="" // Default percent similarity to cluster aminoacid sequences - clusterAAID=".97" - // List of percent similarities to cluster aminoacid sequences - must be separated by ".95,.96" + clusterAAID="97" + // List of percent similarities to cluster aminoacid sequences - must be separated by "95,96" clusterAAIDlist="" - // Minimum length of amino acid translation to be considered during protein clustered ASV (pcASV) generation. Recommended to put this at the expected aminoacid sequence length based on your maximum read length (e.g. if maxLen="420", then minAA should be 420/3 or 140) + // Minimum length of amino acid translation to be considered during protein clustered ASV (pcASV) generation. Recommended to put this at the expected amino acid sequence length based on your maximum read length (e.g. 
if maxLen="420", then minAA should be 420/3 so 140) minAA="140" + // ASV filtering parameters - You can set the filtering to run with the command --filter + + // Path to database containing sequences that if ASVs match, are then removed prior to any analyses + filtDB="" + // Path to database containing sequences that if ASVs match to, are kept for final ASV file to be used in susequent analyses + keepDB="" + // Keep any sequences without hits - for yes, set keepnohit to ="true" + keepnohit="true" + + //Parameters for diamond command for filtering + + // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment + filtminID="80" + // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment + filtminaln="30" + // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive) + filtsensitivity="ultra-sensitive" + // Set the max e-value for best hit to be recorded + filtevalue="0.001" + + + // Minimum Entropy Decomposition (MED) parameters for clustering (https://merenlab.org/2012/05/11/oligotyping-pipeline-explained/) + + // If you plan to do MED on ASVs using the option "--asvMED" you can set here the number of entopy peak positions or a comma seperated list of biologically meaningful positons (e.g. 35,122,21) for oligotyping to take into consideration. If you want to use a single specific position, make "asvSingle="true"". + asvC="" + asvSingle="false" + // If you plan to do MED on ASVs using the option "--aminoMED" you can set here the number of positions for oligotyping to take into consideration. If you want to use a single specific position, make "aminoSingle="true"". + aminoC="" + aminoSingle="false" + // Counts table generation parameters // Percent similarity to use for ASV/cASV counts table generation with vsearch - asvcountID=".97" + asvcountID="97" // Parameters for protein counts table generation // Minimum Bitscore for counts ProtCountsBit="50" @@ -82,22 +119,41 @@ params { // Taxonomy inference parameters - // Specify name of database to use for analysis - dbname="DATABASENAME" - // Path to Directory where database is being stored - dbdir="DATABASEDIR" - // Toggle use of RefSeq header format; default is Reverence Viral DataBase (RVDB) - refseq="F" - // Set minimum bitscore for best hit in taxonomy assignment - bitscore="50" - // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment - minID="80" - // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment - minaln="30" + //Parameters for diamond command + // Set which measurement to use for a minimum threshold in taxonomy inference - must be either "evalue" or "bitscore" + measurement="bitscore" + // Set maximum e-value for hits to be counted + evalue="0.001" + // Set minimum bitscore for best hit in taxonomy assignment (default = 30) + bitscore="30" + // Set minimum percent amino acid similarity for best hit to be counted in taxonomy assignment + minID="40" + // Set minimum amino acid alignment length for best hit to be counted in taxonomy assignment + minaln="30" + // Set sensitivity parameters for DIAMOND aligner (read more here: https://github.com/bbuchfink/diamond/wiki; default = ultra-sensitive) + sensitivity="ultra-sensitive" + + // Database information + // Specify name of database to use for analysis + dbname="DATABASENAME" + // Path to Directory where database is being stored - vAMPirus will look here to make sure 
the database with the name provided above is present and built
+ dbdir="DATABASEDIR"
+ // Set database type (NCBI or RVDB). Lets vAMPirus know which sequence header format is being used and must be set to NCBI when using RefSeq or Non-Redundant databases. -> dbtype="NCBI" to toggle use of RefSeq header format; set to "RVDB" to signal the use of Reference Viral DataBase (RVDB) headers (see manual)
+ dbtype="TYPE"
+ // Classification settings - if planning on inferring LCA from RVDB annotation files OR using NCBI taxonomy files, confirm options below are accurate.
+ // Path to directory RVDB hmm annotation .txt file - see manual for information on this. Leave as is if not planning on using RVDB LCA.
+ dbanno="DATABASEANNOT"
+ // Set lca="T" if you would like to add "Least Common Ancestor" classifications to taxonomy results using information provided by RVDB annotation files (works when using NCBI or RVDB databases) - example: "ASV1, Viruses::Duplodnaviria::Heunggongvirae::Peploviricota::Herviviricetes::Herpesvirales::Herpesviridae::Gammaherpesvirinae::Macavirus"
+ lca="LCA"
+ // DIAMOND taxonomy inference using NCBI taxmap files (can be downloaded using the startup script using the option -t); set to "true" for this to run (ONLY WORKS WITH dbtype="NCBI")
+ ncbitax="false"
// Phylogeny analysis parameters
+ // Color nodes on phylogenetic tree in Analyze report with MED Group information (nodeCol="MED") or taxonomy (nodeCol="TAX") hit. If you would like nodes colored by sequence ID, leave nodeCol="" below.
+ nodeCol=""
+
// Customs options for IQ-TREE (Example: "-option1 A -option2 B -option3 C -option4 D")
iqCustomnt=""
iqCustomaa=""
@@ -114,19 +170,21 @@ params {
boots="1000"
// Stats options
- // Tell vAMPirus to perform statistical analyses by setting "stats="run"" below or in the launch command by adding "--stats run" to it
- stats=false
+ // Tell vAMPirus to perform statistical analyses by setting "stats = true" below or in the launch command by adding "--stats" to it
+ stats = false
// Minimum number of hit counts for a sample to have to be included in the downstream statistical analyses and report generation
minimumCounts="1000"
// Maximum number of iteration performed by metaMDS
trymax="900"
+ // Conda env PATH (added automatically by startup script)
+ condaDir="CONDADIR"
/*
-// ------------------------------------ STOP ------------------------------------ //
-// ---------------------- Do not modify variables below this line. ---------------------- //
-// ------------------------- Proceed to modify processes at end ------------------------- //
-// ------------------------------- If needed ------------------------------- //
+// ---------------------------------------------------------------------- STOP ---------------------------------------------------------------------- //
+// -------------------------------------------------------- Do not modify variables below this line.
-------------------------------------------------------- //
+// ----------------------------------------------------------- Proceed to modify processes at end ----------------------------------------------------------- //
+// ----------------------------------------------------------------- If needed ----------------------------------------------------------------- //
*/
// Path to vAMPirus installation directory, will be filled automatically when startup script is run, otherwise, edit below
@@ -137,30 +195,39 @@ params {
// Manadotory arguments
Analyze=false
DataCheck=false
-// Clustering options
+// Non-Mandatory options
// Cluster nucleotide sequences (ncASVs)
ncASV = false
// Cluster by aminoacid translations and generate protein-based OTUs (pcASVs)
pcASV = false
+ // Generate virus types with MED of ASV sequences
+ asvMED = false
+ // Generate virus types with MED of AminoType sequences
+ aminoMED = false
+ // Filter ASVs
+ filter = false
// Skip options
// Skip all Read Processing steps
- skipReadProcessing=false
+ skipReadProcessing = false
// Skip quality control processes only
skipFastQC = false
// Skip adapter removal process only
- skipAdapterRemoval=false
+ skipAdapterRemoval = false
// Skip primer removal process only
- skipPrimerRemoval=false
+ skipPrimerRemoval = false
// Skip AminoTyping
- skipAminoTyping=false
+ skipAminoTyping = false
// Skip Taxonomy
- skipTaxonomy=false
+ skipTaxonomy = false
// Skip phylogeny
skipPhylogeny = false
// Skip EMBOSS analyses
skipEMBOSS = false
// Skip Reports
skipReport = false
+ // Skip Merging steps -> will also skip all read processing
+ skipMerging = false
+
// Data check parameters
datacheckntIDlist=".55,.65,.75,.80,.81,.82,.83,.84,.85,.86,.87,.88,.89,.90,.91,.92,.93,.94,.95,.96,.97,.98,.99"
datacheckaaIDlist=".55,.65,.75,.80,.81,.82,.83,.84,.85,.86,.87,.88,.89,.90,.91,.92,.93,.94,.95,.96,.97,.98,.99,1.0"
@@ -253,7 +320,7 @@ profiles {
params.condaActivate = true
// cache for condaEnv created individually
conda.cacheDir = "${params.localCacheDir}/condaEnv/"
- process.conda = "CONDADIR"
+ process.conda = "${params.condaDir}"
}
docker {
docker.enabled = true
@@ -283,6 +350,6 @@ manifest {
author = 'Alex J.
Veglia,Ramón Rivera-Vicéns'
description = 'Automated virus amplicon sequencing analysis program'
mainScript = 'vAMPirus.nf'
- nextflowVersion = '>=20.06.0'
- version = '1.0.1'
+ nextflowVersion = '>=21.04.1'
+ version = '2.0.0'
}
diff --git a/vampirus_env.yml b/vampirus_env.yml
index 95423f3..cb10503 100644
--- a/vampirus_env.yml
+++ b/vampirus_env.yml
@@ -9,16 +9,16 @@ channels:
- genomedk
dependencies:
- python=3.6
- - diamond=0.9.30
+ - blast=2.11.0
+ - diamond=2.0.11
- fastqc=0.11.9
- fastp=0.20.1
- clustalo=1.2.4
- iqtree=2.0.3
- modeltest-ng=0.1.6
- - mafft=7.464
- vsearch=2.14.2
- biopython=1.76
- - bbmap=38.79
+ - bbmap=38.90
- trimal=1.4.1
- cd-hit=4.8.1
- emboss=6.5.7.0
@@ -36,3 +36,5 @@ dependencies:
- bioconductor-biocparallel=1.24.0
- pigz=2.4
- r-biocmanager=1.30.10
+ - pip:
+ - oligotyping
diff --git a/vampirus_startup.sh b/vampirus_startup.sh
index 69db61d..bb16b21 100644
--- a/vampirus_startup.sh
+++ b/vampirus_startup.sh
@@ -24,15 +24,18 @@ vampirus_startup.sh -h [-d 1|2|3|4] [-s]
[ -s ] Set this option to skip conda installation and environment set up (you can use if you plan to run with Singularity and the vAMPirus Docker container)
+ [ -t ] Set this option to download NCBI taxonomy files needed for DIAMOND to assign taxonomic classification to sequences (works with NCBI type databases only, see manual for more information)
+
"
}
-while getopts "hsd:" OPTION; do
+while getopts "hstd:" OPTION; do
case $OPTION in
h) usage; exit;;
d) DATABASE=${OPTARG};;
s) CONDA="no";;
+ t) TAX="yes";;
esac
done
shift $((OPTIND-1)) # required, to "eat" the options that have been processed
@@ -143,7 +146,7 @@ nextflow_c() {
case $ans in
[yY] | [yY][eE][sS])
echo "Awesome,starting Nextflow installation now ..."
- curl -s https://get.nextflow.io | bash
+ curl -fsSL https://get.nextflow.io | bash
echo "Nextflow installation finished, execultable in "$mypwd""
;;
[nN] | [nN][oO])
@@ -170,11 +173,12 @@ else
echo "Alright, lets check your system for Conda..."
conda_c
echo "Editing path to conda directory in vampirus.config"
- environment="$(conda env list | sed 's/*//g' | grep "vAMPirus" | head -1 | awk '{print $2}')"
+ environment="$(conda info -e | awk '$1 == "vAMPirus" {print $2}')"
sed "s|CONDADIR|${environment}|g" "$mypwd"/vampirus.config > tmp1.config
cat tmp1.config > "$mypwd"/vampirus.config
rm tmp1.config
fi
+
echo "-------------------------------------------------------------------------------- Conda check/install done"
echo "Now lets check the status of Nextflow on your system..."
@@ -186,15 +190,20 @@ if [[ $DATABASE -eq 1 ]]
then mkdir "$mypwd"/Databases
cd "$mypwd"/Databases
dir="$(pwd)"
- echo "Database installation: RVDB version 20.0 (latest as of 2020-09)"
- curl -o U-RVDBv20.0-prot.fasta.bz2 https://rvdb-prot.pasteur.fr/files/U-RVDBv20.0-prot.fasta.bz2
- bunzip2 U-RVDBv20.0-prot.fasta.bz2
+ echo "Database installation: RVDB version 21.0 (latest as of 2021-02)"
+ curl -o U-RVDBv21.0-prot.fasta.xz https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot.fasta.xz
+ xz -d U-RVDBv21.0-prot.fasta.xz
+ curl -o U-RVDBv21.0-prot-hmm-txt.zip https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot-hmm-txt.zip
+ unzip U-RVDBv21.0-prot-hmm-txt.zip
+ mv annot ./RVDBannot/
echo "Editing confiration file for you now..."
@@ -186,15 +190,20 @@ if [[ $DATABASE -eq 1 ]]
then    mkdir "$mypwd"/Databases
        cd "$mypwd"/Databases
        dir="$(pwd)"
-       echo "Database installation: RVDB version 20.0 (latest as of 2020-09)"
-       curl -o U-RVDBv20.0-prot.fasta.bz2 https://rvdb-prot.pasteur.fr/files/U-RVDBv20.0-prot.fasta.bz2
-       bunzip2 U-RVDBv20.0-prot.fasta.bz2
+       echo "Database installation: RVDB version 21.0 (latest as of 2021-02)"
+       curl -o U-RVDBv21.0-prot.fasta.xz https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot.fasta.xz
+       xz -d U-RVDBv21.0-prot.fasta.xz
+       curl -o U-RVDBv21.0-prot-hmm-txt.zip https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot-hmm-txt.zip
+       unzip U-RVDBv21.0-prot-hmm-txt.zip
+       mv annot ./RVDBannot/
        echo "Editing configuration file for you now..."
-       sed 's/DATABASENAME/U-RVDBv19.0-prot.fasta/g' "$mypwd"/vampirus.config > tmp1.config
+       sed 's/DATABASENAME/U-RVDBv21.0-prot.fasta/g' "$mypwd"/vampirus.config > tmp1.config
        sed "s|DATABASEDIR|${dir}|g" tmp1.config > tmp2.config
+       sed "s|DATABASEANNOT|${dir}/RVDBannot|g" tmp2.config | sed 's/TYPE/RVDB/g' > tmp3.config
        rm tmp1.config
-       cat tmp2.config > "$mypwd"/vampirus.config
        rm tmp2.config
+       cat tmp3.config > "$mypwd"/vampirus.config
+       rm tmp3.config
        echo "Database downloaded and configuration file edited, you should confirm the path and database name was set correctly in the config file."
elif [[ $DATABASE -eq 2 ]]
then    mkdir "$mypwd"/Databases
@@ -202,10 +211,21 @@ then    mkdir "$mypwd"/Databases
        dir="$(pwd)"
        echo "Database installation: Viral RefSeq database version 2.0 (latest as of 2020-07)"
        curl -o viral.2.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.protein.faa.gz
+       curl -o viral.1.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz
+       curl -o viral.3.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.3.protein.faa.gz
+       gunzip viral.1.protein.faa.gz
+       cat viral.1.protein.faa >> complete_virus_refseq_prot.fasta
        gunzip viral.2.protein.faa.gz
+       cat viral.2.protein.faa >> complete_virus_refseq_prot.fasta
+       gunzip viral.3.protein.faa.gz
+       cat viral.3.protein.faa >> complete_virus_refseq_prot.fasta
+       rm viral.*.protein.faa
+       curl -o U-RVDBv21.0-prot-hmm-txt.zip https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot-hmm-txt.zip
+       unzip U-RVDBv21.0-prot-hmm-txt.zip
+       mv annot ./RVDBannot
        echo "Editing configuration file for you now..."
-       sed 's/DATABASENAME/viral.2.protein.faa/g' "$mypwd"/vampirus.config > tmp1.config
-       sed "s|DATABASEDIR|${dir}|g" tmp1.config > tmp2.config
+       sed 's/DATABASENAME/complete_virus_refseq_prot.fasta/g' "$mypwd"/vampirus.config > tmp1.config
+       sed "s|DATABASEDIR|${dir}|g" tmp1.config | sed "s|DATABASEANNOT|${dir}/RVDBannot|g" | sed 's/TYPE/NCBI/g' > tmp2.config
        rm tmp1.config
        cat tmp2.config > "$mypwd"/vampirus.config
        rm tmp2.config
@@ -217,9 +237,12 @@ then    mkdir "$mypwd"/Databases
        echo "Database installation: NCBI NR protein database (should be the most up to date at time of running this script)"
        curl -o NCBI_nr_proteindb.faa.gz https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
        gunzip NCBI_nr_proteindb.faa.gz
+       curl -o U-RVDBv21.0-prot-hmm-txt.zip https://rvdb-prot.pasteur.fr/files/U-RVDBv21.0-prot-hmm-txt.zip
+       unzip U-RVDBv21.0-prot-hmm-txt.zip
+       mv annot ./RVDBannot
        echo "Editing configuration file for you now..."
        sed 's/DATABASENAME/NCBI_nr_proteindb.faa/g' "$mypwd"/vampirus.config > tmp1.config
-       sed "s|DATABASEDIR|${dir}|g" tmp1.config > tmp2.config
+       sed "s|DATABASEDIR|${dir}|g" tmp1.config | sed "s|DATABASEANNOT|${dir}/RVDBannot|g" | sed 's/TYPE/NCBI/g' > tmp2.config
        rm tmp1.config
        cat tmp2.config > "$mypwd"/vampirus.config
        rm tmp2.config
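Each database branch above drives the same placeholder substitution in vampirus.config. A summary sketch of the mapping; the directory values come from the `dir="$(pwd)"` lines and are shown here as an example path:

```bash
# Placeholder -> value written by the startup script (example paths assumed):
#   -d 1 (RVDB):    DATABASENAME  -> U-RVDBv21.0-prot.fasta            TYPE -> RVDB
#   -d 2 (RefSeq):  DATABASENAME  -> complete_virus_refseq_prot.fasta  TYPE -> NCBI
#   -d 3 (nr):      DATABASENAME  -> NCBI_nr_proteindb.faa             TYPE -> NCBI
#   all options:    DATABASEDIR   -> /path/to/vAMPirus/Databases
#                   DATABASEANNOT -> /path/to/vAMPirus/Databases/RVDBannot
```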
@@ -231,6 +254,8 @@ then    mkdir "$mypwd"/Databases
        echo "Database installation: We want 'em all! Might take a little while...."
        curl -o NCBI_nr_proteindb.faa.gz https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
        curl -o viral.2.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.protein.faa.gz
+       curl -o viral.1.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz
+       curl -o viral.3.protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.3.protein.faa.gz
        curl -o U-RVDBv19.0-prot.fasta.bz2 https://rvdb-prot.pasteur.fr/files/U-RVDBv19.0-prot.fasta.bz2
        sed "s|DATABASEDIR|${dir}|g" "$mypwd"/vampirus.config > tmp1.config
        cat tmp1.config > "$mypwd"/vampirus.config
@@ -240,60 +265,73 @@ elif [[ $DATABASE != "" ]]
then    echo "Error: Database download signaled but not given a value between 1-4"
        exit 1
fi
+
+if [[ "$TAX" == "yes" ]]
+then    mkdir "$mypwd"/Databases/NCBItaxonomy
+       cd "$mypwd"/Databases/NCBItaxonomy
+       curl -o prot.accession2taxid.FULL.gz ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
+       echo "Gunzipping accession2taxid map, might take a moment.."
+       gunzip prot.accession2taxid.FULL.gz
+       curl -o taxdmp.zip ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
+       unzip taxdmp.zip
+fi
+
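The files fetched by `-t` are the standard inputs DIAMOND accepts when building a taxonomy-aware index: the accession-to-taxid map plus the nodes/names dumps inside taxdmp.zip. A sketch of the kind of `diamond makedb` call these files feed, using DIAMOND's documented taxonomy flags; the database and output names are placeholders:

```bash
# Sketch only: attach NCBI taxonomy to a DIAMOND database (names are placeholders)
diamond makedb --in NCBI_nr_proteindb.faa -d nr_with_tax \
    --taxonmap Databases/NCBItaxonomy/prot.accession2taxid.FULL.gz \
    --taxonnodes Databases/NCBItaxonomy/nodes.dmp \
    --taxonnames Databases/NCBItaxonomy/names.dmp
```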
echo "-------------------------------------------------------------------------------- Database loop done"

cd "$mypwd"

-echo "Ok, everything downloaded. To test installation, check out the STARTUP_HELP.txt file within "$mypwd" for instructions for testing the installation and running vAMPirus with your own data."
+echo "Ok, everything downloaded. To test installation, check out the EXAMPLE_COMMANDS.txt file within "$mypwd" for instructions for testing the installation and running vAMPirus with your own data."

chmod a+x "$mypwd"/bin/virtualribosomev2/*
+chmod a+x "$mypwd"/bin/muscle5.0.1278_linux64
+

-if [[ $(ls "$mypwd" | grep -wc "STARTUP_HELP.txt") -eq 0 ]]
+if [[ $(ls "$mypwd" | grep -wc "EXAMPLE_COMMANDS.txt") -eq 0 ]]
then
-    touch STARTUP_HELP.txt
-    echo "-------------------------------------------------------------------------------------------------------------------------------- TESTING YOUR INSTALLATION; be sure to run from inside the vAMPirus program directory" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Ok, everything downloaded. To test installation, run the following commands and check for errors (be sure to run from inside the vAMPirus program directory):" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Checking DataCheck mode:" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile conda,test --DataCheck" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Or if you plan to run vAMPirus using Singularity, use this test command:" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile singularity,test --DataCheck" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Next, test the analysis pipeline:" >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --stats" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Or if you plan to run vAMPirus using Singularity, use this test command:" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats" >> STARTUP_HELP.txt
-    echo "--------------------------------------------------------------------------------------------------------------------------------" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "Ok, if everything went well (green text was spit out by Nextflow), now you can move on to the fun. First, you should review the help docs and the vampirus.config in the vAMPirus directory." >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "-------------------------------------------------------------------------------------------------------------------------------- RUNNING DataCheck PIPELINE WITH YOUR DATA" >> STARTUP_HELP.txt
+    touch EXAMPLE_COMMANDS.txt
+    echo "-------------------------------------------------------------------------------------------------------------------------------- TESTING YOUR INSTALLATION; be sure to run from inside the vAMPirus program directory" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Ok, everything downloaded. To test installation, run the following commands and check for errors (be sure to run from inside the vAMPirus program directory):" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Checking DataCheck mode:" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile conda,test --DataCheck" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Or if you plan to run vAMPirus using Singularity, use this test command:" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile singularity,test --DataCheck" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Next, test the analysis pipeline:" >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile conda,test --Analyze --ncASV --pcASV --stats" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Or if you plan to run vAMPirus using Singularity, use this test command:" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile singularity,test --Analyze --ncASV --pcASV --stats" >> EXAMPLE_COMMANDS.txt
+    echo "--------------------------------------------------------------------------------------------------------------------------------" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "Ok, if everything went well (green text was spit out by Nextflow), now you can move on to the fun. First, you should review the help docs and the vampirus.config in the vAMPirus directory." >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "-------------------------------------------------------------------------------------------------------------------------------- RUNNING DataCheck PIPELINE WITH YOUR DATA" >> EXAMPLE_COMMANDS.txt
    echo "If everything looks good, here are example launch commands to submit after testing installation and editing the paths to your data and other parameters for the run in the vampirus.config file:"
-    echo " " >> STARTUP_HELP.txt
-    echo "First, run the DataCheck part of the pipeline using the -with-conda Nextflow option:" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -with-conda "$environment" --DataCheck" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "OR using -profile option of Nextflow ..." >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile [conda|singularity] --DataCheck" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "--------------------------------------------------------------------------------------------------------------------------------" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "-------------------------------------------------------------------------------------------------------------------------------- RUNNING Analyze PIPELINE WITH YOUR DATA" >> STARTUP_HELP.txt
-    echo "Then you can run the analysis using the -with-conda Nextflow option, here is a launch command to run the complete analysis and statistical tests:" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -with-conda "$environment" --Analyze --ncASV --pcASV --stats" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "OR same command using -profile option of Nextflow ..." >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats" >> STARTUP_HELP.txt
-    echo " " >> STARTUP_HELP.txt
-    echo "--------------------------------------------------------------------------------------------------------------------------------" >> STARTUP_HELP.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "First, run the DataCheck part of the pipeline using the -with-conda Nextflow option:" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -with-conda "$environment" --DataCheck" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "OR using -profile option of Nextflow ..." >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile [conda|singularity] --DataCheck" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "--------------------------------------------------------------------------------------------------------------------------------" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "-------------------------------------------------------------------------------------------------------------------------------- RUNNING Analyze PIPELINE WITH YOUR DATA" >> EXAMPLE_COMMANDS.txt
+    echo "Then you can run the analysis using the -with-conda Nextflow option, here is a launch command to run the complete analysis and statistical tests:" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -with-conda "$environment" --Analyze --ncASV --pcASV --stats" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "OR same command using -profile option of Nextflow ..." >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo ""$mypwd"/nextflow run "$mypwd"/vAMPirus.nf -c "$mypwd"/vampirus.config -profile [conda|singularity] --Analyze --ncASV --pcASV --stats" >> EXAMPLE_COMMANDS.txt
+    echo " " >> EXAMPLE_COMMANDS.txt
+    echo "--------------------------------------------------------------------------------------------------------------------------------" >> EXAMPLE_COMMANDS.txt
fi

echo " "
echo "Setup script is complete!"
-echo "Check out the STARTUP_HELP.txt file for more information on how to move forward with the analysis."
+echo "Check out the EXAMPLE_COMMANDS.txt file for more information on how to move forward with the analysis."