Merge pull request #103 from EBI-Metagenomics/dev
Adjustment of the voting scheme and updated viral taxonomy and bugfixes for new release
mberacochea authored Jul 17, 2023
2 parents aaf02d6 + 6078c04 commit 1207242
Showing 20 changed files with 1,393 additions and 127 deletions.
34 changes: 19 additions & 15 deletions README.md
@@ -1,4 +1,3 @@
![](https://img.shields.io/badge/CWL-1.2-green)
![](https://img.shields.io/badge/nextflow-22.04.5-brightgreen)
![](https://img.shields.io/badge/uses-docker-blue.svg)
![](https://img.shields.io/badge/uses-singularity-red.svg)
@@ -7,7 +6,7 @@

1. [ The VIRify pipeline ](#virify)
2. [ Nextflow execution ](#nf)
3. [ CWL execution ](#cwl)
3. [ CWL execution (discontinued) ](#cwl)
4. [ Pipeline overview ](#overview)
5. [ Detour: Metatranscriptomics ](#metatranscriptome)
6. [ Resources ](#resources)
@@ -16,12 +15,14 @@
<a name="virify"></a>

# VIRify
![Sankey plot](nextflow/figures/sankey.png)
![Sankey plot](nextflow/figures/2023-sankey-neto.png)

## General
VIRify is a recently developed pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and [referred to as ViPhOGs](https://doi.org/10.3390/v13061164).
VIRify is a pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and [referred to as ViPhOGs](https://doi.org/10.3390/v13061164).

The pipeline is implemented and available in [CWL](#cwl) and [Nextflow](#nf). You only need [CWL](#cwl) or [Nextflow](#nf) to run the pipeline (plus Docker or Singularity, as described below). For local execution and on HPCs we recommend the usage of [Nextflow](#nf). Details about installation and usage are given below.
The pipeline is implemented in [Nextflow](#nf); besides Nextflow, you only need Docker or Singularity to run VIRify. Details about installation and usage are given below.

**Please note** that until v1.0 the pipeline was also implemented in [CWL](#cwl) as an alternative to [Nextflow](#nf). However, later updates were included only in the [Nextflow](#nf) version of the pipeline.


<a name="nf"></a>
@@ -110,9 +111,6 @@ Don't forget, especially on an HPC, to define further important parameters such

The engine `conda` is not working at the moment, and won't be until there is a conda recipe for PPR-Meta or we switch the tool. Sorry. Use Docker. Or Singularity. Please. Or install PPR-Meta by yourself and then use the `conda` profile (not recommended).


<a name="cwl"></a>

## Monitoring

<img align="right" width="400px" src="figures/tower.png" alt="Monitoring with Nextflow Tower" />
@@ -147,9 +145,9 @@ You can also directly enter your access token here instead of generating the abo

### GFF output files

The outputs generated from viral prediction tools, ViPhOG annotation, taxonomy assign, and CheckV quality are integrated and summarized in a validated gff file. You can find such output on the 08-final/gff/ folder.
The outputs generated from the viral prediction tools, ViPhOG annotation, taxonomic assignment, and CheckV quality assessment are integrated and summarized in a validated gff file. You can find this output in the `08-final/gff/` folder.

The labels used in the Type column of the gff file corresponds to the following nomenclature according to the [Sequence Ontology resource](http://www.sequenceontology.org/browser/current_svn/term/SO:0000001):
The labels used in the Type column of the gff file correspond to the following nomenclature according to the [Sequence Ontology resource](http://www.sequenceontology.org/browser/current_svn/term/SO:0000001):

| Type in gff file | Sequence ontology ID |
| ------------- | ------------- |
@@ -159,11 +157,15 @@ The labels used in the Type column of the gff file corresponds to the following

Note that CDS are reported only when a ViPhOG match has been found.
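For downstream work, the CDS rows of such a GFF file can be pulled out with a few lines of plain Python. This is only an illustrative sketch: the example record and the `viphog` attribute key below are made-up assumptions, not guaranteed pipeline output.

```python
# Minimal sketch: list CDS features from a VIRify-style GFF3 file.
# The example record and the "viphog" attribute key are assumptions.

def parse_gff_cds(lines):
    """Yield (seqid, start, end, attributes) for each CDS row of a GFF3 stream."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            # column 9 holds ";"-separated key=value attribute pairs
            attrs = dict(
                field.split("=", 1) for field in cols[8].split(";") if "=" in field
            )
            yield cols[0], int(cols[3]), int(cols[4]), attrs

gff = [
    "##gff-version 3",
    "contig_1\tVIRify\tCDS\t10\t400\t.\t+\t0\tID=cds_1;viphog=ViPhOG1234",
]
for seqid, start, end, attrs in parse_gff_cds(gff):
    print(seqid, start, end, attrs.get("viphog"))  # contig_1 10 400 ViPhOG1234
```

In real use the `lines` argument would be an open file handle over the GFF in `08-final/gff/`.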

# Common Workflow Language
VIRify was implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/).

<a name="cwl"></a>

# Common Workflow Language (discontinued)

**Until VIRify v1.0**, VIRify was also implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/) alongside the Nextflow implementation, and both workflow management systems were supported.

## What do I need?
The current implementation uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.
The CWL implementation (up to VIRify v1.0) uses CWL version 1.2. It was tested with Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

## How?
For instructions go to the [CWL README](cwl/README.md).
@@ -196,12 +198,14 @@ Although VIRify has been benchmarked and validated with metagenomic data in mind

Additional material (assemblies used for benchmarking in the paper, ...) as well as the ViPhOG HMMs with model-specific bit score thresholds used in VIRify are available at [osf.io/fbrxy](https://osf.io/fbrxy/).

Here, we also list databases used and automatically downloaded by the pipeline (in v1.0.0) when it is first run. We deposited database files on a separate FTP to ensure their accessibility. The files can be also downloaded manually and then used as an input for the pipeline to prevent the auto-download (see `--help` in the Nextflow pipeline).
Here, we also list the databases used and automatically downloaded by the pipeline **(in v2.0.0)** when it is first run. We deposited the database files on a separate FTP server to ensure their accessibility. The files can also be downloaded manually and then used as input for the pipeline to prevent the auto-download (see `--help` in the Nextflow pipeline).

### Virus-specific protein profile HMMs

* **ViPhOGs** (mandatory, used for taxonomy assignment)
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpHMM_database_v3.tar.gz`
* Additional metadata file for filtering the ViPhOGs (according to taxonomy updates by the [ICTV](https://ictv.global/taxonomy))
* `wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/additional_data_vpHMMs_v4.tsv`
* [Publication](https://www.mdpi.com/1999-4915/13/6/1164)
* **pVOGs** (optional)
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/pvogs.tar.gz`
Expand Down Expand Up @@ -234,7 +238,7 @@ Here, we also list databases used and automatically downloaded by the pipeline (
### Taxonomy annotation

* **NCBI taxonomy**
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2020-07-01_ete3_ncbi_tax.sqlite.gz`
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2022-11-01_ete3_ncbi_tax.sqlite.gz`

### Additional blast-based assignment (optional, super slow)

154 changes: 95 additions & 59 deletions bin/contig_taxonomic_assign.py
@@ -2,31 +2,64 @@

import argparse
import csv
import glob
import operator
import os
import re
import sys
import math
from collections import Counter

import pandas as pd
from ete3 import NCBITaxa


def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
def main(args):
"""Generate tabular file with taxonomic assignment of viral contigs based on ViPhOG annotations"""
input_df = pd.read_csv(args.input_file, sep="\t")

factor_file = args.factor

with open(factor_file, newline="") as infile:
csv_reader = csv.reader(infile)
next(csv_reader)
factor_dict = {
name: {
"avg_cds": float(avg_cds),
"std_cds": float(std_cds),
"factor": float(mult_factor),
}
for name, *_, avg_cds, std_cds, _, mult_factor in csv_reader
}

file_header = ["contig_ID", "genus", "subfamily", "family", "order", "class"]

output_gen = contig_tax(input_df, args.ncbi_db, args.tax_thres, factor_dict)

print(args.input_file)

out_file = re.split(r"\.[a-z]+$", os.path.basename(args.input_file))[0]

with open(
os.path.join(args.outdir, out_file + "_taxonomy.tsv"), "w", newline=""
) as output_file:
tsv_writer = csv.writer(output_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
tsv_writer.writerow(file_header)
for item in output_gen:
tsv_writer.writerow(item)
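The factor-file parsing in `main` can be exercised in isolation. The column layout below is inferred from the tuple unpacking in the script, and the values are invented, so treat this only as a sketch:

```python
import csv
import io

# Hypothetical factor file: header row plus one data row. The unpacking
# mirrors `name, *_, avg_cds, std_cds, _, mult_factor` from the script,
# i.e. first column is the taxon name and the last four columns are
# avg_cds, std_cds, an ignored field, and the multiplication factor.
text = (
    "taxon,extra1,extra2,avg_cds,std_cds,skip,factor\n"
    "Caudovirales,x,y,45.2,10.1,z,1.5\n"
)
reader = csv.reader(io.StringIO(text))
next(reader)  # skip the header, as the script does
factor_dict = {
    name: {
        "avg_cds": float(avg_cds),
        "std_cds": float(std_cds),
        "factor": float(mult_factor),
    }
    for name, *_, avg_cds, std_cds, _, mult_factor in reader
}
print(factor_dict["Caudovirales"]["factor"])  # 1.5
```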


def contig_tax(annot_df, ncbi_db, tax_thres, taxon_factor_dict):
"""This function takes the annotation table generated by viral_contig_maps.py and generates a table that
provides the taxonomic lineage of each viral contig, based on the corresponding ViPhOG annotations"""

ncbi = NCBITaxa(dbfile=ncbi_db)
tax_rank_order = ["genus", "subfamily", "family", "order"]
tax_rank_order = ["genus", "subfamily", "family", "order", "class"]
contig_set = set(annot_df["Contig"])

for contig in contig_set:
contig_lineage = [contig]
contig_df = annot_df[annot_df["Contig"] == contig]
total_prot = len(contig_df)
annot_prot = sum(contig_df["Best_hit"] != "No hit")
if annot_prot < prop_annot * total_prot:
if annot_prot == 0:
contig_lineage.extend([""] * 4)
else:
contig_hits = contig_df[pd.notnull(contig_df["Label"])]["Label"].values
@@ -35,49 +68,74 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
]
hit_lineages = [
{
y: x
y: ncbi.get_taxid_translator([x])[x]
for x, y in ncbi.get_rank(ncbi.get_lineage(item)).items()
if y in tax_rank_order
if y in tax_rank_order[:-1]
}
for item in taxid_list
]
for rank in tax_rank_order:
for rank in tax_rank_order[:-1]:
taxon_list = [item.get(rank) for item in hit_lineages]
total_hits = sum(pd.notnull(taxon_list))
if total_hits < min_prot:
if total_hits == 0:
contig_lineage.append("")
continue
else:
count_hits = Counter(
[item for item in taxon_list if pd.notnull(item)]
)
best_hit = sorted(
[(x, y) for x, y in count_hits.items()],
key=lambda x: x[1],
reverse=True,
)[0]
prop_hits = best_hit[1] / total_hits
if prop_hits < tax_thres:
contig_lineage.append(prop_hits)
continue
under_thres = []
over_thres = []
for hit_taxon, hit_count in count_hits.items():
prop_hits = hit_count / total_hits
taxon_factor = (
taxon_factor_dict[hit_taxon]["factor"]
if hit_taxon in taxon_factor_dict.keys()
else 1
)
if prop_hits < tax_thres * taxon_factor:
hit_bound = math.ceil(tax_thres * taxon_factor * total_hits)
hit_diff = hit_bound - hit_count
under_thres.append((hit_taxon, hit_diff))
else:
over_thres.append((hit_taxon, prop_hits))
if len(over_thres) == 0:
best_under = sorted(under_thres, key=lambda x: x[1])[0]
contig_lineage.append(best_under[1])
else:
best_lineage = ncbi.get_lineage(best_hit[0])
contig_lineage.extend(
[
ncbi.get_taxid_translator([key])[key]
if pd.notnull(key)
else ""
for key in [
{
y: x
for x, y in ncbi.get_rank(best_lineage).items()
}.get(item)
sorted_over_thres = [
x
for x, y in sorted(
over_thres, key=lambda x: x[1], reverse=True
)
]
for taxon in sorted_over_thres:
if (
taxon in taxon_factor_dict.keys()
and total_prot
<= taxon_factor_dict[taxon].get("avg_cds")
+ 2 * taxon_factor_dict[taxon].get("std_cds")
):
taxon_taxid = ncbi.get_name_translator([taxon])[taxon][
0
]
taxon_lineage = ncbi.get_lineage(taxon_taxid)
taxon_lineage_dict = {
y: ncbi.get_taxid_translator([x])[x]
for x, y in ncbi.get_rank(taxon_lineage).items()
if y in tax_rank_order[tax_rank_order.index(rank) :]
}
taxon_lineage_list = [
taxon_lineage_dict.get(item, "")
for item in tax_rank_order[
tax_rank_order.index(rank) :
]
]
]
)
break
else:
contig_lineage.append("")
continue
contig_lineage.extend(taxon_lineage_list)
break
yield contig_lineage
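Stripped of the lineage lookups and the CDS-count plausibility check, the factor-scaled voting above boils down to something like the following standalone sketch (taxa names, counts, and factor values are hypothetical):

```python
from collections import Counter

def vote(taxa, tax_thres=0.6, factors=None):
    """Return the taxon whose vote share clears its factor-scaled threshold,
    preferring the highest share; None if no taxon clears its threshold."""
    factors = factors or {}
    counts = Counter(taxa)
    total = len(taxa)
    # a taxon passes when its share >= tax_thres * taxon-specific factor
    over = [
        (taxon, n / total)
        for taxon, n in counts.items()
        if n / total >= tax_thres * factors.get(taxon, 1)
    ]
    return max(over, key=lambda t: t[1])[0] if over else None

# 3 of 4 hits agree: share 0.75 >= 0.6 * 1.0, so the majority taxon wins.
print(vote(["TaxonA", "TaxonA", "TaxonA", "TaxonB"]))  # TaxonA
# A stricter per-taxon factor pushes the same share under its threshold
# (0.75 < 0.6 * 1.3), so no taxon is assigned:
print(vote(["TaxonA", "TaxonA", "TaxonA", "TaxonB"], factors={"TaxonA": 1.3}))  # None
```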

@@ -101,18 +159,10 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
required=True,
)
parser.add_argument(
"--minprot",
dest="min_prot",
type=int,
help="Minimum number of proteins with ViPhOG annotations at each taxonomic level, required for taxonomic assignment (default: 2)",
default=2,
)
parser.add_argument(
"--prop",
dest="prot_prop",
type=float,
help="Minimum proportion of proteins in a contig that must have a ViPhOG annotation in order to provide a taxonomic assignment (default: 0.1)",
default=0.1,
"-f",
"--factor",
dest="factor",
help="File containing viphog-specific multiplication factors.",
)
parser.add_argument(
"--taxthres",
@@ -129,19 +179,5 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
default=".",
)
args = parser.parse_args()
input_df = pd.read_csv(args.input_file, sep="\t")
file_header = ["contig_ID", "genus", "subfamily", "family", "order"]
output_gen = contig_tax(
input_df, args.ncbi_db, args.min_prot, args.prot_prop, args.tax_thres
)
print(args.input_file)

out_file = re.split(r"\.[a-z]+$", os.path.basename(args.input_file))[0]

with open(
os.path.join(args.outdir, out_file + "_taxonomy.tsv"), "w", newline=""
) as output_file:
tsv_writer = csv.writer(output_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
tsv_writer.writerow(file_header)
for item in output_gen:
tsv_writer.writerow(item)
main(args)
25 changes: 0 additions & 25 deletions bin/get_taxdb_file.py

This file was deleted.

2 changes: 1 addition & 1 deletion bin/ratio_evalue_table.py
@@ -106,7 +106,7 @@ def ratio_evalue(vphmm_df, taxa_dict, evalue):
help="domtbl generated with Generate_vphmm_hmmer_matrix.py",
required=True)
parser.add_argument("-t", "--taxa", dest="taxa_tsv",
help="TSV file: additional_data_vpHMMs_v{1,2}.tsv", required=True)
help="TSV file: additional_data_vpHMMs_v{1,2,3,4}.tsv", required=True)
parser.add_argument("-o", "--outfile", dest="out_file",
help="Output table name (default: cwd)",
default=".")
8 changes: 4 additions & 4 deletions containers/r_chromomap/Dockerfile
@@ -1,13 +1,13 @@
FROM rocker/r-ver:3.5.0
FROM rocker/r-ver:4.3

LABEL base_image="rocker/verse:3.5.0"
LABEL version="0.2"
LABEL base_image="rocker/verse:4.3"
LABEL version="0.3"
LABEL about.summary="r visualization packages"
LABEL about.license="SPDX:Apache-2.0"
LABEL about.tags="r, visualization"
LABEL about.home="https://cran.r-project.org/web/packages/chromoMap/, https://cran.r-project.org/web/packages/ggplot2/, https://cran.r-project.org/web/packages/plotly/"
LABEL software="r_chromoMap"
LABEL software.version="3.15"
LABEL software.version="4.1.1"

LABEL maintainer="MGnify team https://www.ebi.ac.uk/support/metagenomics"

3 changes: 2 additions & 1 deletion nextflow.config
@@ -40,14 +40,15 @@ params {
evalue = 0.01
prop = 0.1
taxthres = 0.6
factor = "$projectDir/references/viphogs_cds_per_taxon_cummulative.csv"

sankey = 25
chunk = 10
length = 1.5 // in kb, so 0.5 filters all contigs shorter than 500 nt

// development
viphog_version = 'v3'
meta_version = 'v2'
meta_version = 'v4'

// folder structure
output = 'results'
2 changes: 1 addition & 1 deletion nextflow/configs/container.config
@@ -15,7 +15,7 @@ process {
withLabel: ruby { container = 'quay.io/microbiome-informatics/bioruby:2.0.1' }
withLabel: pprmeta { container = 'quay.io/microbiome-informatics/pprmeta:1.1' }
withLabel: prodigal { container = 'quay.io/biocontainers/prodigal:2.6.3--hec16e2b_4' }
withLabel: chromomap { container = 'quay.io/microbiome-informatics/r_chromomap:0.2' }
withLabel: chromomap { container = 'quay.io/microbiome-informatics/r_chromomap:0.3' }
withLabel: multiqc { container = 'quay.io/biocontainers/multiqc:1.9--py_1' }
withLabel: fastqc { container = 'quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1' }
withLabel: basics { container = 'quay.io/microbiome-informatics/virify-basics:1.0' }
Binary file added nextflow/figures/2023-sankey-neto.png
Binary file modified nextflow/figures/virify_fig1_workflow.png
100755 → 100644
