Merge pull request #103 from EBI-Metagenomics/dev
Adjustment of the voting scheme and updated viral taxonomy and bugfixes for new release
mberacochea authored Jul 17, 2023
2 parents aaf02d6 + 6078c04 commit 1207242
Showing 20 changed files with 1,393 additions and 127 deletions.
34 changes: 19 additions & 15 deletions README.md
@@ -1,4 +1,3 @@
![](https://img.shields.io/badge/CWL-1.2-green)
![](https://img.shields.io/badge/nextflow-22.04.5-brightgreen)
![](https://img.shields.io/badge/uses-docker-blue.svg)
![](https://img.shields.io/badge/uses-singularity-red.svg)
@@ -7,7 +6,7 @@

1. [ The VIRify pipeline ](#virify)
2. [ Nextflow execution ](#nf)
3. [ CWL execution ](#cwl)
3. [ CWL execution (discontinued) ](#cwl)
4. [ Pipeline overview ](#overview)
5. [ Detour: Metatranscriptomics ](#metatranscriptome)
6. [ Resources ](#resources)
@@ -16,12 +15,14 @@
<a name="virify"></a>

# VIRify
![Sankey plot](nextflow/figures/sankey.png)
![Sankey plot](nextflow/figures/2023-sankey-neto.png)

## General
VIRify is a recently developed pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and [referred to as ViPhOGs](https://doi.org/10.3390/v13061164).
VIRify is a pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by [MGnify](https://www.ebi.ac.uk/metagenomics/). VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and [referred to as ViPhOGs](https://doi.org/10.3390/v13061164).

The pipeline is implemented and available in [CWL](#cwl) and [Nextflow](#nf). You only need [CWL](#cwl) or [Nextflow](#nf) to run the pipeline (plus Docker or Singularity, as described below). For local execution and on HPCs we recommend the usage of [Nextflow](#nf). Details about installation and usage are given below.
The pipeline is implemented in [Nextflow](#nf); besides Nextflow, you only need Docker or Singularity to run VIRify. Details about installation and usage are given below.

**Please note** that until v1.0 the pipeline was also implemented in [CWL](#cwl) as an alternative to [Nextflow](#nf). However, later updates were included only in the [Nextflow](#nf) version of the pipeline.


<a name="nf"></a>
@@ -110,9 +111,6 @@ Don't forget, especially on an HPC, to define further important parameters such

The engine `conda` is not working at the moment, and won't be until there is a conda recipe for PPR-Meta or we switch the tool. Sorry. Use Docker. Or Singularity. Please. Or install PPR-Meta by yourself and then use the `conda` profile (not recommended).


<a name="cwl"></a>

## Monitoring

<img align="right" width="400px" src="figures/tower.png" alt="Monitoring with Nextflow Tower" />
@@ -147,9 +145,9 @@ You can also directly enter your access token here instead of generating the abo

### GFF output files

The outputs generated from viral prediction tools, ViPhOG annotation, taxonomy assign, and CheckV quality are integrated and summarized in a validated gff file. You can find such output on the 08-final/gff/ folder.
The outputs generated from the viral prediction tools, ViPhOG annotation, taxonomic assignment, and CheckV quality assessment are integrated and summarized in a validated gff file. You can find this output in the `08-final/gff/` folder.

The labels used in the Type column of the gff file corresponds to the following nomenclature according to the [Sequence Ontology resource](http://www.sequenceontology.org/browser/current_svn/term/SO:0000001):
The labels used in the Type column of the gff file correspond to the following nomenclature according to the [Sequence Ontology resource](http://www.sequenceontology.org/browser/current_svn/term/SO:0000001):

| Type in gff file | Sequence ontology ID |
| ------------- | ------------- |
@@ -159,11 +157,15 @@ The labels used in the Type column of the gff file corresponds to the following

Note that CDS are reported only when a ViPhOG match has been found.
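For downstream work, the CDS rows of such a GFF file can be pulled out with a few lines of plain Python. This is only an illustrative sketch: the example record and the `viphog` attribute key below are made-up assumptions, not guaranteed pipeline output.

```python
# Minimal sketch: list CDS features from a VIRify-style GFF3 file.
# The example record and the "viphog" attribute key are assumptions.

def parse_gff_cds(lines):
    """Yield (seqid, start, end, attributes) for each CDS row of a GFF3 stream."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            # column 9 holds ";"-separated key=value attribute pairs
            attrs = dict(
                field.split("=", 1) for field in cols[8].split(";") if "=" in field
            )
            yield cols[0], int(cols[3]), int(cols[4]), attrs

gff = [
    "##gff-version 3",
    "contig_1\tVIRify\tCDS\t10\t400\t.\t+\t0\tID=cds_1;viphog=ViPhOG1234",
]
for seqid, start, end, attrs in parse_gff_cds(gff):
    print(seqid, start, end, attrs.get("viphog"))  # contig_1 10 400 ViPhOG1234
```

In real use the `lines` argument would be an open file handle over the GFF in `08-final/gff/`.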

# Common Workflow Language
VIRify was implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/).

<a name="cwl"></a>

# Common Workflow Language (discontinued)

**Until VIRify v1.0**, VIRify was also implemented in [Common Workflow Language (CWL)](https://www.commonwl.org/) alongside the Nextflow implementation, and both workflow management systems were supported.

## What do I need?
The current implementation uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.
The CWL implementation (up to VIRify v1.0) uses CWL version 1.2. It was tested with Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

## How?
For instructions go to the [CWL README](cwl/README.md).
@@ -196,12 +198,14 @@ Although VIRify has been benchmarked and validated with metagenomic data in mind

Additional material (assemblies used for benchmarking in the paper, ...) as well as the ViPhOG HMMs with model-specific bit score thresholds used in VIRify are available at [osf.io/fbrxy](https://osf.io/fbrxy/).

Here, we also list databases used and automatically downloaded by the pipeline (in v1.0.0) when it is first run. We deposited database files on a separate FTP to ensure their accessibility. The files can be also downloaded manually and then used as an input for the pipeline to prevent the auto-download (see `--help` in the Nextflow pipeline).
Here, we also list the databases used and automatically downloaded by the pipeline **(in v2.0.0)** when it is first run. We deposited the database files on a separate FTP server to ensure their accessibility. The files can also be downloaded manually and then used as input for the pipeline to prevent the auto-download (see `--help` in the Nextflow pipeline).

### Virus-specific protein profile HMMs

* **ViPhOGs** (mandatory, used for taxonomy assignment)
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpHMM_database_v3.tar.gz`
* Additional metadata file for filtering the ViPhOGs (according to taxonomy updates by the [ICTV](https://ictv.global/taxonomy))
* `wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/additional_data_vpHMMs_v4.tsv`
* [Publication](https://www.mdpi.com/1999-4915/13/6/1164)
* **pVOGs** (optional)
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/pvogs.tar.gz`
Expand Down Expand Up @@ -234,7 +238,7 @@ Here, we also list databases used and automatically downloaded by the pipeline (
### Taxonomy annotation

* **NCBI taxonomy**
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2020-07-01_ete3_ncbi_tax.sqlite.gz`
* `wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2022-11-01_ete3_ncbi_tax.sqlite.gz`

### Additional blast-based assignment (optional, super slow)

154 changes: 95 additions & 59 deletions bin/contig_taxonomic_assign.py
@@ -2,31 +2,64 @@

import argparse
import csv
import glob
import operator
import os
import re
import sys
import math
from collections import Counter

import pandas as pd
from ete3 import NCBITaxa


def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
def main(args):
"""Generate tabular file with taxonomic assignment of viral contigs based on ViPhOG annotations"""
input_df = pd.read_csv(args.input_file, sep="\t")

factor_file = args.factor

with open(factor_file, newline="") as infile:
csv_reader = csv.reader(infile)
next(csv_reader)
factor_dict = {
name: {
"avg_cds": float(avg_cds),
"std_cds": float(std_cds),
"factor": float(mult_factor),
}
for name, *_, avg_cds, std_cds, _, mult_factor in csv_reader
}

file_header = ["contig_ID", "genus", "subfamily", "family", "order", "class"]

output_gen = contig_tax(input_df, args.ncbi_db, args.tax_thres, factor_dict)

print(args.input_file)

out_file = re.split(r"\.[a-z]+$", os.path.basename(args.input_file))[0]

with open(
os.path.join(args.outdir, out_file + "_taxonomy.tsv"), "w", newline=""
) as output_file:
tsv_writer = csv.writer(output_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
tsv_writer.writerow(file_header)
for item in output_gen:
tsv_writer.writerow(item)
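The factor-file parsing in `main` can be exercised in isolation. The column layout below is inferred from the tuple unpacking in the script, and the values are invented, so treat this only as a sketch:

```python
import csv
import io

# Hypothetical factor file: header row plus one data row. The unpacking
# mirrors `name, *_, avg_cds, std_cds, _, mult_factor` from the script,
# i.e. first column is the taxon name and the last four columns are
# avg_cds, std_cds, an ignored field, and the multiplication factor.
text = (
    "taxon,extra1,extra2,avg_cds,std_cds,skip,factor\n"
    "Caudovirales,x,y,45.2,10.1,z,1.5\n"
)
reader = csv.reader(io.StringIO(text))
next(reader)  # skip the header, as the script does
factor_dict = {
    name: {
        "avg_cds": float(avg_cds),
        "std_cds": float(std_cds),
        "factor": float(mult_factor),
    }
    for name, *_, avg_cds, std_cds, _, mult_factor in reader
}
print(factor_dict["Caudovirales"]["factor"])  # 1.5
```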


def contig_tax(annot_df, ncbi_db, tax_thres, taxon_factor_dict):
"""This function takes the annotation table generated by viral_contig_maps.py and generates a table that
provides the taxonomic lineage of each viral contig, based on the corresponding ViPhOG annotations"""

ncbi = NCBITaxa(dbfile=ncbi_db)
tax_rank_order = ["genus", "subfamily", "family", "order"]
tax_rank_order = ["genus", "subfamily", "family", "order", "class"]
contig_set = set(annot_df["Contig"])

for contig in contig_set:
contig_lineage = [contig]
contig_df = annot_df[annot_df["Contig"] == contig]
total_prot = len(contig_df)
annot_prot = sum(contig_df["Best_hit"] != "No hit")
if annot_prot < prop_annot * total_prot:
if annot_prot == 0:
contig_lineage.extend([""] * 4)
else:
contig_hits = contig_df[pd.notnull(contig_df["Label"])]["Label"].values
@@ -35,49 +68,74 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
]
hit_lineages = [
{
y: x
y: ncbi.get_taxid_translator([x])[x]
for x, y in ncbi.get_rank(ncbi.get_lineage(item)).items()
if y in tax_rank_order
if y in tax_rank_order[:-1]
}
for item in taxid_list
]
for rank in tax_rank_order:
for rank in tax_rank_order[:-1]:
taxon_list = [item.get(rank) for item in hit_lineages]
total_hits = sum(pd.notnull(taxon_list))
if total_hits < min_prot:
if total_hits == 0:
contig_lineage.append("")
continue
else:
count_hits = Counter(
[item for item in taxon_list if pd.notnull(item)]
)
best_hit = sorted(
[(x, y) for x, y in count_hits.items()],
key=lambda x: x[1],
reverse=True,
)[0]
prop_hits = best_hit[1] / total_hits
if prop_hits < tax_thres:
contig_lineage.append(prop_hits)
continue
under_thres = []
over_thres = []
for hit_taxon, hit_count in count_hits.items():
prop_hits = hit_count / total_hits
taxon_factor = (
taxon_factor_dict[hit_taxon]["factor"]
if hit_taxon in taxon_factor_dict.keys()
else 1
)
if prop_hits < tax_thres * taxon_factor:
hit_bound = math.ceil(tax_thres * taxon_factor * total_hits)
hit_diff = hit_bound - hit_count
under_thres.append((hit_taxon, hit_diff))
else:
over_thres.append((hit_taxon, prop_hits))
if len(over_thres) == 0:
best_under = sorted(under_thres, key=lambda x: x[1])[0]
contig_lineage.append(best_under[1])
else:
best_lineage = ncbi.get_lineage(best_hit[0])
contig_lineage.extend(
[
ncbi.get_taxid_translator([key])[key]
if pd.notnull(key)
else ""
for key in [
{
y: x
for x, y in ncbi.get_rank(best_lineage).items()
}.get(item)
sorted_over_thres = [
x
for x, y in sorted(
over_thres, key=lambda x: x[1], reverse=True
)
]
for taxon in sorted_over_thres:
if (
taxon in taxon_factor_dict.keys()
and total_prot
<= taxon_factor_dict[taxon].get("avg_cds")
+ 2 * taxon_factor_dict[taxon].get("std_cds")
):
taxon_taxid = ncbi.get_name_translator([taxon])[taxon][
0
]
taxon_lineage = ncbi.get_lineage(taxon_taxid)
taxon_lineage_dict = {
y: ncbi.get_taxid_translator([x])[x]
for x, y in ncbi.get_rank(taxon_lineage).items()
if y in tax_rank_order[tax_rank_order.index(rank) :]
}
taxon_lineage_list = [
taxon_lineage_dict.get(item, "")
for item in tax_rank_order[
tax_rank_order.index(rank) :
]
]
]
)
break
else:
contig_lineage.append("")
continue
contig_lineage.extend(taxon_lineage_list)
break
yield contig_lineage
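Stripped of the lineage lookups and the CDS-count plausibility check, the factor-scaled voting above boils down to something like the following standalone sketch (taxa names, counts, and factor values are hypothetical):

```python
from collections import Counter

def vote(taxa, tax_thres=0.6, factors=None):
    """Return the taxon whose vote share clears its factor-scaled threshold,
    preferring the highest share; None if no taxon clears its threshold."""
    factors = factors or {}
    counts = Counter(taxa)
    total = len(taxa)
    # a taxon passes when its share >= tax_thres * taxon-specific factor
    over = [
        (taxon, n / total)
        for taxon, n in counts.items()
        if n / total >= tax_thres * factors.get(taxon, 1)
    ]
    return max(over, key=lambda t: t[1])[0] if over else None

# 3 of 4 hits agree: share 0.75 >= 0.6 * 1.0, so the majority taxon wins.
print(vote(["TaxonA", "TaxonA", "TaxonA", "TaxonB"]))  # TaxonA
# A stricter per-taxon factor pushes the same share under its threshold
# (0.75 < 0.6 * 1.3), so no taxon is assigned:
print(vote(["TaxonA", "TaxonA", "TaxonA", "TaxonB"], factors={"TaxonA": 1.3}))  # None
```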

@@ -101,18 +159,10 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
required=True,
)
parser.add_argument(
"--minprot",
dest="min_prot",
type=int,
help="Minimum number of proteins with ViPhOG annotations at each taxonomic level, required for taxonomic assignment (default: 2)",
default=2,
)
parser.add_argument(
"--prop",
dest="prot_prop",
type=float,
help="Minimum proportion of proteins in a contig that must have a ViPhOG annotation in order to provide a taxonomic assignment (default: 0.1)",
default=0.1,
"-f",
"--factor",
dest="factor",
help="File containing viphog-specific multiplication factors.",
)
parser.add_argument(
"--taxthres",
@@ -129,19 +179,5 @@ def contig_tax(annot_df, ncbi_db, min_prot, prop_annot, tax_thres):
default=".",
)
args = parser.parse_args()
input_df = pd.read_csv(args.input_file, sep="\t")
file_header = ["contig_ID", "genus", "subfamily", "family", "order"]
output_gen = contig_tax(
input_df, args.ncbi_db, args.min_prot, args.prot_prop, args.tax_thres
)
print(args.input_file)

out_file = re.split(r"\.[a-z]+$", os.path.basename(args.input_file))[0]

with open(
os.path.join(args.outdir, out_file + "_taxonomy.tsv"), "w", newline=""
) as output_file:
tsv_writer = csv.writer(output_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
tsv_writer.writerow(file_header)
for item in output_gen:
tsv_writer.writerow(item)
main(args)
25 changes: 0 additions & 25 deletions bin/get_taxdb_file.py

This file was deleted.

2 changes: 1 addition & 1 deletion bin/ratio_evalue_table.py
@@ -106,7 +106,7 @@ def ratio_evalue(vphmm_df, taxa_dict, evalue):
help="domtbl generated with Generate_vphmm_hmmer_matrix.py",
required=True)
parser.add_argument("-t", "--taxa", dest="taxa_tsv",
help="TSV file: additional_data_vpHMMs_v{1,2}.tsv", required=True)
help="TSV file: additional_data_vpHMMs_v{1,2,3,4}.tsv", required=True)
parser.add_argument("-o", "--outfile", dest="out_file",
help="Output table name (default: cwd)",
default=".")
8 changes: 4 additions & 4 deletions containers/r_chromomap/Dockerfile
@@ -1,13 +1,13 @@
FROM rocker/r-ver:3.5.0
FROM rocker/r-ver:4.3

LABEL base_image="rocker/verse:3.5.0"
LABEL version="0.2"
LABEL base_image="rocker/verse:4.3"
LABEL version="0.3"
LABEL about.summary="r visualization packages"
LABEL about.license="SPDX:Apache-2.0"
LABEL about.tags="r, visualization"
LABEL about.home="https://cran.r-project.org/web/packages/chromoMap/, https://cran.r-project.org/web/packages/ggplot2/, https://cran.r-project.org/web/packages/plotly/"
LABEL software="r_chromoMap"
LABEL software.version="3.15"
LABEL software.version="4.1.1"

LABEL maintainer="MGnify team https://www.ebi.ac.uk/support/metagenomics"

3 changes: 2 additions & 1 deletion nextflow.config
@@ -40,14 +40,15 @@ params {
evalue = 0.01
prop = 0.1
taxthres = 0.6
factor = "$projectDir/references/viphogs_cds_per_taxon_cummulative.csv"

sankey = 25
chunk = 10
length = 1.5 // in kb, so 0.5 filters all contigs shorter than 500 nt

// development
viphog_version = 'v3'
meta_version = 'v2'
meta_version = 'v4'

// folder structure
output = 'results'
2 changes: 1 addition & 1 deletion nextflow/configs/container.config
@@ -15,7 +15,7 @@ process {
withLabel: ruby { container = 'quay.io/microbiome-informatics/bioruby:2.0.1' }
withLabel: pprmeta { container = 'quay.io/microbiome-informatics/pprmeta:1.1' }
withLabel: prodigal { container = 'quay.io/biocontainers/prodigal:2.6.3--hec16e2b_4' }
withLabel: chromomap { container = 'quay.io/microbiome-informatics/r_chromomap:0.2' }
withLabel: chromomap { container = 'quay.io/microbiome-informatics/r_chromomap:0.3' }
withLabel: multiqc { container = 'quay.io/biocontainers/multiqc:1.9--py_1' }
withLabel: fastqc { container = 'quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1' }
withLabel: basics { container = 'quay.io/microbiome-informatics/virify-basics:1.0' }
Binary file added nextflow/figures/2023-sankey-neto.png
Binary file modified nextflow/figures/virify_fig1_workflow.png
100755 → 100644
