-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add VirSorter2 process as an alternative to VirSorter #128
base: master
Are you sure you want to change the base?
Conversation
Great, thanks a lot! @mberacochea @guille0387: @fischer-hub is working with me at the RKI and we anyway wanted to use VS2 for some annotations. Thus, David kindly added the functionality to VIRify. Of couse, this would need more proper benchmarking but the idea would be to swap VS against VS2 at some point. For now, there would be this parameter switch possible. |
@hoelzer @fischer-hub this is amazing, thank you so much. This has been in our backlog for ages. I will go over the PR as soon as I can, but from a quick read it looks great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fischer-hub, great stuff. I left a few comments, nothing major.
The one thing I would like to add is a few more unit tests for parse_viral_pred.py, can you send me an example of the virsorter2 outputs?
virify.nf
Outdated
viphos_annotations = annotation.out.map { _, __, annotations -> tuple(annotations, i++) }.collect(){it -> it[0]} //{ annotations, count -> "$annotations".replace('.', '_' + count + '.') } | ||
taxonomy_annotations = assign.out.map { _, __, taxonomy -> tuple(taxonomy, j++) }.collect(){it -> it[0]} //{ taxonomy, count -> "$taxonomy".replace('.', '_' + count + '.') } | ||
checkv_results = checkV.out.map { _, __, quality_summary, ___ -> tuple(quality_summary, k++) }.collect(){it -> it[0]} //{ quality_summary, count -> "$quality_summary".replace('.', '_' + count + '.') } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to support the -list of assemblies, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running on a list of samples this was my first fix to prevent the write_gff
file collision, but I think this should be sufficient as well since then the input files are enumerated and have different names too. (I just forgot to remove this afterwards, will check again)
Also this might be solved when removing the .first()
from contigs.first()
later on because then each sample will run in its own write_gff process instance? Or could there be annotation files from the same sample with the same name too, e.g.: from different contigs of the same assembly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The change on the contigs.first() it a but more invovled as we need to .join() the assemblies with the corresponding annotations. Otherwise, the pipeline could mix results from different assemblies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, true. Maybe its better then to keep the fix and handle the bug in a separate PR.
Sure, will go over the changes asap and attach some example output here! |
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
Added VirSorter2 as a new process that can be used instead of
VirSorter
using the new flag--use_virsort2
, adapted the downstream processes (theparse
process and GFF generation script) to work with the different results ofVirSorter2
. E.g.:VirSorter2
reports a confidence score (0-1) for every viral hit in the input data instead of a category.Also had to add some changes to the GFF generation process because I always ran into file collision issues when more than one of the input samples reported significant viral sequences, since the
VirSort
andVirFinder
reports have the same file name for every sample. Might also be solved by running the GFF script for every sample containing viral seqs instead of once for all samples but I already asked about this in #127, maybe I'm missing something here!