Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VirSorter2 process as an alternative to VirSorter #128

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

fischer-hub
Copy link

@fischer-hub fischer-hub commented Jul 17, 2024

Added VirSorter2 as a new process that can be used instead of VirSorter using the new flag --use_virsort2, adapted the downstream processes (the parse process and GFF generation script) to work with the different results of VirSorter2. E.g.: VirSorter2 reports a confidence score (0-1) for every viral hit in the input data instead of a category.

Also had to add some changes to the GFF generation process because I always ran into file collision issues when more than one of the input samples reported significant viral sequences, since the VirSort and VirFinder reports have the same file name for every sample. Might also be solved by running the GFF script for every sample containing viral seqs instead of once for all samples but I already asked about this in #127, maybe I'm missing something here!

@hoelzer
Copy link
Collaborator

hoelzer commented Jul 17, 2024

Great, thanks a lot!

@mberacochea @guille0387: @fischer-hub is working with me at the RKI and we anyway wanted to use VS2 for some annotations. Thus, David kindly added the functionality to VIRify. Of couse, this would need more proper benchmarking but the idea would be to swap VS against VS2 at some point.

For now, there would be this parameter switch possible.

@mberacochea
Copy link
Member

@hoelzer @fischer-hub this is amazing, thank you so much. This has been in our backlog for ages. I will go over the PR as soon as I can, but from a quick read it looks great.

Copy link
Member

@mberacochea mberacochea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fischer-hub, great stuff. I left a few comments, nothing major.
The one thing I would like to add is a few more unit tests for parse_viral_pred.py, can you send me an example of the virsorter2 outputs?

bin/parse_viral_pred.py Outdated Show resolved Hide resolved
bin/parse_viral_pred.py Outdated Show resolved Hide resolved
bin/parse_viral_pred.py Outdated Show resolved Hide resolved
bin/parse_viral_pred.py Outdated Show resolved Hide resolved
bin/parse_viral_pred.py Outdated Show resolved Hide resolved
virify.nf Outdated
Comment on lines 507 to 509
viphos_annotations = annotation.out.map { _, __, annotations -> tuple(annotations, i++) }.collect(){it -> it[0]} //{ annotations, count -> "$annotations".replace('.', '_' + count + '.') }
taxonomy_annotations = assign.out.map { _, __, taxonomy -> tuple(taxonomy, j++) }.collect(){it -> it[0]} //{ taxonomy, count -> "$taxonomy".replace('.', '_' + count + '.') }
checkv_results = checkV.out.map { _, __, quality_summary, ___ -> tuple(quality_summary, k++) }.collect(){it -> it[0]} //{ quality_summary, count -> "$quality_summary".replace('.', '_' + count + '.') }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to support the -list of assemblies, correct?

Copy link
Author

@fischer-hub fischer-hub Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running on a list of samples this was my first fix to prevent the write_gff file collision, but I think this should be sufficient as well since then the input files are enumerated and have different names too. (I just forgot to remove this afterwards, will check again)

Also this might be solved when removing the .first() from contigs.first() later on because then each sample will run in its own write_gff process instance? Or could there be annotation files from the same sample with the same name too, e.g.: from different contigs of the same assembly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The change on the contigs.first() it a but more invovled as we need to .join() the assemblies with the corresponding annotations. Otherwise, the pipeline could mix results from different assemblies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true. Maybe its better then to keep the fix and handle the bug in a separate PR.

virify.nf Show resolved Hide resolved
@fischer-hub
Copy link
Author

Thanks @fischer-hub, great stuff. I left a few comments, nothing major. The one thing I would like to add is a few more unit tests for parse_viral_pred.py, can you send me an example of the virsorter2 outputs?

Sure, will go over the changes asap and attach some example output here!

fischer-hub and others added 5 commits August 15, 2024 19:33
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
Co-authored-by: Martín Beracochea <mbc@ebi.ac.uk>
@KateSakharova KateSakharova mentioned this pull request Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants