⬆️ 🎨 Allow the use of gtdb taxonomy in Autometa #284

Merged

merged 50 commits into dev from gtdb_to_autometa on Dec 16, 2022

Conversation

@Sidduppal (Collaborator) commented Jul 18, 2022

  • ⬆️ 🎨 Create autometa-setup-gtdb entrypoint for creating compatible GTDB databases
  • ⬆️ gtdb_to_taxdump is now installed
  • ⬆️ 🎨 Put common functionality for parsing of taxonomic databases in the TaxonomyDatabase class (see the sketch below)
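
For context, a minimal sketch of what such a shared base class can look like (the real class lives in autometa/taxonomy/database.py; the method names and signatures below are assumptions, not the actual API):

from abc import ABC, abstractmethod
from typing import Dict

class TaxonomyDatabase(ABC):
    """Sketch of a shared interface for NCBI- and GTDB-backed taxonomies (assumed API)."""

    def __init__(self, dbdir: str):
        self.dbdir = dbdir
        # Subclasses (NCBI, GTDB) supply the dump parsers; shared logic builds on them.
        self.nodes = self.parse_nodes()
        self.names = self.parse_names()

    @abstractmethod
    def parse_nodes(self) -> Dict[int, Dict]:
        """Parse nodes.dmp into {taxid: {"parent": taxid, "rank": rank}}."""

    @abstractmethod
    def parse_names(self) -> Dict[int, str]:
        """Parse names.dmp into {taxid: scientific name}."""

    def rank(self, taxid: int) -> str:
        # Common functionality implemented once for both database types
        return self.nodes.get(taxid, {}).get("rank", "unclassified")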

Commands to replicate:

Test GTDB

Running autometa-setup-gtdb entrypoint

autometa-setup-gtdb --taxa-files /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonomy_gtdb_releases/R207/bac120_taxonomy_r207.tsv /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonomy_gtdb_releases/R207/ar53_taxonomy_r207.tsv --faa-tarball /media/bigdrive1/Databases/gtdb_proteins_aa_reps_r207.tar.gz --dbdir $HOME/gtdbTest --keep-temp

Running autometa-taxonomy-lca entrypoint

autometa-taxonomy-lca --blast /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/78Mbp/78Mbp_gtdb_blastP.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb --sseqid2taxid-output 78Mbp_sseqid2taxid_test.tsv --lca-error-taxids 78Mbp_lcaErrorTaxids_test.tsv --verbose --lca-output 78Mbp_LCAout_test.tsv

Running autometa-taxonomy-majority-vote entrypoint

autometa-taxonomy-majority-vote --lca 78Mbp_LCAout_test.tsv --output 78Mbp_gtdb_majority_vote.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --verbose --dbtype gtdb

Running autometa-taxonomy entrypoint

autometa-taxonomy --votes 78Mbp_gtdb_majority_vote.tsv --assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --output testTaxonomy --split-rank-and-write superkingdom --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb

Running autometa-summary entrypoint

autometa-binning-summary --binning-main /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.main.tsv --markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb --output-stats binningSummartStats.tsv --output-taxonomy binningTaxa.tsv --output-metabins metaBins --metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna

Test NCBI

Running autometa-taxonomy-lca entrypoint

autometa-taxonomy-lca --blast /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.blastp.tsv --dbdir /media/bigdrive1/Databases/autometa_databases/ --lca-output lcaOut_ncbi.tsv --sseqid2taxid-output sseqid2taxid_ncbiTest.tsv --lca-error-taxids lcaError_ncbi.tsv --verbose

Running autometa-taxonomy-majority-vote entrypoint

autometa-taxonomy-majority-vote --lca lcaOut_ncbi.tsv --output ncbi_majority_vote.tsv --dbdir /media/bigdrive1/Databases/autometa_databases/ --verbose --dbtype ncbi

Running autometa-taxonomy entrypoint

autometa-taxonomy --votes ncbi_majority_vote.tsv --assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --output testNCBI --split-rank-and-write superkingdom --dbdir /media/bigdrive1/Databases/autometa_databases/

Running autometa-summary entrypoint

autometa-binning-summary --binning-main /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.main.tsv --markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv --metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --dbdir  /media/bigdrive1/Databases/autometa_databases/ --output-stats binningSummartStats2.tsv --output-taxonomy binningTaxa2.tsv --output-metabins metaBins2

TODO:

  • Update documentation, e.g. add paths to GTDB database files.
  • Add docstrings for gtdb.py

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Have you followed the pipeline conventions in the contribution docs?

🎨 Decouple taxonomy functions
Replace dbdir with dbpath in vote.py line 103
@Sidduppal Sidduppal added the enhancement New feature or request label Jul 18, 2022
@Sidduppal Sidduppal requested a review from chasemc July 18, 2022 13:44

github-actions bot commented Jul 18, 2022

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 7f86c81

✅  62 tests passed
❔  34 tests were ignored
❗   9 tests had warnings

❗ Test warnings:

  • readme - README did not have a Nextflow minimum version badge.
  • readme - README did not have a Nextflow minimum version mentioned in Quick Start section.
  • schema_lint - Schema $id should be https://raw.githubusercontent.com/autometa/master/nextflow_schema.json
    Found https://raw.githubusercontent.com/autometa/main/nextflow_schema.json
  • schema_description - No description provided in schema for parameter: plaintext_email
  • schema_description - No description provided in schema for parameter: custom_config_version
  • schema_description - No description provided in schema for parameter: custom_config_base
  • schema_description - No description provided in schema for parameter: hostnames
  • schema_description - No description provided in schema for parameter: show_hidden_params
  • schema_description - No description provided in schema for parameter: singularity_pull_docker_container

❔ Tests ignored:

  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_exist - File is ignored: .github/workflows/branch.yml
  • files_exist - File is ignored: .github/workflows/ci.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: assets/nf-core-autometa_logo_light.png
  • files_exist - File is ignored: docs/usage.md
  • files_exist - File is ignored: docs/output.md
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo.png
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/bug_report.md
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/feature_request.md
  • nextflow_config - nextflow_config
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File does not exist: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting_comment.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File does not exist: assets/nf-core-autometa_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-autometa_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-autometa_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or foo
  • actions_ci - '.github/workflows/ci.yml' not found
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/Autometa/Autometa/.github/workflows/awstest.yml
  • template_strings - template_strings

✅ Tests passed:

Run details

  • nf-core/tools version 2.2
  • Run at 2022-11-12 16:49:23

@Sidduppal (Collaborator Author)

One thing I noticed while testing the scripts:
The NCBI class is instantiated with the keyword dbpath while the GTDB class is instantiated with dbdir. Nothing is breaking, but it's something to keep in mind as we go ahead, and we may want to make the two consistent.
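
For illustration only, a hypothetical stand-in for the two classes showing the keyword mismatch described above (the real classes live in autometa/taxonomy/ncbi.py and autometa/taxonomy/gtdb.py):

# Hypothetical stand-ins, not the actual Autometa classes
class NCBI:
    def __init__(self, dbpath: str):  # NCBI keyword as noted above
        self.dbpath = dbpath

class GTDB:
    def __init__(self, dbdir: str):   # GTDB keyword as noted above
        self.dbdir = dbdir

ncbi = NCBI(dbpath="databases/ncbi")
gtdb = GTDB(dbdir="databases/gtdb")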

🎨🐍 Add ability to download and format GTDB databases
🎨📝 Add documentation for using GTDB databases
🎨🐚🐛 Update entrypoints in bash workflows to use dbdir and dbtype
🎨🐍 Add gtdb section to default.config
Add databases.rst

codecov bot commented Jul 21, 2022

Codecov Report

Base: 27.40% // Head: 27.35% // Decreases project coverage by 0.05% ⚠️

Coverage data is based on head (54c168c) compared to base (86b9550).
Patch coverage: 30.30% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #284      +/-   ##
==========================================
- Coverage   27.40%   27.35%   -0.06%     
==========================================
  Files          48       50       +2     
  Lines        5469     5733     +264     
==========================================
+ Hits         1499     1568      +69     
- Misses       3970     4165     +195     
Flag Coverage Δ
unittests 27.35% <30.30%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
autometa/binning/large_data_mode.py 0.00% <0.00%> (ø)
autometa/config/databases.py 0.00% <0.00%> (ø)
autometa/config/utilities.py 25.00% <ø> (ø)
autometa/taxonomy/majority_vote.py 13.53% <11.53%> (+0.83%) ⬆️
autometa/binning/summary.py 41.48% <18.18%> (-0.38%) ⬇️
autometa/taxonomy/gtdb.py 21.01% <21.01%> (ø)
autometa/taxonomy/lca.py 9.37% <22.22%> (+0.57%) ⬆️
autometa/common/utilities.py 23.35% <33.33%> (+0.65%) ⬆️
autometa/taxonomy/ncbi.py 49.51% <35.00%> (-6.92%) ⬇️
autometa/taxonomy/database.py 51.85% <51.85%> (ø)
... and 9 more


🐛 Replace NCBI init kwarg dirpath to dbdir
✅ Update test_vote.py main and assign tests with main logic and updated NCBI call
✅ Update NCBI fixture in conftest.py
🔥 Remove unused types in NCBI.py
@@ -26,7 +26,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
python-version: [3.7]
python-version: [3.8]
Collaborator

Why is the python version for tests being changed here? Is this necessary for gtdb_to_taxdump installation?

Collaborator Author

It was giving an error with 3.7 (link): Literal in from typing import Union, List, Literal is only supported in Python versions >=3.8.
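
As a minimal reproduction (the function below is purely illustrative, not an Autometa API):

# typing.Literal was added in Python 3.8, so this import fails under 3.7
from typing import List, Literal, Union

def lookup_names(dbtype: Literal["ncbi", "gtdb"]) -> Union[List[str], None]:
    # Illustrative only: constrains dbtype to the two supported database types
    return [dbtype]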

🎨🔥 Remove logger comment in taxonomy/database.py
🎨 Add abstractmethod decorator to mandatory methods in TaxonomyDatabase.py
📝 Update documentation for working copy/paste for database configuration
📝 Update documentation for working copy/paste when following bash workflow
@evanroyrees evanroyrees mentioned this pull request Aug 1, 2022
autometa-env.yml Outdated
Comment on lines 31 to 32
- pip:
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump
@evanroyrees (Collaborator) Aug 4, 2022

This should be replaced by a conda distribution of gtdb_to_taxdump (one will need to be made) to follow the bioconda recipe guidelines. This will help ensure Autometa remains up to date and compatible with other bioinformatics tools. By the way, this tool can be submitted to bioconda on behalf of the author. You should be able to submit a gtdb_to_taxdump bioconda recipe by following the bioconda contributing docs.

Suggested change
- pip:
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump
- gtdb_to_taxdump>=0.1.8

Collaborator

Database download and formatting appear to be working. We just need gtdb_to_taxdump to be available on bioconda, then we can update the Autometa recipe and get this merged in.

With gtdb_to_taxdump (v 0.1.8):

Install this branch (and environment)

# In root of Autometa repo
# update env
mamba env update -n autometa -f=autometa-env.yml
# install autometa commands
make clean && make install

Set config to point to where we want formatted GTDB database files

autometa-config \
    --section databases \
    --option gtdb \
    --value $HOME/autometa-testing/databases/gtdb

Download/format GTDB databases to configured path

NOTE: $HOME is not what is actually emitted; I've replaced the actual path with $HOME for security reasons.

(autometa) evan@userserver:~/Autometa$ autometa-update-databases --update-gtdb
[08/04/2022 05:10:33 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_taxdump.py: gtdb_to_taxdump.py https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz --outdir $HOME/autometa-testing/databases/gtdb
2022-08-04 17:10:34,175 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz
2022-08-04 17:10:35,548 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 17:10:45,451 - File written: $HOME/autometa-testing/databases/gtdb/delnodes.dmp
2022-08-04 17:10:45,452 - File written: $HOME/autometa-testing/databases/gtdb/merged.dmp
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: starting GTDB databases download
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: wget https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_proteins_aa_reps.tar.gz -O $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
[08/04/2022 06:11:46 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_diamond.py: gtdb_to_diamond.py $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz $HOME/autometa-testing/databases/gtdb/names.dmp $HOME/autometa-testing/databases/gtdb/nodes.dmp --outdir $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond
2022-08-04 18:11:46,417 - Read nodes.dmp file: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 18:11:46,500 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/nodes.dmp
2022-08-04 18:11:46,500 - Reading dumpfile: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 18:11:47,561 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/names.dmp
2022-08-04 18:11:47,561 -   No. of accession<=>taxID pairs: 401808
2022-08-04 18:11:47,561 - Extracting tarball: $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
2022-08-04 18:11:47,561 -   Extracting to: gtdb_to_diamond_TMP
2022-08-04 18:24:08,234 -   No. of .faa(.gz) files: 65703
2022-08-04 18:24:08,256 - Creating accession2taxid table...
2022-08-04 18:24:08,376 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/accession2taxid.tsv
2022-08-04 18:24:08,376 - Formating & merging faa files...
2022-08-04 18:44:45,876 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa
2022-08-04 18:44:45,876 -   No. of seqs. written: 200505361
2022-08-04 18:44:51,908 - Temp-dir removed: gtdb_to_diamond_TMP
[08/04/2022 06:44:51 PM DEBUG] autometa.common.external.diamond: diamond makedb --in $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa --db $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.dmnd -p 96
(autometa) evan@userserver:~/Autometa$

@evanroyrees (Collaborator)

I was looking at submitting a bioconda recipe for gtdb_to_taxdump, and the README suggests using TaxonKit for retrieving stable taxids across GTDB releases (rather than arbitrarily assigning them, as noted). Below is a screenshot from the gtdb_to_taxdump summary section:

[Screenshot of the gtdb_to_taxdump README summary section]


Next Steps Towards Stability

TaxonKit is available via bioconda, so it would be easy to include in autometa-env.yml, and the documentation for generating these taxdump files from the GTDB looks straightforward, as outlined in the GTDB-taxdump repo.

TaxonKit conda page: https://anaconda.org/bioconda/taxonkit
GTDB-taxdump page (using TaxonKit): https://github.com/shenwei356/gtdb-taxdump

NOTE: TaxonKit v0.12 or greater should be used. (ref)

The taxdump files may also be downloaded directly from the releases page: https://github.com/shenwei356/gtdb-taxdump/releases, which reduces the compute requirements.

That being said, if we would like to generate the GTDB taxdump files using this, the steps are outlined here: https://github.com/shenwei356/gtdb-taxdump#steps

Looking into the future, this is probably the more appropriate route, as taxids may be relied upon across future GTDB releases.

P.S. There is also a related project linked (https://github.com/shenwei356/ictv-taxdump) that generates an NCBI-style taxdump for viruses and may be useful in the future (@kaw97, is this of any interest to you?)

@chasemc (Member) commented Sep 2, 2022

Not sure if this affects what y'all are doing:
shenwei356/gtdb-taxdump#2

@shenwei356

Not sure if this affects what y'all are doing: shenwei356/gtdb-taxdump#2

Only merged.dmp and delnodes.dmp are affected. If you just need to use the taxonomy data of the current version, say R207, don't worry.

@chasemc (Member) commented Sep 7, 2022

Thanks @shenwei356

@Sidduppal (Collaborator Author) commented Sep 18, 2022

@WiscEvan The scripts have been updated to use TaxonKit's already-generated files. I decided to use merged.dmp and delnodes.dmp since they were provided with the taxdump, although if you think the issue mentioned above by @chasemc would significantly skew the results, we can remove their usage. The commands for testing are below:

Setting up the database

  • The taxdump files need to be downloaded and extracted - link
autometa-config     \
--section databases     \
--option gtdb     \
--value /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test
autometa-update-databases --update-gtdb

Run LCA

autometa-taxonomy-lca \
--blast 78mbp_metagenome.blastp.gtdb.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--sseqid2taxid-output 78Mbp_sseqid2taxid_test.tsv \
--lca-error-taxids 78Mbp_lcaErrorTaxids_test.tsv \
--verbose \
--lca-output 78Mbp_LCAout_test.tsv

Run Majority vote

autometa-taxonomy-majority-vote \
--lca 78Mbp_LCAout_test.tsv \
--output 78Mbp_gtdb_majority_vote.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--verbose \
--dbtype gtdb

Run taxonomy

autometa-taxonomy \
--votes 78Mbp_gtdb_majority_vote.tsv \
--assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna \
--output testTaxonomy \
--split-rank-and-write superkingdom \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb

Run summary

autometa-binning-summary \
--binning-main 78_binningMain_gtdb.tsv \
--markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--output-stats binningSummartStats.tsv --output-taxonomy binningTaxa.tsv \
--output-metabins metaBins \
--metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna

Using this approach, we don't need to add any additional dependencies either. Let me know what you think.

@evanroyrees (Collaborator) left a comment

  1. It does not appear that you are using delnodes.dmp in gtdb.py other than reading the file.
  2. It does not appear that you are using merged.dmp in gtdb.py other than reading the file (see the sketch after this comment for what applying it might look like).
  3. Tests are currently failing and should be double-checked.

There are some other comments attached. I'll review more after these are addressed. 👍 🚧 👷
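
A minimal sketch (not the Autometa implementation) of what actually applying merged.dmp could look like: retired taxids are translated to their current ids before any rank/name lookups. The file format is assumed to follow the NCBI-style "old_taxid | new_taxid |" layout of the dump files discussed above.

import os
from typing import Dict

def load_merged(dbdir: str) -> Dict[int, int]:
    # Parse merged.dmp into {old_taxid: new_taxid}
    merged = {}
    with open(os.path.join(dbdir, "merged.dmp")) as fh:
        for line in fh:
            if not line.strip():
                continue
            old_taxid, new_taxid = [field.strip() for field in line.split("|")[:2]]
            merged[int(old_taxid)] = int(new_taxid)
    return merged

def resolve_taxid(taxid: int, merged: Dict[int, int]) -> int:
    # Follow the merged mapping so downstream lookups use current taxids
    return merged.get(taxid, taxid)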

autometa/config/default.config (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
Comment on lines 63 to 67
for dirpath, dirs, files in os.walk(reps_faa):
for file in files:
if file.endswith("_protein.faa"):
faa_file = os.path.join(dirpath, file)
# Regex to get the genome accession from the file path
Collaborator

Rather than walking with multiple for loops, why not just glob _protein.faa?

import glob
import os
import re
from typing import Dict

# Recursive globbing requires a '**' path component together with recursive=True
genome_protein_faa_filepaths = glob.glob(os.path.join(reps_faa, '**', '*_protein.faa'), recursive=True)

# One-liner (assumes every filepath contains a GCA_/GCF_ accession):
faa_index = {
    re.search(r"GCA_\d+\.\d?|GCF_\d+\.\d?", fpath).group(): fpath
    for fpath in genome_protein_faa_filepaths
}

# Or, skipping any filepath without a recognizable accession:
faa_index: Dict[str, str] = {}
for genome_protein_faa_filepath in genome_protein_faa_filepaths:
    # Regex to get the genome accession from the file path
    genome_acc_search = re.search(r"GCA_\d+\.\d?|GCF_\d+\.\d?", genome_protein_faa_filepath)
    if not genome_acc_search:
        continue
    genome_accession = genome_acc_search.group()
    faa_index[genome_accession] = genome_protein_faa_filepath
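
Hypothetical usage of the faa_index mapping above (the accession value is just an example):

faa_file = faa_index.get("GCA_000005825.2")
if faa_file is not None:
    # Pair the genome's protein file with its taxid downstream (e.g. via accession2taxid)
    print(faa_file)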

autometa/taxonomy/gtdb.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
@evanroyrees (Collaborator) left a comment

Still not sure why you are running into an error with the autometa-large-data-mode.sh workflow. From inspecting your output directory, it looks like only 11 contigs were classified as bacteria, so maybe this is causing the issue? Large data mode was not intended for 11 contigs.

cache="${outdir}/${simpleName}_${kingdom}_cache"
output_binning="${outdir}/${simpleName}.${kingdom}.${clustering_method}.tsv"
output_main="${outdir}/${simpleName}.${kingdom}.${clustering_method}.main.tsv"
cache="${outdir}/${simple_name}_${kingdom}_cache"
Collaborator

This was not a typo; this should read e.g. outdir/sample_name_gtdb_bacteria_cache or outdir/sample_name_ncbi_bacteria_cache

Suggested change
cache="${outdir}/${simple_name}_${kingdom}_cache"
cache="${outdir}/${simple_name}_${dbtype}_${kingdom}_cache"

echo "Running GTDB taxon assignment step."

# output
gtdb_orfs="${outdir}/${prefix}.orfs.faa"
Collaborator

Do you think gtdb_orfs is misleading? These are bacterial and archaeal ORFs identified against NCBI's nr database, so maybe gtdb_input_orfs or something else would be more appropriate? What do you think?

Suggested change
gtdb_orfs="${outdir}/${prefix}.orfs.faa"
gtdb_input_orfs="${outdir}/${prefix}.orfs.faa"

# $blast --> Generated from step 5.2
# $gtdb --> User Input
# $dbtype --> Updated according to $taxa_routine
dbtype="gtdb"
Collaborator

dbtype="gtdb" is also on line 247. Is this necessary, or can we delete this?

Comment on lines 319 to 321
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv
Collaborator

gtdb, i.e. $dbtype, is already included in $prefix

Suggested change
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv
# e.g. ${outdir}/${prefix}.bacteria.fna
# e.g. ${outdir}/${prefix}.archaea.fna
# e.g. ${outdir}/${prefix}.taxonomy.tsv


kingdom_fasta="${outdir}/${prefix}.${kingdom}.fna"

contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt"
Collaborator

This writes sequence headers with the _ suffix. This would be more appropriately named {...}.orfPrefixes.txt, with the variable name changed to orf_prefixes.

Suggested change
contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt"
orf_prefixes="${outdir}/${prefix}.${kingdom}.orfPrefixes.txt"

NOTE: contig_ids will need to be changed to orf_prefixes where appropriate as well

sed 's/$/_/' | \
cut -f1 -d" " > $contig_ids
# Retrieve ORF IDs from contig IDs
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids
Collaborator

NOTE: See comment above

Suggested change
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids
grep -f $orf_prefixes $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids

# Step 5.2: Run blastp
# input:
# $gtdb_input_orfs --> Generated from step 5.1
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script)
@evanroyrees (Collaborator) Nov 11, 2022

I ended up using find here on CHTC, which may be a little more robust. Do you think this is appropriate?

Suggested change
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script)
gtdb_dmnd_db=$(find $gtdb -name "gtdb.dmnd") # generated using autometa-setup-gtdb (Must be performed prior to using this script)

Collaborator Author

Not a bad idea. I can do that.

📝 Fix incorrect unclustered.fasta path
@evanroyrees (Collaborator) left a comment

Looks good. 👍 Release forthcoming

@evanroyrees evanroyrees merged commit 9772257 into dev Dec 16, 2022
@evanroyrees evanroyrees deleted the gtdb_to_autometa branch December 20, 2022 17:05