⬆️ 🎨 Allow the use of gtdb taxonomy in Autometa #284

Merged

merged 50 commits into dev from gtdb_to_autometa on Dec 16, 2022

Conversation

@Sidduppal (Collaborator) commented Jul 18, 2022

  • ⬆️ 🎨 Create autometa-setup-gtdb entrypoint for creating compatible GTDB databases
  • ⬆️ gtdb_to_taxdump is now installed
  • ⬆️ 🎨 Put common functionality for parsing of taxonomic databases in the TaxonomyDatabase class (see the sketch below)
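
For context, a minimal sketch of what such a shared base class can look like (the real class lives in autometa/taxonomy/database.py; the method names and signatures below are assumptions, not the actual API):

from abc import ABC, abstractmethod
from typing import Dict

class TaxonomyDatabase(ABC):
    """Sketch of a shared interface for NCBI- and GTDB-backed taxonomies (assumed API)."""

    def __init__(self, dbdir: str):
        self.dbdir = dbdir
        # Subclasses (NCBI, GTDB) supply the dump parsers; shared logic builds on them.
        self.nodes = self.parse_nodes()
        self.names = self.parse_names()

    @abstractmethod
    def parse_nodes(self) -> Dict[int, Dict]:
        """Parse nodes.dmp into {taxid: {"parent": taxid, "rank": rank}}."""

    @abstractmethod
    def parse_names(self) -> Dict[int, str]:
        """Parse names.dmp into {taxid: scientific name}."""

    def rank(self, taxid: int) -> str:
        # Common functionality implemented once for both database types
        return self.nodes.get(taxid, {}).get("rank", "unclassified")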

Commands to replicate:

Test GTDB

Running autometa-setup-gtdb entrypoint

autometa-setup-gtdb --taxa-files /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonomy_gtdb_releases/R207/bac120_taxonomy_r207.tsv /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonomy_gtdb_releases/R207/ar53_taxonomy_r207.tsv --faa-tarball /media/bigdrive1/Databases/gtdb_proteins_aa_reps_r207.tar.gz --dbdir $HOME/gtdbTest --keep-temp

Running autometa-taxonomy-lca entrypoint

autometa-taxonomy-lca --blast /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/78Mbp/78Mbp_gtdb_blastP.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb --sseqid2taxid-output 78Mbp_sseqid2taxid_test.tsv --lca-error-taxids 78Mbp_lcaErrorTaxids_test.tsv --verbose --lca-output 78Mbp_LCAout_test.tsv

Running autometa-taxonomy-majority-vote entrypoint

autometa-taxonomy-majority-vote --lca 78Mbp_LCAout_test.tsv --output 78Mbp_gtdb_majority_vote.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --verbose --dbtype gtdb

Running autometa-taxonomy entrypoint

autometa-taxonomy --votes 78Mbp_gtdb_majority_vote.tsv --assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --output testTaxonomy --split-rank-and-write superkingdom --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb

Running autometa-summary entrypoint

autometa-binning-summary --binning-main /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.main.tsv --markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/gtdb_to_diamond --dbtype gtdb --output-stats binningSummartStats.tsv --output-taxonomy binningTaxa.tsv --output-metabins metaBins --metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna

Test NCBI

Running autometa-taxonomy-lca entrypoint

autometa-taxonomy-lca --blast /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.blastp.tsv --dbdir /media/bigdrive1/Databases/autometa_databases/ --lca-output lcaOut_ncbi.tsv --sseqid2taxid-output sseqid2taxid_ncbiTest.tsv --lca-error-taxids lcaError_ncbi.tsv --verbose

Running autometa-taxonomy-majority-vote entrypoint

autometa-taxonomy-majority-vote --lca lcaOut_ncbi.tsv --output ncbi_majority_vote.tsv --dbdir /media/bigdrive1/Databases/autometa_databases/ --verbose --dbtype ncbi

Running autometa-taxonomy entrypoint

autometa-taxonomy --votes ncbi_majority_vote.tsv --assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --output testNCBI --split-rank-and-write superkingdom --dbdir /media/bigdrive1/Databases/autometa_databases/

Running autometa-summary entrypoint

autometa-binning-summary --binning-main /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.main.tsv --markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv --metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna --dbdir  /media/bigdrive1/Databases/autometa_databases/ --output-stats binningSummartStats2.tsv --output-taxonomy binningTaxa2.tsv --output-metabins metaBins2

TODO:

  • Update documentation, e.g. add paths to GTDB database files.
  • Add docstrings for gtdb.py

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Have you followed the pipeline conventions in the contribution docs?

🎨 Decouple taxonomy functions
Replace dbdir with dbpath in vote.py line 103
@Sidduppal Sidduppal added the enhancement New feature or request label Jul 18, 2022
@Sidduppal Sidduppal requested a review from chasemc July 18, 2022 13:44

github-actions bot commented Jul 18, 2022

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 7f86c81

✅  62 tests passed
❔  34 tests were ignored
❗   9 tests had warnings

❗ Test warnings:

  • readme - README did not have a Nextflow minimum version badge.
  • readme - README did not have a Nextflow minimum version mentioned in Quick Start section.
  • schema_lint - Schema $id should be https://raw.githubusercontent.com/autometa/master/nextflow_schema.json
    Found https://raw.githubusercontent.com/autometa/main/nextflow_schema.json
  • schema_description - No description provided in schema for parameter: plaintext_email
  • schema_description - No description provided in schema for parameter: custom_config_version
  • schema_description - No description provided in schema for parameter: custom_config_base
  • schema_description - No description provided in schema for parameter: hostnames
  • schema_description - No description provided in schema for parameter: show_hidden_params
  • schema_description - No description provided in schema for parameter: singularity_pull_docker_container

❔ Tests ignored:

  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_exist - File is ignored: .github/workflows/branch.yml
  • files_exist - File is ignored: .github/workflows/ci.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: assets/nf-core-autometa_logo_light.png
  • files_exist - File is ignored: docs/usage.md
  • files_exist - File is ignored: docs/output.md
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo.png
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-autometa_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/bug_report.md
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/feature_request.md
  • nextflow_config - nextflow_config
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File does not exist: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting_comment.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File does not exist: assets/nf-core-autometa_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-autometa_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-autometa_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or foo
  • actions_ci - '.github/workflows/ci.yml' not found
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/Autometa/Autometa/.github/workflows/awstest.yml
  • template_strings - template_strings

✅ Tests passed:

Run details

  • nf-core/tools version 2.2
  • Run at 2022-11-12 16:49:23

@Sidduppal (Collaborator Author)

One thing I noticed while testing the scripts:
The NCBI class is instantiated with the keyword dbpath while the GTDB class is instantiated with dbdir. Nothing is breaking, but it's something to keep in mind as we go ahead, and we may want to make the two consistent.
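
For illustration only, a hypothetical stand-in for the two classes showing the keyword mismatch described above (the real classes live in autometa/taxonomy/ncbi.py and autometa/taxonomy/gtdb.py):

# Hypothetical stand-ins, not the actual Autometa classes
class NCBI:
    def __init__(self, dbpath: str):  # NCBI keyword as noted above
        self.dbpath = dbpath

class GTDB:
    def __init__(self, dbdir: str):   # GTDB keyword as noted above
        self.dbdir = dbdir

ncbi = NCBI(dbpath="databases/ncbi")
gtdb = GTDB(dbdir="databases/gtdb")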

🎨🐍 Add ability to download and format GTDB databases
🎨📝 Add documentation for using GTDB databases
🎨🐚🐛 Update entrypoints in bash workflows to use dbdir and dbtype
🎨🐍 Add gtdb section to default.config
Add databases.rst

codecov bot commented Jul 21, 2022

Codecov Report

Base: 27.40% // Head: 27.35% // Decreases project coverage by 0.05% ⚠️

Coverage data is based on head (54c168c) compared to base (86b9550).
Patch coverage: 30.30% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #284      +/-   ##
==========================================
- Coverage   27.40%   27.35%   -0.06%     
==========================================
  Files          48       50       +2     
  Lines        5469     5733     +264     
==========================================
+ Hits         1499     1568      +69     
- Misses       3970     4165     +195     
Flag Coverage Δ
unittests 27.35% <30.30%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
autometa/binning/large_data_mode.py 0.00% <0.00%> (ø)
autometa/config/databases.py 0.00% <0.00%> (ø)
autometa/config/utilities.py 25.00% <ø> (ø)
autometa/taxonomy/majority_vote.py 13.53% <11.53%> (+0.83%) ⬆️
autometa/binning/summary.py 41.48% <18.18%> (-0.38%) ⬇️
autometa/taxonomy/gtdb.py 21.01% <21.01%> (ø)
autometa/taxonomy/lca.py 9.37% <22.22%> (+0.57%) ⬆️
autometa/common/utilities.py 23.35% <33.33%> (+0.65%) ⬆️
autometa/taxonomy/ncbi.py 49.51% <35.00%> (-6.92%) ⬇️
autometa/taxonomy/database.py 51.85% <51.85%> (ø)
... and 9 more


🐛 Replace NCBI init kwarg dirpath to dbdir
✅ Update test_vote.py main and assign tests with main logic and updated NCBI call
✅ Update NCBI fixture in conftest.py
🔥 Remove unused types in NCBI.py
@@ -26,7 +26,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
python-version: [3.7]
python-version: [3.8]
Collaborator

Why is the python version for tests being changed here? Is this necessary for gtdb_to_taxdump installation?

Collaborator Author

It was giving an error with 3.7 (link): Literal in from typing import Union, List, Literal is only supported in Python versions >=3.8.
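
As a minimal reproduction (the function below is purely illustrative, not an Autometa API):

# typing.Literal was added in Python 3.8, so this import fails under 3.7
from typing import List, Literal, Union

def lookup_names(dbtype: Literal["ncbi", "gtdb"]) -> Union[List[str], None]:
    # Illustrative only: constrains dbtype to the two supported database types
    return [dbtype]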

🎨🔥 Remove logger comment in taxonomy/database.py
🎨 Add abstractmethod decorator to mandatory methods in TaxonomyDatabase.py
📝 Update documentation for working copy/paste for database configuration
📝 Update documentation for working copy/paste when following bash workflow
@evanroyrees evanroyrees mentioned this pull request Aug 1, 2022
autometa-env.yml Outdated
Comment on lines 31 to 32
- pip:
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump
@evanroyrees (Collaborator) Aug 4, 2022

This should be replaced by a conda distribution of gtdb_to_taxdump (one will need to be made) to follow the bioconda recipe guidelines. This will help ensure Autometa remains up to date and compatible with other bioinformatics tools. By the way, this tool can be submitted to bioconda on behalf of the author. You should be able to submit a gtdb_to_taxdump bioconda recipe by following the bioconda contributing docs.

Suggested change
- pip:
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump
- gtdb_to_taxdump>=0.1.8

Collaborator

Database download and formatting appear to be working. We just need gtdb_to_taxdump to be available on bioconda, then we can update the Autometa recipe and get this merged in.

With gtdb_to_taxdump (v 0.1.8):

Install this branch (and environment)

# In root of Autometa repo
# update env
mamba env update -n autometa -f=autometa-env.yml
# install autometa commands
make clean && make install

Set config to point to where we want formatted GTDB database files

autometa-config \
    --section databases \
    --option gtdb \
    --value $HOME/autometa-testing/databases/gtdb

Download/format GTDB databases to configured path

NOTE: $HOME is not what is actually emitted; I've replaced the actual path with $HOME for security reasons.

(autometa) evan@userserver:~/Autometa$ autometa-update-databases --update-gtdb
[08/04/2022 05:10:33 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_taxdump.py: gtdb_to_taxdump.py https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz --outdir $HOME/autometa-testing/databases/gtdb
2022-08-04 17:10:34,175 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz
2022-08-04 17:10:35,548 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 17:10:45,451 - File written: $HOME/autometa-testing/databases/gtdb/delnodes.dmp
2022-08-04 17:10:45,452 - File written: $HOME/autometa-testing/databases/gtdb/merged.dmp
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: starting GTDB databases download
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: wget https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_proteins_aa_reps.tar.gz -O $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
[08/04/2022 06:11:46 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_diamond.py: gtdb_to_diamond.py $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz $HOME/autometa-testing/databases/gtdb/names.dmp $HOME/autometa-testing/databases/gtdb/nodes.dmp --outdir $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond
2022-08-04 18:11:46,417 - Read nodes.dmp file: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 18:11:46,500 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/nodes.dmp
2022-08-04 18:11:46,500 - Reading dumpfile: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 18:11:47,561 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/names.dmp
2022-08-04 18:11:47,561 -   No. of accession<=>taxID pairs: 401808
2022-08-04 18:11:47,561 - Extracting tarball: $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
2022-08-04 18:11:47,561 -   Extracting to: gtdb_to_diamond_TMP
2022-08-04 18:24:08,234 -   No. of .faa(.gz) files: 65703
2022-08-04 18:24:08,256 - Creating accession2taxid table...
2022-08-04 18:24:08,376 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/accession2taxid.tsv
2022-08-04 18:24:08,376 - Formating & merging faa files...
2022-08-04 18:44:45,876 -   File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa
2022-08-04 18:44:45,876 -   No. of seqs. written: 200505361
2022-08-04 18:44:51,908 - Temp-dir removed: gtdb_to_diamond_TMP
[08/04/2022 06:44:51 PM DEBUG] autometa.common.external.diamond: diamond makedb --in $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa --db $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.dmnd -p 96
(autometa) evan@userserver:~/Autometa$

@evanroyrees (Collaborator)

I was looking at submitting a bioconda recipe for gtdb_to_taxdump, and the README suggests using TaxonKit for retrieving stable taxids across GTDB releases (rather than arbitrarily assigning them, as noted). Below is a screenshot from the gtdb_to_taxdump summary section:

[Screenshot of the gtdb_to_taxdump README summary section]


Next Steps Towards Stability

TaxonKit is available via bioconda, so it would be easy to include in autometa-env.yml, and the documentation for generating these taxdump files from the GTDB looks straightforward, as outlined in the GTDB-taxdump repo.

TaxonKit conda page: https://anaconda.org/bioconda/taxonkit
GTDB-taxdump page (using TaxonKit): https://github.com/shenwei356/gtdb-taxdump

NOTE: TaxonKit v0.12 or greater should be used. (ref)

The taxdump files may also be downloaded directly from the releases page: https://github.com/shenwei356/gtdb-taxdump/releases, which reduces the compute requirements.

That being said, if we would like to generate the GTDB taxdump files using this, the steps are outlined here: https://github.com/shenwei356/gtdb-taxdump#steps

Looking into the future, this is probably the more appropriate route, as taxids may be relied upon across future GTDB releases.

P.S. There is also a related project linked (https://github.com/shenwei356/ictv-taxdump) that generates an NCBI-style taxdump for viruses and may be useful in the future (@kaw97, is this of any interest to you?)

@chasemc (Member) commented Sep 2, 2022

Not sure if this affects what y'all are doing:
shenwei356/gtdb-taxdump#2

@shenwei356

Not sure if this affects what y'all are doing: shenwei356/gtdb-taxdump#2

Only merged.dmp and delnodes.dmp are affected. If you just need to use the taxonomy data of the current version, say R207, don't worry.

@chasemc (Member) commented Sep 7, 2022

Thanks @shenwei356

@Sidduppal (Collaborator Author) commented Sep 18, 2022

@WiscEvan The scripts have been updated to use TaxonKit's already-generated files. I decided to use merged.dmp and delnodes.dmp since they were provided with the taxdump, although if you think the issue mentioned above by @chasemc would significantly skew the results, we can remove their usage. The commands for testing are below:

Setting up the database

  • The taxdump files need to be downloaded and extracted - link
autometa-config     \
--section databases     \
--option gtdb     \
--value /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test
autometa-update-databases --update-gtdb

Run LCA

autometa-taxonomy-lca \
--blast 78mbp_metagenome.blastp.gtdb.tsv --dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--sseqid2taxid-output 78Mbp_sseqid2taxid_test.tsv \
--lca-error-taxids 78Mbp_lcaErrorTaxids_test.tsv \
--verbose \
--lca-output 78Mbp_LCAout_test.tsv

Run Majority vote

autometa-taxonomy-majority-vote \
--lca 78Mbp_LCAout_test.tsv \
--output 78Mbp_gtdb_majority_vote.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--verbose \
--dbtype gtdb

Run taxonomy

autometa-taxonomy \
--votes 78Mbp_gtdb_majority_vote.tsv \
--assembly /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna \
--output testTaxonomy \
--split-rank-and-write superkingdom \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb

Run summary

autometa-binning-summary \
--binning-main 78_binningMain_gtdb.tsv \
--markers /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.markers.tsv \
--dbdir /media/bigdrive1/sidd/autometa_aim1_1/data/external/gtdbData_r207v2/test1/taxonKit/gtdb-taxdump/R207/test \
--dbtype gtdb \
--output-stats binningSummartStats.tsv --output-taxonomy binningTaxa.tsv \
--output-metabins metaBins \
--metagenome /media/bigdrive1/sidd/nextflow_trial/autometa_runs/78mbp_manual/interim/78mbp_metagenome.filtered.fna

Using this approach, we don't need to add any additional dependencies either. Let me know what you think.

@evanroyrees (Collaborator) left a comment

  1. It does not appear that you are using delnodes.dmp in gtdb.py other than reading the file.
  2. It does not appear that you are using merged.dmp in gtdb.py other than reading the file (see the sketch after this comment for what applying it might look like).
  3. Tests are currently failing and should be double-checked.

There are some other comments attached. I'll review more after these are addressed. 👍 🚧 👷
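
A minimal sketch (not the Autometa implementation) of what actually applying merged.dmp could look like: retired taxids are translated to their current ids before any rank/name lookups. The file format is assumed to follow the NCBI-style "old_taxid | new_taxid |" layout of the dump files discussed above.

import os
from typing import Dict

def load_merged(dbdir: str) -> Dict[int, int]:
    # Parse merged.dmp into {old_taxid: new_taxid}
    merged = {}
    with open(os.path.join(dbdir, "merged.dmp")) as fh:
        for line in fh:
            if not line.strip():
                continue
            old_taxid, new_taxid = [field.strip() for field in line.split("|")[:2]]
            merged[int(old_taxid)] = int(new_taxid)
    return merged

def resolve_taxid(taxid: int, merged: Dict[int, int]) -> int:
    # Follow the merged mapping so downstream lookups use current taxids
    return merged.get(taxid, taxid)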

autometa/config/default.config (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/config/databases.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
Comment on lines 63 to 67
for dirpath, dirs, files in os.walk(reps_faa):
for file in files:
if file.endswith("_protein.faa"):
faa_file = os.path.join(dirpath, file)
# Regex to get the genome accession from the file path
Collaborator

Rather than walking with multiple for loops, why not just glob _protein.faa?

import glob
import os
import re
from typing import Dict

# Recursive globbing requires a '**' path component together with recursive=True
genome_protein_faa_filepaths = glob.glob(os.path.join(reps_faa, '**', '*_protein.faa'), recursive=True)

# One-liner (assumes every filepath contains a GCA_/GCF_ accession):
faa_index = {
    re.search(r"GCA_\d+\.\d?|GCF_\d+\.\d?", fpath).group(): fpath
    for fpath in genome_protein_faa_filepaths
}

# Or, skipping any filepath without a recognizable accession:
faa_index: Dict[str, str] = {}
for genome_protein_faa_filepath in genome_protein_faa_filepaths:
    # Regex to get the genome accession from the file path
    genome_acc_search = re.search(r"GCA_\d+\.\d?|GCF_\d+\.\d?", genome_protein_faa_filepath)
    if not genome_acc_search:
        continue
    genome_accession = genome_acc_search.group()
    faa_index[genome_accession] = genome_protein_faa_filepath
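
Hypothetical usage of the faa_index mapping above (the accession value is just an example):

faa_file = faa_index.get("GCA_000005825.2")
if faa_file is not None:
    # Pair the genome's protein file with its taxid downstream (e.g. via accession2taxid)
    print(faa_file)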

autometa/taxonomy/gtdb.py (outdated, resolved)
autometa/taxonomy/gtdb.py (outdated, resolved)
Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
@evanroyrees (Collaborator) left a comment

Still not sure why you are running into an error with the autometa-large-data-mode.sh workflow. From inspecting your output directory, it looks like only 11 contigs were classified as bacteria, so maybe this is causing the issue? Large data mode was not intended for 11 contigs.

cache="${outdir}/${simpleName}_${kingdom}_cache"
output_binning="${outdir}/${simpleName}.${kingdom}.${clustering_method}.tsv"
output_main="${outdir}/${simpleName}.${kingdom}.${clustering_method}.main.tsv"
cache="${outdir}/${simple_name}_${kingdom}_cache"
Collaborator

This was not a typo; this should read e.g. outdir/sample_name_gtdb_bacteria_cache or outdir/sample_name_ncbi_bacteria_cache

Suggested change
cache="${outdir}/${simple_name}_${kingdom}_cache"
cache="${outdir}/${simple_name}_${dbtype}_${kingdom}_cache"

echo "Running GTDB taxon assignment step."

# output
gtdb_orfs="${outdir}/${prefix}.orfs.faa"
Collaborator

Do you think gtdb_orfs is misleading? These are bacterial and archaeal ORFs identified against NCBI's nr database, so maybe gtdb_input_orfs or something else would be more appropriate? What do you think?

Suggested change
gtdb_orfs="${outdir}/${prefix}.orfs.faa"
gtdb_input_orfs="${outdir}/${prefix}.orfs.faa"

# $blast --> Generated from step 5.2
# $gtdb --> User Input
# $dbtype --> Updated according to $taxa_routine
dbtype="gtdb"
Collaborator

dbtype="gtdb" is also on line 247. Is this necessary, or can we delete this?

Comment on lines 319 to 321
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv
Collaborator

gtdb, i.e. $dbtype, is already included in $prefix

Suggested change
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv
# e.g. ${outdir}/${prefix}.bacteria.fna
# e.g. ${outdir}/${prefix}.archaea.fna
# e.g. ${outdir}/${prefix}.taxonomy.tsv


kingdom_fasta="${outdir}/${prefix}.${kingdom}.fna"

contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt"
Collaborator

This writes sequence headers with the _ suffix. This would be more appropriately named {...}.orfPrefixes.txt, with the variable name changed to orf_prefixes.

Suggested change
contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt"
orf_prefixes="${outdir}/${prefix}.${kingdom}.orfPrefixes.txt"

NOTE: contig_ids will need to be changed to orf_prefixes where appropriate as well

sed 's/$/_/' | \
cut -f1 -d" " > $contig_ids
# Retrieve ORF IDs from contig IDs
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids
Collaborator

NOTE: See comment above

Suggested change
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids
grep -f $orf_prefixes $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids

# Step 5.2: Run blastp
# input:
# $gtdb_input_orfs --> Generated from step 5.1
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script)
@evanroyrees (Collaborator) Nov 11, 2022

I ended up using find here on CHTC, which may be a little more robust. Do you think this is appropriate?

Suggested change
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script)
gtdb_dmnd_db=$(find $gtdb -name "gtdb.dmnd") # generated using autometa-setup-gtdb (Must be performed prior to using this script)

Collaborator Author

Not a bad idea. I can do that.

📝 Fix incorrect unclustered.fasta path
@evanroyrees (Collaborator) left a comment

Looks good. 👍 Release forthcoming

@evanroyrees evanroyrees merged commit 9772257 into dev Dec 16, 2022
@evanroyrees evanroyrees deleted the gtdb_to_autometa branch December 20, 2022 17:05