Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BLAST results protein to taxonomic accession assignment #317

Open
chasemc opened this issue Mar 30, 2023 · 1 comment
Open

Fix BLAST results protein to taxonomic accession assignment #317

chasemc opened this issue Mar 30, 2023 · 1 comment

Comments

@chasemc
Copy link
Member

chasemc commented Mar 30, 2023

Currently the documentation instructs and the code downloads prot.accession2taxid.gz which doesn't have all of the nr accessions.
Proteins that aren't found in prot.accession2taxid.gz are assigned to root which results in contigs becoming unclassified.
Currently this is ameliorated by using prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz, as shown below. But the code needs to be changed to handle missing accessions. Per our meeting today these should probably be assigned to None and then should be dropped before handing over to LCA.

image

Assignment to root that needs to be changed:

# If we still have missing taxids, we will set the sseqid value to the root taxid
# fill missing taxids with root_taxid
sseqid_to_taxid_df["cleaned_taxid"] = sseqid_to_taxid_df.merged_taxid.fillna(
root_taxid
)

@chasemc chasemc changed the title Fix BLAST results protein to taconomic accession assignment Fix BLAST results protein to taxonomic accession assignment Mar 30, 2023
@evanroyrees
Copy link
Collaborator

evanroyrees commented Mar 31, 2023

This could be handled in multiple areas. Either specifically when parsing taxids for NCBI or at the LCA step. Any preference?

for qseqid, qseqid_taxids in taxids.items():
# This will convert any deprecated taxids to their current taxids before reduction to LCA.
# NOTE: we only want to retrieve the set of unique taxids (not a list) for LCA query
lca_taxid = False
num_taxids = len(qseqid_taxids)
while not lca_taxid:
# Reduce all qseqid taxids to LCA
if num_taxids >= 2:
try:
lca_taxid = functools.reduce(
lambda taxid1, taxid2: self.lca(node1=taxid1, node2=taxid2),
qseqid_taxids,
)
except ValueError as err:
logger.warn(
f"Missing taxids during LCA retrieval. Setting {qseqid} ({len(qseqid_taxids):,} ORF taxids) to root taxid."
)
lca_taxid = root_taxid
missing_taxids.append(
{
"qseqid": qseqid,
"taxids": ",".join(
[str(taxid) for taxid in qseqid_taxids]
),
}
)
# If only one taxid was recovered... This is our LCA
elif num_taxids == 1:
lca_taxid = qseqid_taxids.pop()
else:
# Exception handling where input for qseqid contains no taxids e.g. num_taxids == 0:
lca_taxid = root_taxid
lca_taxids.update({qseqid: lca_taxid})
missing_df = pd.DataFrame(missing_taxids)
return lca_taxids, missing_df

# discard root taxid from set of query's taxids before #L370
qseqid_taxids.discard(root_taxid)
# or something like this?
from autometa.taxonomy.database import TaxonomyDatabase
qseqid_taxids.discard(TaxonomyDatabase.UNCLASSIFIED_TAXID)

while not lca_taxid:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants