SMILES / InChI(Key)+identifier inconsistencies in RMassBank-generated records #331

Open
schymane opened this issue May 17, 2023 · 3 comments
@schymane

Hi @meowcat @meier-rene (CC @anjuraj15 and @PaulThiessen)

We had a bizarre case where existing (3-year-old) ENTACT records failed validation when we updated only unrelated (textual) information. It turns out the SMILES contained stereochemistry information, but the InChI, InChIKey and all related identifiers didn't, so the records failed @meier-rene's updated validation suite.

Here are the SMILES in question:

ClC1=CC=C(CN2CCS\C2=N/C#N)C=N1
CN(C)C1=CC=C(C=C1)\N=N\C1=C(C=CC=C1)C(O)=O
OC(=O)C1=CC(=CC=C1O)\N=N\C1=CC=C(C=C1)S(=O)(=O)NC1=NC=CC=C1
CCCOC\C(=N/C1=C(C=C(Cl)C=C1)C(F)(F)F)N1C=CN=C1
NC1=CC=C(C=C1)\N=N\C1=CC=CC=C1

It turns out that they standardize to the non-stereochemistry form in the PubChem standardizer, and presumably also in Cactvs, which may explain how everything after the InChIKey ended up in the "stereochemistry-neutral" form. The only way we could get these records to pass validation, short of updating all InChI and identifier fields, was to switch to the non-stereo SMILES. See the example before and after the change (the "after" file has _ES at the end of its name) and the log.

Not sure if we have to build a check into RMassBank to catch this, @meowcat have you ever seen any cases like this? @meier-rene are there any other existing records that have this issue?
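
A check of this kind could start from the identifier strings alone. The following is a minimal sketch in Python, not RMassBank or MassBank validator code: the function names are hypothetical, it only looks for stereo markers in the text of the SMILES (`/`, `\`, `@`) and for stereo layers in the InChI (`/b`, `/t`, `/m`, `/s`), and it does not parse or standardize structures. The but-2-ene strings are an illustrative example and do not come from the records discussed here.

```python
import re

# Characters that encode stereochemistry in a SMILES string:
# '/' and '\' mark double-bond (cis/trans) geometry, '@' marks tetrahedral centres.
SMILES_STEREO = re.compile(r"[/\\@]")

# InChI layer prefixes that carry stereochemistry.
INCHI_STEREO_LAYERS = ("b", "t", "m", "s")


def smiles_has_stereo(smiles: str) -> bool:
    """True if the SMILES text contains any stereo marker (hypothetical helper)."""
    return bool(SMILES_STEREO.search(smiles))


def inchi_has_stereo(inchi: str) -> bool:
    """True if any InChI layer after the version prefix is a stereo layer."""
    layers = inchi.split("/")[1:]  # drop the "InChI=1S" version prefix
    # The formula layer starts with an uppercase element symbol or a digit,
    # so it can never match the lowercase layer prefixes.
    return any(layer[:1] in INCHI_STEREO_LAYERS for layer in layers)


def stereo_mismatch(smiles: str, inchi: str) -> bool:
    """True if exactly one of the two identifiers claims stereochemistry."""
    return smiles_has_stereo(smiles) != inchi_has_stereo(inchi)


if __name__ == "__main__":
    # Illustrative pair: a trans SMILES next to a stereo-free InChI.
    print(stereo_mismatch(r"C/C=C/C", "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3"))
    # The same SMILES next to an InChI with a /b layer.
    print(stereo_mismatch(r"C/C=C/C", "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"))
```

A text-level test like this only flags that the two fields disagree about whether stereochemistry is present. It cannot tell which field is right, and it says nothing about whether the stereo descriptors, when both are present, describe the same isomer.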

log.txt
MSBNK-LCSB-LU005205.txt
MSBNK-LCSB-LU005205_ES.txt

@meier-rene

Hi @schymane,
We have several thousand of these mismatches in our data at MassBank. It's not trivial to fix and requires manual work in most cases. That's why I silently accept this error in existing records but try to prevent new records with this problem from entering our collection. There is a whitelist that lets existing records pass validation if they have this particular issue.

Your new contribution is a good opportunity to solve it for the LCSB data. For the LCSB contributions it's really just the 5 compounds you listed. It seems to be related to cis/trans isomerism in imines or diimines, and in general these isomers are not stable and undergo slow conversion. From my point of view it's questionable whether these spectra should be annotated with one particular isomer. In general I would support removing the cis/trans information from these records. I will look into this in detail.

@schymane

OK great, this would explain it.

For the LCSB records we removed the stereochemistry information in the commit cross-referenced above, which agrees with both your general reasoning and the PubChem / CACTVS behaviour. Seems like the right solution overall for now.
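
For cis/trans marks specifically, dropping the information from a SMILES comes down to deleting the `/` and `\` bond symbols. A minimal sketch follows, assuming a plain text edit is acceptable. The helper name is hypothetical, this is not necessarily how the commit was produced, and tetrahedral `@` marks are deliberately left untouched.

```python
# The five SMILES listed at the top of this issue (raw strings, because
# backslashes are bond symbols here and must not be read as escapes).
STEREO_SMILES = [
    r"ClC1=CC=C(CN2CCS\C2=N/C#N)C=N1",
    r"CN(C)C1=CC=C(C=C1)\N=N\C1=C(C=CC=C1)C(O)=O",
    r"OC(=O)C1=CC(=CC=C1O)\N=N\C1=CC=C(C=C1)S(=O)(=O)NC1=NC=CC=C1",
    r"CCCOC\C(=N/C1=C(C=C(Cl)C=C1)C(F)(F)F)N1C=CN=C1",
    r"NC1=CC=C(C=C1)\N=N\C1=CC=CC=C1",
]


def strip_double_bond_stereo(smiles: str) -> str:
    """Remove the '/' and '\\' directional bond symbols (hypothetical helper).

    These symbols only ever appear as bonds in SMILES, so deleting them
    leaves the connectivity unchanged and turns each marked bond into an
    implicit single bond without geometry.
    """
    return smiles.replace("/", "").replace("\\", "")


NON_STEREO = [strip_double_bond_stereo(s) for s in STEREO_SMILES]

if __name__ == "__main__":
    for before, after in zip(STEREO_SMILES, NON_STEREO):
        print(before, "->", after)
```

The first entry, for example, becomes `ClC1=CC=C(CN2CCSC2=NC#N)C=N1`. In a real workflow it would probably be safer to let a cheminformatics toolkit do this and to regenerate the InChI and InChIKey from the result, so that all identifier fields are derived from the same structure.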

Let me know if we should take a look at this for the other mismatches in MassBank; this will also affect which records end up being annotated with the spectra in PubChem ...

@tsufz tsufz added this to the Bugs and warnings milestone May 23, 2023
@tsufz tsufz added the bug label May 23, 2023
@meowcat

meowcat commented May 30, 2023

Hi,
in fact we have a similar issue for new records. The issue is broader anyway, since I am never quite sure what stereochemistry to include in records. I have frequently defaulted to wiping out stereochemistry at the SMILES level and not claiming that the spectrum is related to any specific stereoisomer. I think this is the right thing to do for molecules with one stereocenter, but diastereomers (especially natural products with many stereocenters) might be distinguishable by LC and perhaps in some cases even by MS2.
