Long InChI codes crash the database refresh #401

Open
meowcat opened this issue Jan 31, 2024 · 8 comments

@meowcat
Contributor

meowcat commented Jan 31, 2024

For records with very long InChI codes, the importer doesn't fail gracefully. No validation problems are encountered, but the import crashes while trying to write the InChI code to the DB. As a result, zero records end up in the DB. CH_IUPAC is a VARCHAR(1200).

Expected behaviour:

  1. the validator should catch the problem (though this is debatable, strictly speaking, because the MassBank record spec itself doesn't specify a maximum length)
  2. the database import should skip the problematic records
[+] Creating 1/0
 ✔ Container 5-mariadb-1  Running  0.0s
RefreshDatabase version: 2.2.6-SNAPSHOT
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
5 records to send to database. 80% Done.
java.sql.SQLSyntaxErrorException: (conn=17159) Data too long for column 'CH_IUPAC' at row 1
        at org.mariadb.jdbc.export.ExceptionFactory.createException(ExceptionFactory.java:282)
        at org.mariadb.jdbc.export.ExceptionFactory.create(ExceptionFactory.java:370)
        at org.mariadb.jdbc.message.ClientMessage.readPacket(ClientMessage.java:134)
        at org.mariadb.jdbc.client.impl.StandardClient.readPacket(StandardClient.java:883)
        at org.mariadb.jdbc.client.impl.StandardClient.readResults(StandardClient.java:822)
        at org.mariadb.jdbc.client.impl.StandardClient.readResponse(StandardClient.java:741)
        at org.mariadb.jdbc.client.impl.StandardClient.execute(StandardClient.java:665)
        at org.mariadb.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:92)
        at org.mariadb.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:337)
        at org.mariadb.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:314)
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
        at massbank.db.DatabaseManager.persistAccessionFile(DatabaseManager.java:326)
        at massbank.cli.RefreshDatabase.lambda$main$0(RefreshDatabase.java:67)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
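
The trace shows the exception escaping from the per-record lambda in `RefreshDatabase`, which ends the whole run. One way the "skip the problematic records" behaviour could look is sketched below. `SkippingImporter`, `RecordWriter` and `persistAll` are hypothetical names, not MassBank classes, and the sketch assumes each record can be written and committed independently, which this thread does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

final class SkippingImporter {

    /** Hypothetical stand-in for the real per-record write; it may throw, e.g. an SQLException. */
    interface RecordWriter<T> {
        void write(T record) throws Exception;
    }

    /**
     * Tries to write every record. A failing record is logged and collected
     * instead of aborting the loop, so the remaining records are still written.
     * Returns the records that could not be written.
     */
    static <T> List<T> persistAll(List<T> records, RecordWriter<T> writer) {
        List<T> failed = new ArrayList<>();
        for (T record : records) {
            try {
                writer.write(record);
            } catch (Exception e) {
                System.err.println("Skipping record " + record + ": " + e.getMessage());
                failed.add(record);
            }
        }
        return failed;
    }
}
```

Returning the failed records lets the caller report them at the end of the run rather than losing them in the log.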

Find attached a set of five records, one of which causes this problem. Note: This is a work-in-progress dataset used in-house and derived from Florian Huber's dataset https://zenodo.org/records/10160791 (I hope this note and the CC BY in the records fulfill the CC BY requirements...)
records.tar.gz
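
For the validator side, a minimal sketch of a length check against the column width quoted above (`CH_IUPAC` is a `VARCHAR(1200)`). `InchiLengthCheck` and `validate` are hypothetical names, not part of the existing validator, and the limit is hard-coded here only for illustration.

```java
import java.util.Optional;

final class InchiLengthCheck {

    /** Column width quoted in this issue: CH_IUPAC is a VARCHAR(1200). */
    static final int CH_IUPAC_MAX_LENGTH = 1200;

    /**
     * Returns a validation message if the InChI would not fit into the
     * CH_IUPAC column, or an empty Optional if its length is acceptable.
     */
    static Optional<String> validate(String inchi) {
        if (inchi.length() > CH_IUPAC_MAX_LENGTH) {
            return Optional.of("InChI has " + inchi.length()
                    + " characters; the CH_IUPAC column holds at most "
                    + CH_IUPAC_MAX_LENGTH);
        }
        return Optional.empty();
    }
}
```

Because the record spec sets no maximum length, a check like this might fit better as a warning or as a pre-import filter than as a hard validation error.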

@sneumann
Member

sneumann commented Jan 31, 2024

Do we have any idea what the longest possible InChI is? Today's InChI is defined for a maximum of 1000 atoms; I read somewhere about an extension to 65K. The longest InChI in
https://zenodo.org/record/6503754/files/PubChemLite_exposomics_20220429.csv
has a length of 3593, for a DNA snippet with 873 atoms: DFYPFJSPLUVPFJ-QJEDTDQSSA-N

@schymane
Member

How is that InChIKey valid? It has too many sections? (copy paste issue?)

The URL redirects OK tho (DFYPFJSPLUVPFJ-QJEDTDQSSA-N)

I thought we trimmed PCL to ~2000 but it seems that's sneaking through (MW 8000)?
It is only in PCL due to this small bit of annotation:
https://pubchem.ncbi.nlm.nih.gov/compound/DFYPFJSPLUVPFJ-QJEDTDQSSA-N#section=Drug-and-Medication-Information

@PaulThiessen might be able to answer the InChI length question for you, I am not sure ...

@PaulThiessen

I'm not actually sure about atom limits in regular InChI, but PubChem has a limit of 999 atoms (including H) for compounds (historically because that's the limit of the MOL/SDF V2000 format).

I don't think there's any particular length limit for the full InChI string. The longest one in PubChem is 4789 characters (CID 160332983).

@sneumann
Member

sneumann commented Feb 1, 2024

Indeed, the visible InChIKey was a cut-and-paste leftover. Fixed now.
The InChI specs mention a limit of 1024 atoms on p. 18.
https://www.inchi-trust.org/download/104/InChI_UserGuide.pdf
Yours, Steffen

@schymane
Member

schymane commented Feb 1, 2024

That number is surely not coincidental ... @PaulThiessen do you know if that changed in more recent versions (that documentation was 1.04, you're now on 1.06 or 1.07 right?). I never get those log files when generating InChIs ...

@PaulThiessen

We're using 1.06, although 1.07 is in the works and will be out soon. I'll ask the InChI folks directly what the current atom limit is.

@PaulThiessen

OK, yes: standard InChI in current versions still has a limit of 1024 atoms.

@schymane
Member

schymane commented Feb 2, 2024

Thanks Paul!
