Multiple problems with the indexcard output format #723
Comments
This is a format that really nobody uses anymore. I propose to remove it. |
Yes. Not having looked at it yet, I wonder whether it is easier to fix than to remove. Last time something was removed, it had to be added back. However, I can certainly follow instructions |
Either way... |
Here are some more details about the exceptions that were thrown. Some don't seem connected to the output; they were problems encountered before the output, in a stage I call reading here. Some seem to have been for non-indexcard formats.
|
Hmm. Some of these seem errors in the format code. Some are legit exceptions in the data that should be handled. Thanks! |
If I had had the corresponding files, I would have already volunteered to do it. Feel free to reassign. |
I updated the table with a reference to a file for each error kind. The attached zip contains all the referenced input files. I'll get some example sentences for each error.
|
Thank you. I'll get to them soon. |
Thanks @kwalcock. Some comments: The errors referred to by the rows with N/A in the sentence column are triggered not by a sentence but by the assembly procedure, which I believe is a form of aggregation of multiple interactions. The corresponding documents trigger the error. |
I'll update this as they are figured out.
|
This java.lang.NegativeArraySizeException for PMC7176272 is very suspicious. It doesn't occur anywhere near any of our code that could be subtracting incorrectly. The 300KB input file takes a very long time to process: my computer ran overnight, and I see in the log that Enrique worked on it for 22 hours. When I paused it periodically, I noticed that the stack was very deep. It looked like there was about one stack frame for each of some 4000+ mentions, and it was building up some monster json structure. I couldn't easily tell whether there was some kind of loop, but I wonder if some Mentions are linked to each other in a circle. Generating the output involves buffers that get resized. If a buffer's requested size reaches Integer.MAX_VALUE + 1, which is only 2,147,483,648 or 2GB, the int wraps around to a negative number and this exception can be thrown. I think the program is trying to build 2GB of json output in a string, and it might take all night to do that. Has something like this happened before? I'll accept any hints that anyone can offer before looking again. |
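To make the overflow described above concrete, here is a minimal Java sketch. The class `OverflowDemo` and its doubling growth policy are assumptions made for illustration only; they are not the code path in Reach or in any particular JSON library. The sketch only shows that an int size computation past Integer.MAX_VALUE wraps to a negative value, and that allocating an array with that value throws the exception seen in the log.

```java
// Illustrative only: a hypothetical buffer growth policy, not Reach's code.
class OverflowDemo {
    // Doubles the capacity using plain int arithmetic, which wraps silently
    // once the result no longer fits in 32 bits.
    static int doubledCapacity(int current) {
        return current * 2;
    }

    // Returns true if allocating an array of the given size throws
    // NegativeArraySizeException. Only call this with a negative size,
    // since a large positive size would really allocate memory.
    static boolean allocationFails(int capacity) {
        try {
            byte[] buffer = new byte[capacity];
            return buffer.length != capacity;
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int current = 1 << 30;                 // 1,073,741,824 (1GB)
        int next = doubledCapacity(current);   // 2^31 does not fit in an int
        System.out.println(next);              // a negative number
        System.out.println(allocationFails(next));
    }
}
```

Whether the failing buffer in this case grows by doubling is not known from the log; the point is only that any unchecked int size computation near 2GB can end up negative.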
What you say sounds plausible, and I think this is too bizarre a corner case, so it's probably not worth fixing. We can instead keep this note in a "Knowledge Base" somewhere in the wiki in case it eventually happens again. |
I haven't yet noticed in the serialization code anything that is looking out for loops, like a list of already visited Mentions being passed around. Perhaps a short unit test can at least show what would result if that were ever to happen. |
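As a rough idea of what passing around the already visited Mentions could look like, here is a small Java sketch. The `Node` class and `CycleSafeSerializer` are hypothetical stand-ins, not Reach's Mention class or its serialization code, and the `"ref"` output shape is invented for the example. The serializer tracks the nodes on the current recursion path and emits a short reference instead of recursing when it meets one of them again.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a mention with arguments; not Reach's Mention.
class Node {
    final String id;
    final List<Node> arguments = new ArrayList<>();
    Node(String id) { this.id = id; }
}

class CycleSafeSerializer {
    static String serialize(Node root) {
        // Identity-based set, so two distinct but equal nodes are not confused.
        Set<Node> onPath = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        StringBuilder out = new StringBuilder();
        write(root, onPath, out);
        return out.toString();
    }

    private static void write(Node node, Set<Node> onPath, StringBuilder out) {
        if (!onPath.add(node)) {
            // The node is already being serialized higher up the stack:
            // emit a reference instead of recursing forever.
            out.append("{\"ref\":\"").append(node.id).append("\"}");
            return;
        }
        out.append("{\"id\":\"").append(node.id).append("\",\"arguments\":[");
        for (int i = 0; i < node.arguments.size(); i++) {
            if (i > 0) out.append(',');
            write(node.arguments.get(i), onPath, out);
        }
        out.append("]}");
        onPath.remove(node); // leaving this node, so it may appear again elsewhere
    }
}
```

A unit test along the lines suggested above could link two nodes to each other and check that `serialize` returns. Note that tracking only the current path stops true cycles but still expands a shared node every time it is reached, so heavy sharing without any cycle could also produce very large output; never removing nodes from the set would emit each node in full only once.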
I have seen this in the past, but very infrequently... |
There are a few bugs in the indexcard output format, which I think are caused by changes made to data structures after the output format was created.
Below are all the exceptions that appear in the log after a few days of running.
Please find the log file and a couple of papers to reproduce the errors attached to the issue.
error.log
PMC4543788.nxml.txt
PMC5809884.nxml.txt
PMC6086911.nxml.txt
@MihaiSurdeanu Since this output format doesn't seem relevant today, are the errors worth fixing?