We have two xfails in the test suite that address this:
    @pytest.mark.xfail(reason="Need to check for line breaks")
    def test_line_break_shortens_doi_in_pdf(self, vault):
        """We get 10.3390/v130 instead"""
        paper = Paper.from_path(vault["10.3389.fcvm.2021.745758.pdf"])
        assert "10.3390/v13040700" in paper.dois

    @pytest.mark.xfail(reason="Line breaks again")
    def test_line_break_obscures_doi_in_pdf(self, vault):
        """This one isn't detected at all"""
        paper = Paper.from_path(vault["42-1-orig_article_Cagney.pdf"])
        assert "10.1016/S0140-6736(14)61033-3" in paper.dois
vaneseltine changed the title from "PDF hard line breaks splitting DOIs" to "PDF hard line breaks split DOIs" on Jul 26, 2024
The full DOI might be recoverable from a working embedded URL (e.g., 42-1-orig_article_Cagney.pdf includes them), which we could retrieve with more digging in the PDF.
Removing line breaks
If we're hitting the Crossref API, and we identify a too-short DOI (test_line_break_shortens_doi_in_pdf), we can try again to find a real DOI:
API check the DOI.
If it's real, great, check against RW.
If not, remove the next whitespace, concatenate to form a new DOI, and repeat once, maybe twice.
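A minimal sketch of that retry loop, not tied to the existing code: `extend_truncated_doi` and its parameters are hypothetical names, and `is_real` stands in for whatever lookup we end up using (Crossref, RW, or both). Stripping trailing `.,;` from the joined chunk is an assumption about how the DOI sits in running text.

```python
import re
from typing import Callable, Optional


def extend_truncated_doi(
    text: str,
    partial: str,
    is_real: Callable[[str], bool],
    max_joins: int = 2,
) -> Optional[str]:
    """Try to complete a DOI that a line break cut short.

    `is_real` is a placeholder for the real lookup; it is injected so the
    sketch needs no network access.
    """
    start = text.find(partial)
    if start == -1:
        return None

    candidate = partial
    rest = text[start + len(partial):]

    # The truncated DOI might already be a real one.
    if is_real(candidate):
        return candidate

    for _ in range(max_joins):
        # Drop the next run of whitespace and glue on the chunk after it.
        match = re.match(r"\s+(\S+)", rest)
        if match is None:
            return None
        candidate += match.group(1).rstrip(".,;")  # assumed sentence punctuation
        rest = rest[match.end():]
        if is_real(candidate):
            return candidate

    return None
```

With the first xfail's values, `extend_truncated_doi(page_text, "10.3390/v130", is_real)` would check the short DOI, then try it joined with the next chunk, and give up after `max_joins` attempts.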
There are also cases where the line break damages the DOI (test_line_break_obscures_doi_in_pdf) enough that the regex patterns don't catch it at all:
Search the document for doi.org and doi:
As above, read whitespace-separated chunks (within reason) until we get a DOI.
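Roughly, the marker search could look like the sketch below. Everything here is hypothetical: the function name, the 200-character window and the three-chunk limit are guesses at "within reason", and `is_real` again stands in for the actual lookup.

```python
import re
from typing import Callable

# "doi.org/" or "doi:", case-insensitive, plus any whitespace that follows.
MARKER = re.compile(r"(?:doi\.org/|doi:)\s*", re.IGNORECASE)


def dois_after_markers(
    text: str,
    is_real: Callable[[str], bool],
    max_chunks: int = 3,
    window: int = 200,
) -> list[str]:
    """Rebuild DOIs that follow a marker but were broken by whitespace."""
    found: list[str] = []
    for marker in MARKER.finditer(text):
        # Only look a short distance past the marker.
        tail = text[marker.end(): marker.end() + window]
        candidate = ""
        for chunk in tail.split()[:max_chunks]:
            candidate += chunk
            cleaned = candidate.rstrip(".,;")  # assumed sentence punctuation
            if is_real(cleaned):
                if cleaned not in found:
                    found.append(cleaned)
                break
    return found
```

Each marker costs up to `max_chunks` lookups, so the limit also caps API traffic per document.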
Oh -- first check against RW DB, then against Crossref API. RW DOIs are all currently regex valid but not all are Crossref valid. That might be a bigger problem but nevertheless.
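A sketch of that ordering, with hypothetical names throughout: `in_rw_db` and `in_crossref` are placeholder callables for the two lookups. Checking RW first means a DOI that RW lists but Crossref rejects is still confirmed, and it skips the API call whenever RW already matches.

```python
from typing import Callable, Optional


def confirm_doi(
    doi: str,
    in_rw_db: Callable[[str], bool],
    in_crossref: Callable[[str], bool],
) -> Optional[str]:
    """Return which source confirmed the DOI, or None if neither did."""
    if in_rw_db(doi):
        return "rw"  # local hit, no API call needed
    if in_crossref(doi):
        return "crossref"
    return None
```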
This is likely a large problem with PDF parsing.