PDF hard line breaks split DOIs #1

Open
vaneseltine opened this issue Jul 26, 2024 · 2 comments

Comments

@vaneseltine
Owner

vaneseltine commented Jul 26, 2024

This is likely a large problem with PDF parsing.

We have two xfails in the test suite that address this:

    @pytest.mark.xfail(reason="Need to check for line breaks")
    def test_line_break_shortens_doi_in_pdf(self, vault):
        """
        We get 10.3390/v130 instead
        """
        paper = Paper.from_path(vault["10.3389.fcvm.2021.745758.pdf"])
        assert "10.3390/v13040700" in paper.dois

    @pytest.mark.xfail(reason="Line breaks again")
    def test_line_break_obscures_doi_in_pdf(self, vault):
        """
        This one isn't detected at all
        """
        paper = Paper.from_path(vault["42-1-orig_article_Cagney.pdf"])
        assert "10.1016/S0140-6736(14)61033-3" in paper.dois
@vaneseltine vaneseltine changed the title PDF hard line breaks splitting DOIs PDF hard line breaks split DOIs Jul 26, 2024
@vaneseltine
Owner Author

A few things to consider:

Extracting embedded URLs

The full DOI might be recoverable from a working embedded URL (e.g., 42-1-orig_article_Cagney.pdf includes such URLs), which should be retrievable with more digging in the PDF.
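A rough sketch of the URL half of this idea, assuming the link targets have already been pulled out of the PDF by some other means (that extraction step is not shown here). The name `doi_from_url` and the host list are made up for illustration, not existing code in this project:

```python
from typing import Optional
from urllib.parse import unquote, urlparse

# Assumed set of resolver hosts; extend as needed.
DOI_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}


def doi_from_url(url: str) -> Optional[str]:
    """Return the DOI carried by a doi.org-style URL, or None."""
    parsed = urlparse(url.strip())
    if (parsed.hostname or "").lower() not in DOI_HOSTS:
        return None
    # The DOI is the URL path, possibly percent-encoded.
    doi = unquote(parsed.path).lstrip("/")
    # Cheap sanity check only: "10." prefix and a suffix after a slash.
    if doi.startswith("10.") and "/" in doi:
        return doi
    return None
```

Because the URL is stored as a single string in the link annotation, it should not be affected by line breaks in the visible text, which is the whole appeal of this route.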

Removing line breaks

If we're hitting the Crossref API, and we identify a too-short DOI (test_line_break_shortens_doi_in_pdf), we can try again to find a real DOI:

  • API check the DOI.
  • If it's real, great, check against RW.
  • If not, remove the next whitespace, concatenate to form a new DOI, and repeat once, maybe twice.
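The steps above could look roughly like this. It is only a sketch: `extend_doi` is a hypothetical helper, and `is_real` stands in for whatever lookup we end up using (Crossref, RW, or both), passed in as a callable so the loop can be tested offline.

```python
import re
from typing import Callable, Optional

# One run of whitespace followed by the next non-whitespace chunk.
NEXT_CHUNK = re.compile(r"\s+(\S+)")


def extend_doi(
    text: str,
    start: int,
    end: int,
    is_real: Callable[[str], bool],
    max_joins: int = 2,
) -> Optional[str]:
    """Check text[start:end]; if it is not a real DOI, glue on the next
    whitespace-separated chunk and check again, up to max_joins times."""
    candidate = text[start:end]
    pos = end
    for attempt in range(max_joins + 1):
        if is_real(candidate):
            return candidate
        if attempt == max_joins:
            break
        match = NEXT_CHUNK.match(text, pos)
        if match is None:  # nothing left to append
            break
        candidate += match.group(1)
        pos = match.end()
    return None
```

One caveat: the appended chunk may carry trailing punctuation (a period or comma after the DOI), so the candidate probably needs the same cleanup the regex matches already get before it is checked.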

There are also cases where the line break damages the DOI (test_line_break_obscures_doi_in_pdf) enough that the regex patterns don't catch it at all:

  • Search the document for doi.org and doi:
  • As above, read whitespace-separated chunks (within reason) until we get a DOI.
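A possible shape for the marker search, again as a sketch. The marker regex, the 200-character window, and the three-chunk limit are guesses at "within reason", and `find_obscured_doi` is a hypothetical name. `is_real` is the same injected lookup idea as before.

```python
import re
from typing import Callable, Iterator, Optional

# "doi.org/" or "doi:", case-insensitive, plus any whitespace that follows.
MARKER = re.compile(r"(?:doi\.org/|doi:)\s*", re.IGNORECASE)


def candidates_after_markers(
    text: str, max_chunks: int = 3, window: int = 200
) -> Iterator[str]:
    """Yield progressively longer candidates built from the chunks that
    follow each marker: chunk 1, then chunks 1+2, and so on."""
    for marker in MARKER.finditer(text):
        tail = text[marker.end():marker.end() + window]
        joined = ""
        for chunk in tail.split(None, max_chunks)[:max_chunks]:
            joined += chunk
            yield joined


def find_obscured_doi(
    text: str, is_real: Callable[[str], bool], max_chunks: int = 3
) -> Optional[str]:
    """Return the first candidate after a marker that passes is_real."""
    for candidate in candidates_after_markers(text, max_chunks):
        if is_real(candidate):
            return candidate
    return None
```

This makes up to `max_chunks` lookups per marker, so a document with a long reference list could mean a lot of API calls. Checking a local database before any remote lookup would keep that manageable.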

@vaneseltine
Owner Author

Oh -- first check against RW DB, then against Crossref API. RW DOIs are all currently regex valid but not all are Crossref valid. That might be a bigger problem but nevertheless.
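In code, the ordering might be as simple as the following. Both lookups are hypothetical callables (nothing here talks to the RW DB or Crossref), and the return labels are placeholders.

```python
from typing import Callable, Optional


def classify_doi(
    candidate: str,
    in_rw_db: Callable[[str], bool],
    in_crossref: Callable[[str], bool],
) -> Optional[str]:
    """Check the RW DB first, then Crossref; None means neither knows it."""
    if in_rw_db(candidate):
        return "rw"
    if in_crossref(candidate):
        return "crossref"
    return None
```

Checking RW first means a DOI that is in RW but not Crossref valid still gets flagged, and the Crossref call is skipped whenever RW already has the answer.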
