PDF hard line breaks split DOIs #1

vaneseltine · 2024-07-26T16:06:18Z

This is likely a large problem with PDF parsing.

We have two xfails in the test suite that address this:

    @pytest.mark.xfail(reason="Need to check for line breaks")
    def test_line_break_shortens_doi_in_pdf(self, vault):
        """
        We get 10.3390/v130 instead
        """
        paper = Paper.from_path(vault["10.3389.fcvm.2021.745758.pdf"])
        assert "10.3390/v13040700" in paper.dois

    @pytest.mark.xfail(reason="Line breaks again")
    def test_line_break_obscures_doi_in_pdf(self, vault):
        """
        This one isn't detected at all
        """
        paper = Paper.from_path(vault["42-1-orig_article_Cagney.pdf"])
        assert "10.1016/S0140-6736(14)61033-3" in paper.dois

vaneseltine · 2024-07-26T16:19:39Z

A few things to consider:

Extracting embedded URLs

The full DOI might be recoverable from a working embedded URL (e.g., 42-1-orig_article_Cagney.pdf includes them), though retrieving it would take more digging in the PDF.
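A rough sketch of the URL route, assuming the link URIs have already been pulled out of the PDF's link annotations by whatever PDF library is in use (that step is not shown here). The function name `dois_from_urls` and the host list are made up for illustration; nothing like this exists in the codebase yet.

```python
from urllib.parse import unquote, urlsplit

# Hosts treated as DOI resolvers (an assumption; extend as needed).
DOI_HOSTS = {"doi.org", "dx.doi.org", "www.doi.org"}


def dois_from_urls(urls):
    """Return the DOIs found in resolver URLs, in first-seen order."""
    found = []
    for url in urls:
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        if host not in DOI_HOSTS:
            continue
        # The DOI is the URL path, possibly percent-encoded.
        candidate = unquote(parts.path).lstrip("/")
        looks_like_doi = candidate.startswith("10.") and "/" in candidate
        if looks_like_doi and candidate not in found:
            found.append(candidate)
    return found
```

Because the URL is stored as one string in the annotation, it is not affected by line breaks in the rendered text, which is what makes this route attractive for the Cagney case.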

Removing line breaks

If we're hitting the Crossref API, and we identify a too-short DOI (test_line_break_shortens_doi_in_pdf), we can try again to find a real DOI:

1. API check the DOI.
2. If it's real, great, check against RW.
3. If not, remove the next whitespace, concatenate to form a new DOI, and repeat once, maybe twice.
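The loop above could be sketched like this. Everything here is hypothetical: `extend_truncated_doi` is an invented name, `is_real` stands in for whatever lookup we end up using (Crossref, RW, or both), and the sample text in the usage note is made up, not taken from the actual PDF.

```python
import re

# Whitespace (including the hard line break) followed by the next chunk.
NEXT_CHUNK = re.compile(r"\s+(\S+)")


def extend_truncated_doi(text, doi, doi_end, is_real, max_joins=2):
    """Try to repair a DOI that a line break cut short.

    text     -- the extracted PDF text
    doi      -- the short DOI the regex matched
    doi_end  -- index in text just past the matched DOI
    is_real  -- callable returning True for a DOI that checks out
    Returns the first candidate that checks out, else None.
    """
    candidate, pos = doi, doi_end
    for attempt in range(max_joins + 1):
        if is_real(candidate):
            return candidate
        if attempt == max_joins:
            break
        match = NEXT_CHUNK.match(text, pos)
        if match is None:
            break  # nothing left to join
        # Drop the whitespace and glue the next chunk on.
        candidate += match.group(1)
        pos = match.end()
    return None
```

With a made-up text of `"doi 10.3390/v130\n40700 and so on"` and a lookup that only accepts `10.3390/v13040700`, the first check fails, one join is made, and the second check succeeds. Trailing punctuation on the joined chunk is not handled here.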

There are also cases where the line break damages the DOI (test_line_break_obscures_doi_in_pdf) enough that the regex patterns don't catch it at all:

1. Search the document for doi.org and doi:
2. As above, read whitespace-separated chunks (within reason) until we get a DOI.
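A minimal sketch of the marker search, under the same assumptions as before: `is_real` is a stand-in for the real lookup, and `dois_after_markers`, `max_chunks`, and `window` are invented names with arbitrary defaults standing in for "within reason".

```python
import re

# The two markers mentioned above, matched case-insensitively.
MARKER = re.compile(r"doi\.org/|doi:", re.IGNORECASE)


def dois_after_markers(text, is_real, max_chunks=3, window=200):
    """Look after each marker and join chunks until a DOI checks out."""
    found = []
    for marker in MARKER.finditer(text):
        # Only look a short distance past the marker.
        tail = text[marker.end():marker.end() + window]
        candidate = ""
        for chunk in tail.split()[:max_chunks]:
            candidate += chunk  # whitespace between chunks is dropped
            if is_real(candidate):
                if candidate not in found:
                    found.append(candidate)
                break
    return found
```

For a made-up line such as `"doi: 10.1016/S0140-\n6736(14)61033-3 Published"`, the first chunk alone fails the lookup and the first two chunks joined give `10.1016/S0140-6736(14)61033-3`. Each marker costs at most `max_chunks` lookups, which matters if the lookup is an API call.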

vaneseltine · 2024-07-27T16:42:36Z

Oh -- first check against RW DB, then against Crossref API. RW DOIs are all currently regex valid but not all are Crossref valid. That might be a bigger problem but nevertheless.
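Roughly, the ordering could look like the following. This is only a sketch: `make_is_real` is an invented name, `rw_dois` is assumed to be an in-memory collection of RW DOIs, and `crossref_lookup` is a placeholder callable for the API check rather than a real client. Lowercasing relies on DOIs being case-insensitive.

```python
def make_is_real(rw_dois, crossref_lookup):
    """Build a checker that tries the RW DB first, then Crossref."""
    rw = {doi.lower() for doi in rw_dois}

    def is_real(doi):
        # An RW hit is accepted as-is, since not every RW DOI
        # is known to Crossref.
        if doi.lower() in rw:
            return True
        # Only fall back to the (slower) API for everything else.
        return bool(crossref_lookup(doi))

    return is_real
```

Checking RW first also means a hit never costs an API call.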

vaneseltine changed the title ~~PDF hard line breaks splitting DOIs~~ PDF hard line breaks split DOIs Jul 26, 2024
