Regarding the PDF layering issue #143643

Hzjboss404 · 2024-11-05T12:36:38Z

Regarding the PDF layering issue: I want to run OCR on a PDF without converting the original PDF into an image. The original PDF uses vector fonts, which stay sharp at any zoom level, but I can only select one character at a time and cannot search the text. My goal is to overlay the OCR text on the original PDF rather than rasterize it. The software in question, UMI-OCR, skips over vector fonts; what I need is a tool that recognizes vector fonts as if they were images and then overlays a text layer on the original PDF. That way, I keep the clarity of the vector fonts and also gain search functionality.

LiteBrite82 · 2024-11-05T17:52:44Z

Maintainer

Thanks for posting in the GitHub Community, @Hzjboss404!

We’ve moved your post to our Programming Help 🧑‍💻 category, which is more appropriate for this type of discussion.

Please review our guidelines about the Programming Help category for more information.

collinpksmith · 2024-11-05T17:59:07Z

Hi @Hzjboss404
Hope you are doing well.

Here is one way to approach that issue.

OCRmyPDF is a command-line tool designed specifically for adding OCR layers to PDFs without converting them into images. It preserves the original PDF content, including vector graphics and fonts, while adding a hidden text layer that enables search functionality.

Install OCRmyPDF: You can install OCRmyPDF using pip (Tesseract and Ghostscript need to be installed separately):
pip install ocrmypdf
Run OCR with OCRmyPDF: By default, OCRmyPDF renders each page to an image only for recognition and then adds the recognized text as a hidden layer to the original page, so the vector content itself is left in place.
ocrmypdf original.pdf output.pdf
If OCRmyPDF stops because a page already contains text, add exactly one of the following options. They are mutually exclusive, so passing --skip-text and --force-ocr together is rejected.
--skip-text: Skips OCR on pages that already contain text and processes only the remaining pages.
--force-ocr: Rasterizes every page and runs OCR on it, even if a text layer is detected. This does produce text for vector fonts, but the page becomes an image, so the vector clarity is lost.
Output PDF with Searchable Text Layer: The output.pdf generated by OCRmyPDF will retain the original vector clarity while adding a hidden text layer, making the document searchable and selectable.
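
If you want to script this, here is a minimal Python sketch. It assumes the `ocrmypdf` executable is on your PATH, and it relies on OCRmyPDF's documentation describing `--skip-text`, `--force-ocr` and `--redo-ocr` as alternatives of which at most one may be given. The helper names `build_ocrmypdf_command` and `run_ocr` are made up for this example and are not part of OCRmyPDF.

```python
import shutil
import subprocess
from typing import List, Optional

# Mode flags that OCRmyPDF documents as mutually exclusive (pick at most one).
MODE_FLAGS = ("--skip-text", "--force-ocr", "--redo-ocr")


def build_ocrmypdf_command(src: str, dst: str, mode: Optional[str] = None) -> List[str]:
    """Build an argv list for ocrmypdf (hypothetical helper for illustration).

    mode=None means the default behaviour; otherwise exactly one mode flag.
    """
    if mode is not None and mode not in MODE_FLAGS:
        raise ValueError(f"mode must be None or one of {MODE_FLAGS}, got {mode!r}")
    cmd = ["ocrmypdf"]
    if mode is not None:
        cmd.append(mode)
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str, mode: Optional[str] = None) -> int:
    """Run ocrmypdf and return its exit code (0 means success)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    return subprocess.run(build_ocrmypdf_command(src, dst, mode)).returncode


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually process a file.
    print(" ".join(build_ocrmypdf_command("original.pdf", "output.pdf")))
```

Building the argument list separately from running it makes it easy to check which single mode flag is being passed before any file is touched.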

I hope this helps!
If it does, please mark this answer as accepted.

Thank you
