OCR-search

Searches for text in scanned files using OCR.
If a document contains a table, the program determines the table structure and recognizes its contents.
The source document (image or PDF) may be rotated by any angle.
Files of other supported formats found in the target directory are searched as well.
Supported formats: pdf/png/jpeg/jpg/tif/doc/docx/odt/xls/xlsx/ods/txt
Works on Windows and Linux.
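
As a rough illustration of how the supported formats might be split between OCR and direct text extraction, here is a minimal sketch. The grouping, the recursive directory walk, and the names `classify` and `collect_files` are assumptions made for this example; they are not taken from search.py.

```python
from pathlib import Path

# Extensions come from the "Supported formats" line; which of them go
# through OCR is an assumption made for illustration.
OCR_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg", ".tif"}
TEXT_EXTENSIONS = {".doc", ".docx", ".odt", ".xls", ".xlsx", ".ods", ".txt"}


def classify(path):
    """Return 'ocr', 'text', or None for an unsupported file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        return "ocr"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return None


def collect_files(directory="."):
    """Group supported files under `directory` by how they would be read.

    The walk is recursive here; whether the real program descends into
    subdirectories is not stated in this README.
    """
    groups = {"ocr": [], "text": []}
    for path in sorted(Path(directory).rglob("*")):
        kind = classify(path) if path.is_file() else None
        if kind:
            groups[kind].append(path)
    return groups
```

Calling `collect_files()` with no argument scans the current directory, which matches the default described under Use.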

Use

The program is started without command-line parameters; it reads its settings from a config file.
At least one parameter must be set in search.conf: 'search', which specifies the text to search for.
If no search directory is specified, the current directory is used.
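
The rules above (a mandatory 'search' value and a fallback to the current directory) can be sketched as follows. The layout of search.conf is not described here, so the sketch assumes plain `key = value` lines. The key name `directory` and both function names are hypothetical.

```python
import os


def parse_settings(text):
    """Parse 'key = value' lines; blank lines and '#' comments are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def load_settings(text):
    """Apply the two rules from the Use section to the parsed settings."""
    settings = parse_settings(text)
    if not settings.get("search"):
        raise ValueError("search.conf must set the 'search' parameter")
    # 'directory' is a hypothetical key name; default to the current directory.
    settings["directory"] = settings.get("directory") or os.getcwd()
    return settings
```

For example, `load_settings("search = invoice")` would return the search term together with the current working directory as the place to look.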

Prerequisites

For Windows 10 (Linux equivalents are given in parentheses):

1. python 3.9 (or later)
2. tesseract-ocr-w64-setup-v5.2.0.20220712.exe
   2.1. Add the installation directory to the PATH environment variable: "D:\Tesseract" (use the directory selected during installation)
   2.2. Add the file for the preferred language (ukr.traineddata) to "D:\Tesseract\tessdata"
   (for Linux: apt-get install tesseract-ocr)
3. pip install pytesseract
4. poppler and pdf2image
   4.1. Install Poppler (release 22.12.0-0 or later) for Windows: https://github.com/oschwartz10612/poppler-windows/releases
        Extract Poppler to "D:\Poppler\poppler-22.12.0" (or any other folder) and add to PATH:
        "D:\Poppler\poppler-22.12.0\Library\bin"
   4.2. pip install pdf2image
5. pip install filetype
6. for 'XLS/XLSX' file format:
   pip install pandas
   pip install xlrd
7. for 'DOCX' file format:
   pip install docx2python
8. pip install opencv-python
9. for 'DOC' file format:
   Download and unpack antiword to c:\antiword.
   Set the following environment variables:
   ANTIWORDHOME = c:\antiword
   PATH = c:\antiword (for Linux: apt-get install antiword)
10. pip install progress
11. for 'ods/odt' file format:
    pip install odfpy
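
A quick way to confirm that the steps above took effect is to check PATH and the installed modules before the first run. This is a minimal sketch, not part of search.py. The function name is hypothetical, and the tool list assumes that `pdftoppm` (one of the Poppler utilities used by pdf2image) is a reasonable stand-in for "Poppler is on PATH".

```python
import importlib.util
import shutil

# Executables expected on PATH after the steps above.
REQUIRED_TOOLS = ["tesseract", "pdftoppm", "antiword"]

# Import names, which differ from some pip names:
# the OpenCV package imports as cv2, odfpy imports as odf.
REQUIRED_MODULES = [
    "pytesseract", "pdf2image", "filetype", "pandas", "xlrd",
    "docx2python", "cv2", "progress", "odf",
]


def missing_prerequisites(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules) as two lists of names."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = missing_prerequisites()
    print("Missing executables:", ", ".join(tools) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

Anything reported as missing points back to the corresponding step in the list; formats that are not needed (for example DOC) can simply be dropped from the lists.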

Run

On Windows, to build an exe file and run it:

pyinstaller.exe --onefile search.py
search.exe

or run the script directly:

python search.py

Example

Versioning

v3.3

Author

Svitlana Viblaia

License

This project is licensed under the MIT License.
