Searches for text in scanned files using OCR.
If a document contains a table, the program determines the table's structure and recognizes its contents.
The original document (image/PDF) may be rotated by any angle.
Files of the other supported formats are searched as well if they are present in the target directory.
Supported formats: pdf/png/jpeg/jpg/tif/doc/docx/odt/xls/xlsx/ods/txt
Works on Windows and Linux.
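The list of supported formats above can be turned into a simple extension check. The following is only a sketch, not the code used in search.py: the names `OCR_FORMATS`, `TEXT_FORMATS`, `classify` and `find_candidates` are hypothetical, and the split into "needs OCR" and "has extractable text" groups is an assumption made for illustration.

```python
from pathlib import Path

# Assumed grouping: images and PDFs go through OCR, the rest is read as text.
OCR_FORMATS = {"pdf", "png", "jpeg", "jpg", "tif"}
TEXT_FORMATS = {"doc", "docx", "odt", "xls", "xlsx", "ods", "txt"}


def classify(path):
    """Return 'ocr', 'text', or None for an unsupported extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in OCR_FORMATS:
        return "ocr"
    if ext in TEXT_FORMATS:
        return "text"
    return None


def find_candidates(directory="."):
    """List supported files directly inside `directory` (no recursion)."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and classify(p) is not None
    )
```

The extension is compared case-insensitively, so `Scan.PDF` and `scan.pdf` are treated alike. Whether the real program descends into subdirectories is not stated here, so the sketch looks at the top level only.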
The program is started without any command-line parameters; it takes its settings from a config file.
You need to set at least one parameter in search.conf: the 'search' parameter specifies what to search for.
If you do not specify a search directory, the current directory will be used.
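The rule described above (a required 'search' parameter, with the current directory as the fallback) could be expressed roughly as follows. This is a sketch under stated assumptions: the `key = value` layout, the `#` comment syntax, the key name `directory` and the function `load_settings` are all hypothetical, since the actual format of search.conf is not described here.

```python
import os


def load_settings(text):
    """Parse 'key = value' lines (assumed layout) into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    # 'search' is the one parameter that must be set.
    if not settings.get("search"):
        raise ValueError("the 'search' parameter is required")

    # 'directory' is a hypothetical key name; fall back to the current directory.
    if not settings.get("directory"):
        settings["directory"] = os.getcwd()
    return settings
```

For example, `load_settings("search = invoice\n")` yields a dict whose `search` value is `invoice` and whose `directory` value is the current working directory.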
- python 3.9 (or later)
- tesseract-ocr-w64-setup-v5.2.0.20220712.exe
2.1. PATH environment variable: "D:\Tesseract" (specify the directory selected during installation)
2.2. Add the file for the preferred language (ukr.traineddata) to "D:\Tesseract\tessdata"
(for Linux: apt-get install tesseract-ocr)
- pip install pytesseract
- poppler and pdf2image
4.1. install Poppler (release 22.12.0-0 or later) for Windows: https://github.com/oschwartz10612/poppler-windows/releases
Extract Poppler to "D:\Poppler\poppler-22.12.0" (or any other folder) and add PATH:
"D:\Poppler\poppler-22.12.0\Library\bin"
4.2. pip install pdf2image
- pip install filetype
- for 'XLS/XLSX' file format:
pip install pandas
pip install xlrd
- for 'DOCX' file format:
pip install docx2python
- pip install opencv-python
- for 'DOC' file format:
Download and unpack antiword to c:\antiword.
Set the PATH and add Environment Variable:
ANTIWORDHOME = c:\antiword
PATH = c:\antiword
(for Linux: apt-get install antiword)
- pip install progress
- for 'ods/odt' file format:
pip install odfpy
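After working through the installation steps above, it can help to confirm that everything is reachable before the first run. The sketch below is not part of search.py and the function name `missing_prerequisites` is hypothetical. It assumes the usual executable names (`tesseract`, `pdftoppm` from Poppler, `antiword`) and the usual import names of the listed packages (`cv2` for OpenCV, `odf` for odfpy).

```python
import importlib.util
import shutil

# Command-line tools that must be found through PATH.
EXECUTABLES = ("tesseract", "pdftoppm", "antiword")

# Import names of the Python packages from the list above.
MODULES = (
    "pytesseract", "pdf2image", "filetype", "pandas", "xlrd",
    "docx2python", "cv2", "progress", "odf",
)


def missing_prerequisites(executables=EXECUTABLES, modules=MODULES):
    """Return a list describing every tool or module that cannot be found."""
    missing = [
        f"executable: {name}"
        for name in executables
        if shutil.which(name) is None
    ]
    missing += [
        f"module: {name}"
        for name in modules
        if importlib.util.find_spec(name) is None
    ]
    return missing


if __name__ == "__main__":
    for item in missing_prerequisites():
        print("missing", item)
```

An empty result means every executable was found on PATH and every module can be located; it does not verify versions or that the language data file is in place.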
To generate an exe file on Windows:
pyinstaller.exe --onefile search.py
search.exe
or run:
search.py
v3.3
- Svitlana Viblaia
This project is licensed under the MIT License.