1. Synthetic Text Preparation
Sanskrit text lines can be obtained from https://sanskritdocuments.org/. We chose 66 fonts (all of which correctly follow the conjoined-character rules) to render the lines. Using many fonts adds variability to the dataset and helps the model generalize to real data.
To get the data, go to this.
conda create python=3.7 --name OCR_ENV
conda activate OCR_ENV
conda install pip
pip install -r requirements.txt
For quick access to conda-related commands, visit conda's cheat sheet.
Extract sanskritdoc.zip, present in the data_preparation/synthetic folder, to get the synthetic lines:
unzip data_preparation/synthetic/sanskritdoc.zip -d data_preparation/synthetic/
Extract fonts.zip, also present in the data_preparation/synthetic folder. It contains the 67 fonts used in our project.
To install the fonts, run the following commands:
unzip data_preparation/synthetic/fonts.zip -d data_preparation/synthetic/
mkdir -p ~/.fonts
cp data_preparation/synthetic/fonts/* ~/.fonts
sudo fc-cache -fv
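As a quick sanity check after copying the fonts, the following minimal sketch counts the font files in a directory. It is not part of the repository: the function name list_font_files and the set of file extensions are assumptions made for illustration, so adjust them if the archive contains other font formats.

```python
from pathlib import Path

# Assumption: the fonts are TrueType/OpenType files; extend this set if needed.
FONT_EXTENSIONS = {".ttf", ".otf"}


def list_font_files(font_dir):
    """Return the sorted names of font files found directly inside font_dir."""
    directory = Path(font_dir).expanduser()
    if not directory.is_dir():
        # A missing directory simply means no fonts were installed there.
        return []
    return sorted(
        path.name
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in FONT_EXTENSIONS
    )


if __name__ == "__main__":
    fonts = list_font_files("~/.fonts")
    print(f"{len(fonts)} font files found in ~/.fonts")
```

Comparing the printed count with the number of files extracted from fonts.zip is a simple way to confirm that the copy step completed.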
To allow reading of non-ASCII (UTF-8) text:
sudo update-locale LANG=en_US.UTF-8
Resources:
To render the fonts, run:
python2 prep_scripts/get_random_lines.py
The script randomly selects 5000 lines for each font from data_preparation/synthetic/sanskritdoc.txt (text available at SanskritDocuments) and saves the rendered images in their respective font folders in the line_images directory.
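The selection step described above can be sketched roughly as follows. This is an illustration only, not the code in prep_scripts/get_random_lines.py: the helper names (read_lines, sample_lines_per_font, image_path), the fixed seed, and the zero-padded file naming scheme are all assumptions, and the actual rendering of each line to an image is left out.

```python
import random
from pathlib import Path


def read_lines(text_path):
    """Read non-empty lines, decoding explicitly as UTF-8 (Devanagari is non-ASCII)."""
    with open(text_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def sample_lines_per_font(lines, font_names, lines_per_font=5000, seed=0):
    """Pick up to lines_per_font random lines per font, without repeats within a font."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selection = {}
    for font in font_names:
        # If the corpus is smaller than requested, take every line once.
        count = min(lines_per_font, len(lines))
        selection[font] = rng.sample(lines, count)
    return selection


def image_path(font_name, index, root="line_images"):
    """Hypothetical output path: one folder per font inside the root directory."""
    return Path(root) / font_name / f"{index:05d}.png"
```

A possible usage would be `sample_lines_per_font(read_lines("data_preparation/synthetic/sanskritdoc.txt"), font_names)`, where font_names is a list of installed font names, followed by rendering each selected line and saving it at `image_path(font, i)`.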