1. Synthetic Text Preparation

khadiravana-belagavi edited this page Jan 16, 2021 · 5 revisions

Synthetic Text Preparation

Obtain Synthetic text

Sanskrit text lines can be obtained from https://sanskritdocuments.org/. A set of 66 fonts (those that correctly apply the conjunct-character rules) was chosen to render the lines. Rendering with many fonts adds variability to the dataset and helps the model generalize to real data.

To get the data, go to this.

Setting Up Your Environment

conda create python=3.7 --name OCR_ENV

conda activate OCR_ENV

conda install pip

pip install -r requirements.txt

For quick access to conda-related commands, see conda's cheat sheet.

Get Sanskrit text lines

Extract sanskritdoc.zip, located in the data_preparation/synthetic folder, to get the synthetic text lines:

unzip data_preparation/synthetic/sanskritdoc.zip -d data_preparation/synthetic/

Get fonts

Extract fonts.zip, located in the data_preparation/synthetic folder. It contains the 67 fonts used in this project.

To install the fonts, run the following commands:

unzip data_preparation/synthetic/fonts.zip -d data_preparation/synthetic/

mkdir -p ~/.fonts

cp data_preparation/synthetic/fonts/* ~/.fonts

sudo fc-cache -fv
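After copying, it can be useful to confirm that every font file actually reached ~/.fonts. The following is a minimal sketch, not part of the repository: the helper name `missing_fonts` is hypothetical, and it assumes the fonts ship as .ttf or .otf files.

```python
from pathlib import Path

# Assumption: the fonts in fonts.zip are TrueType/OpenType files.
FONT_SUFFIXES = {".ttf", ".otf"}


def missing_fonts(source_dir, installed_dir):
    """Return sorted names of font files in source_dir that are absent from installed_dir."""
    source = {
        p.name
        for p in Path(source_dir).iterdir()
        if p.suffix.lower() in FONT_SUFFIXES
    }
    target = Path(installed_dir)
    # A missing target directory simply means nothing is installed yet.
    installed = {p.name for p in target.iterdir()} if target.is_dir() else set()
    return sorted(source - installed)
```

For example, `missing_fonts("data_preparation/synthetic/fonts", Path.home() / ".fonts")` should return an empty list once the copy step has succeeded.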

To allow non-ASCII (UTF-8) text to be read correctly:

sudo update-locale LANG=en_US.UTF-8
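Whether the locale change took effect can be checked from Python. This is a small sketch under stated assumptions: the function names `locale_is_utf8` and `read_lines_utf8` are hypothetical and do not come from the project's scripts. Passing `encoding="utf-8"` explicitly when opening the text file also makes reading Devanagari independent of the system locale.

```python
import locale


def locale_is_utf8():
    """Return True if the locale's preferred encoding is UTF-8."""
    name = locale.getpreferredencoding(False)
    return name.lower().replace("-", "") == "utf8"


def read_lines_utf8(path):
    """Read the non-empty lines of a text file, decoding explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```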

Resources:

Top 50 Hindi fonts

Google fonts

Shobhika font

Font rendering for Synthetic data

To render the fonts, run:

python2 prep_scripts/get_random_lines.py

The script randomly selects 5000 lines for each font from data_preparation/synthetic/sanskritdoc.txt (text available at SanskritDocuments) and saves the rendered images in a separate folder per font inside the line_images directory.
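The per-font selection step can be pictured roughly as follows. This is an illustrative sketch only, not the code in get_random_lines.py: the function names, the sampling-without-replacement choice, and the `<index>.png` file naming are all assumptions, and the actual image rendering is left out.

```python
import random
from pathlib import Path


def sample_lines_per_font(lines, font_names, per_font=5000, seed=None):
    """Pick up to per_font distinct lines for each font (assumed: without replacement)."""
    rng = random.Random(seed)
    count = min(per_font, len(lines))
    return {font: rng.sample(lines, count) for font in font_names}


def image_path(font_name, index, root="line_images"):
    """Hypothetical output layout: line_images/<font name>/<index>.png."""
    return Path(root) / font_name / f"{index}.png"
```

Each font draws its own independent sample, so the same text line may be rendered in several fonts, which is consistent with the goal of font variability described earlier.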