Slight modification in Bodhi for incorporating a few unique characters in Drenjongke #54

bloodgroup-cplusplus · 2023-11-02T19:25:21Z

All the training data present in Bodhi and Dzongkha very much applies to Drenjongke, with the exception of the two points mentioned below.
1) Since the size of our corpus is not large, we could have typed all the data, but we opted for using the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors due to the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (Bodhi) does not have the sound [bj].
2) For tokenization, space was set as a delimiter. Drenjongke script is marked by a syllable marker called “tsheg” ་, and has a space between potential morpheme or word boundaries. The use of space in the orthography is specific to Drenjongke, as other Tibetan languages do not utilize spacing in a sentence.
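
To make the two points concrete, here is a minimal Python sketch. It assumes the “tsha-lag” marker is the code point U+0F39 (TIBETAN MARK TSA -PHRU) and the “tsheg” is U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG). The function names and the sample string are hypothetical and only for illustration; the sample is not taken from our corpus.

```python
# Assumed code points for the two markers discussed above.
TSHEG = "\u0f0b"     # TIBETAN MARK INTERSYLLABIC TSHEG (syllable marker)
TSHA_LAG = "\u0f39"  # TIBETAN MARK TSA -PHRU (assumed to be the "tsha-lag")


def tokenize(sentence):
    """Split a sentence into tokens, using whitespace as the delimiter."""
    return sentence.split()


def syllables(token):
    """Split one token into syllables on the tsheg, dropping empty pieces."""
    return [part for part in token.split(TSHEG) if part]


def has_tsha_lag(text):
    """Return True if the text contains the tsha-lag marker."""
    return TSHA_LAG in text


# Illustrative string only: two space-separated tokens, two syllables each.
sample = "བཀྲ་ཤིས་ བདེ་ལེགས་"
tokens = tokenize(sample)
first_token_syllables = syllables(tokens[0])
```

A check like `has_tsha_lag` could be run over the ground-truth text to find the lines where the OCR output is most likely to go wrong.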

Since these two issues are minor, we decided not to train Drenjongke from scratch and instead, as a first step, to add the required character to solve problem 1.
Also, in the previous issue, amitdo (https://github.com/amitdo) mentioned that the training should be done on our side. Since our expertise lies in language and not in programming, our understanding is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Is that right, or is it done some other way? Any help would be highly appreciated.
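
As a rough way to confirm which characters would need to be added for problem 1, the sketch below lists the characters in a corpus that are absent from a known character set. The helper `missing_characters` and the `known` set are hypothetical stand-ins; how the actual character set is read out of the existing Bodhi model is not shown here.

```python
from collections import Counter


def missing_characters(corpus_text, known_chars):
    """Count the non-space characters in corpus_text that are not in known_chars."""
    counts = Counter(ch for ch in corpus_text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if ch not in known_chars}


# Hypothetical stand-in for the characters the base model already knows.
known = set("བཀྲཤིསདེལག་")

# Illustrative text ending in a letter followed by U+0F39 (assumed tsha-lag).
corpus = "བཀྲ་ཤིས་ བ" + "\u0f39"

missing = missing_characters(corpus, known)
for ch, n in missing.items():
    print(f"U+{ord(ch):04X} occurs {n} time(s)")
```

If the only character reported is U+0F39, that would support the plan of adding just that one character rather than retraining everything.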
