All the training data present for Bodhi (Tibetan) and Dzongkha very much applies to Drenjongke, with the exception of the two points mentioned below:
1) Since the size of our corpus is not large, we could have typed all the data, but we opted for using the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors due to the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (Bodhi) does not have the sound [bj].
2) For tokenization, space was set as a delimiter. Drenjongke script is marked by a syllable marker called “tsheg” ་, and has a space between potential morpheme or word boundaries. The use of space in the orthography is specific to Drenjongke, as other Tibetan languages do not utilize spacing in a sentence.
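To make point 2 concrete, here is a minimal sketch of the tokenization we have in mind: split on spaces to get word-like units, then on the tsheg to get syllables. The function name `tokenize` is only illustrative, and we are assuming the tsheg is the Unicode character U+0F0B; this is not how Tesseract itself segments text.

```python
TSHEG = "\u0f0b"  # assumed code point for the tsheg syllable marker


def tokenize(sentence):
    """Split a sentence into space-delimited tokens, each a list of syllables."""
    tokens = sentence.split()  # space marks potential morpheme/word boundaries
    # Within each token, the tsheg separates syllables; drop empty pieces
    # left by a trailing tsheg.
    return [[syl for syl in tok.split(TSHEG) if syl] for tok in tokens]
```

For other Tibetan-script languages that do not use spaces, the first step would return the whole sentence as a single token, which is why this difference matters for us.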
Since these two are minor issues, we decided not to train Drenjongke from scratch. Instead, as a first step, we would like to add the character required to solve problem 1.
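As a rough way to see how much of our text is affected by problem 1, something like the sketch below could list the lines that contain the tsha-lag marker. We are assuming the marker corresponds to Unicode U+0F39 (TIBETAN MARK TSA -PHRU); the helper name `lines_with_tsha_lag` is hypothetical and not part of any Tesseract tooling.

```python
TSHA_LAG = "\u0f39"  # assumed code point for the tsha-lag marker


def lines_with_tsha_lag(lines):
    """Return (line_number, line) pairs for lines containing the tsha-lag marker."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if TSHA_LAG in line
    ]
```

Running this over the ground-truth text and over the OCR output would show which lines to compare when checking whether the marker is recognized.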
Also, on the previous issue, amitdo (https://github.com/amitdo) mentioned that training should be done on our side. Since our expertise lies in language and not in programming, our understanding is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Is that correct, or is it done some other way? Any help would be highly appreciated.