Treat U_ARABIC_NUMBER as LTR #2270

Shreeshrii · 2019-02-26T10:02:50Z

https://www.w3.org/International/articles/inline-bidi-markup/uba-basics#numbers

Numbers in RTL scripts run left-to-right within the right-to-left flow.
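To illustrate the rule above, here is a minimal Python sketch (not Tesseract code). It uses the standard `unicodedata` module to look up bidi classes, where `AN` is the class that ICU and Tesseract call U_ARABIC_NUMBER. The helper `visual_order_rtl` is a hypothetical name and a deliberate simplification of the Unicode Bidirectional Algorithm: it assumes a right-to-left line containing only letters, digits and spaces, and it ignores separators such as decimal points.

```python
import unicodedata
from itertools import groupby

# Bidi classes treated as numbers: AN (Arabic number) and EN (European number).
NUMBER_CLASSES = {"AN", "EN"}


def is_number(ch):
    """Return True if the character's bidi class is a number class."""
    return unicodedata.bidirectional(ch) in NUMBER_CLASSES


def visual_order_rtl(logical):
    """Simplified display order for an RTL line (hypothetical helper).

    Runs of digits keep their logical, left-to-right order;
    every other run is reversed, and the runs themselves are
    laid out from right to left.
    """
    runs = [(num, "".join(chars)) for num, chars in groupby(logical, key=is_number)]
    pieces = []
    for num, text in reversed(runs):
        pieces.append(text if num else text[::-1])
    return "".join(pieces)


if __name__ == "__main__":
    # ARABIC-INDIC DIGIT ONE, DIGIT ONE, ARABIC LETTER ALEF, HEBREW LETTER ALEF
    for ch in ("\u0661", "1", "\u0627", "\u05d0"):
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
    # Letter, space, three Arabic-Indic digits, space, letter (logical order).
    print(visual_order_rtl("\u0635 \u0661\u0662\u0663 \u0628"))
```

The digit run comes out in the same order it went in, while the letters around it swap sides. That is the behaviour the PR title refers to.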

See the earlier discussion in the closed PR #2266.

The change to ccutil/unicharset.cpp from the earlier PR is NOT included, because U_ARABIC_NUMBER is used there as part of the condition that determines whether right-to-left scripts are significant in the unicharset.

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccutil/unicharset.cpp#L956
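A rough Python sketch of why that condition matters, based on my reading of the linked function and not a copy of it: the unicharset is considered mostly right-to-left when characters with an RTL direction outnumber those with an LTR direction, and Arabic numbers count on the RTL side. The function name, the `count_arabic_numbers` flag and the sample unicharsets are hypothetical and exist only to show the effect of dropping U_ARABIC_NUMBER from the count.

```python
import unicodedata


def major_right_to_left(unichars, count_arabic_numbers=True):
    """Return True if RTL entries outnumber LTR entries (illustrative only).

    Bidi classes R and AL are always counted as RTL. AN (U_ARABIC_NUMBER)
    is counted as RTL only when count_arabic_numbers is True.
    """
    rtl_classes = {"R", "AL"}
    if count_arabic_numbers:
        rtl_classes.add("AN")
    ltr_count = 0
    rtl_count = 0
    for unichar in unichars:
        if not unichar:
            continue
        # Classify each entry by its first code point.
        bidi = unicodedata.bidirectional(unichar[0])
        if bidi == "L":
            ltr_count += 1
        elif bidi in rtl_classes:
            rtl_count += 1
    return rtl_count > ltr_count


# Hypothetical unicharset: three Arabic-Indic digits, one Arabic letter, two Latin letters.
sample = ["\u0661", "\u0662", "\u0663", "\u0627", "a", "b"]
print(major_right_to_left(sample))
print(major_right_to_left(sample, count_arabic_numbers=False))
```

With the digits counted, this sample is classified as right-to-left. Without them, the two Latin letters outweigh the single Arabic letter and the result flips, which is the kind of side effect that leaving unicharset.cpp unchanged avoids.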

Shreeshrii · 2019-02-28T08:25:52Z

@zdenop Please merge this.

Test Case:

Image: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.png

Ground Truth: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.testdeco.gt.txt

Traineddata (Replace a layer training): https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara-Amiri-layer.traineddata

OCRed text - psm 6: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.txt

For comparison, results with tessdata_best and tessdata_fast are linked below.
Those traineddatas only recognize zero and one as numerals in Arabic script - ١ ٠
They also do not recognize the Arabic punctuation marks, decorative parentheses and commonly used ligature that were requested in "Add Indic numerals and missing punctuation to Arabic" (tesseract-ocr/langdata#131).

OCR with tessdata_best: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer-best.txt

OCR with tessdata_fast: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer-fast.txt

According to @amitdo, the text files display correctly in RTL format in Firefox.

I have tested numbers written with Arabic-script numerals at the beginning, middle and end of lines. As seen in the earlier results, a TOC with numbers at the end of lines was already displaying correctly. This sample shows correct recognition and no reversal for numbers in the middle and at the end of lines.

Where a line starts with a number, some digits are dropped. I had also noticed this with a Hebrew example in a different issue; see #2263 (comment).
In the second line of that Hebrew example, the line begins with 28 but only 8 is recognized. This should probably be opened as a separate issue.

Shreeshrii · 2019-02-28T08:27:41Z

FYI - see #2266 for earlier version of PR and related discussion.

Treat U_ARABIC_NUMBER as LTR

25b02bf

zdenop merged commit d7ddc4c into tesseract-ocr:master Feb 28, 2019

Shreeshrii mentioned this pull request Feb 28, 2019

Numbers in Arabic script are getting reversed #2263

Closed

Shreeshrii deleted the U_ARABIC_NUMBER branch March 1, 2019 13:19

amitdo added the RTL label Mar 18, 2021
