Treat U_ARABIC_NUMBER as LTR #2270

Merged 1 commit on Feb 28, 2019

Shreeshrii

https://www.w3.org/International/articles/inline-bidi-markup/uba-basics#numbers

Numbers in RTL scripts run left-to-right within the right-to-left flow.
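As a rough illustration of the rule above, here is a minimal Python sketch using only the standard `unicodedata` module. It is not Tesseract code and not the full Unicode Bidirectional Algorithm: `visual_order` is a hypothetical helper that assumes a pure RTL line with no neutrals, embeddings, or separators, and simply keeps each digit run left-to-right while the surrounding text is reversed.

```python
import unicodedata


def is_number(ch):
    """True for characters whose Unicode bidi class is EN or AN."""
    # EN = European Number (e.g. '1'), AN = Arabic Number (e.g. U+0661).
    return unicodedata.bidirectional(ch) in ("EN", "AN")


def visual_order(logical):
    """Simplified left-to-right display order for an RTL line.

    Assumption: the whole line is RTL text plus digits. The line is
    reversed, then each digit run is reversed back so that numbers
    still read left-to-right inside the right-to-left flow.
    """
    reversed_chars = list(reversed(logical))
    out = []
    i = 0
    while i < len(reversed_chars):
        if is_number(reversed_chars[i]):
            j = i
            while j < len(reversed_chars) and is_number(reversed_chars[j]):
                j += 1
            # Restore the original (left-to-right) order of the digit run.
            out.extend(reversed(reversed_chars[i:j]))
            i = j
        else:
            out.append(reversed_chars[i])
            i += 1
    return "".join(out)


# Bidi classes of an Arabic-Indic digit, a European digit, and an Arabic letter.
print(unicodedata.bidirectional("\u0661"))  # Arabic-Indic digit one
print(unicodedata.bidirectional("1"))
print(unicodedata.bidirectional("\u0628"))  # Arabic letter beh
```

Escapes are used instead of literal RTL characters so the source itself is not reordered by the editor or browser displaying it.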

See earlier discussion in the closed PR - #2266

The change to ccutil/unicharset.cpp from the earlier PR is NOT included, because U_ARABIC_NUMBER is used there as part of the condition that determines whether right-to-left scripts are significant in the unicharset.

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccutil/unicharset.cpp#L956
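To show why that condition is sensitive to how Arabic numbers are classified, here is a hedged Python sketch. It is not the code at the link above: `majority_rtl` and its `count_arabic_number` flag are hypothetical, and the sketch assumes the check amounts to comparing a count of RTL characters against a count of LTR characters, using Unicode bidi classes from `unicodedata` as a stand-in for the unicharset's direction values.

```python
import unicodedata

# Bidi classes treated as right-to-left in this sketch (assumption).
RTL_CLASSES = {"R", "AL"}


def majority_rtl(chars, count_arabic_number=True):
    """Return True if RTL characters outnumber LTR characters.

    count_arabic_number is a hypothetical switch: when True, characters
    of bidi class AN are counted on the RTL side of the comparison.
    """
    ltr = 0
    rtl = 0
    for ch in chars:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            ltr += 1
        elif cls in RTL_CLASSES or (count_arabic_number and cls == "AN"):
            rtl += 1
    return rtl > ltr


# Two Arabic letters, three Arabic-Indic digits, four Latin letters.
sample = "\u0627\u0628" + "\u0660\u0661\u0662" + "abcd"
print(majority_rtl(sample, count_arabic_number=True))
print(majority_rtl(sample, count_arabic_number=False))
```

Under these assumptions, dropping Arabic numbers from the RTL count can flip the result for a character set that is mostly digits and Latin letters, which is consistent with leaving that file untouched in this PR.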

Shreeshrii commented Feb 28, 2019

@zdenop Please merge this.

Test Case:

Image: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.png

Ground Truth: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.testdeco.gt.txt

Traineddata (from replace-a-layer training): https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara-Amiri-layer.traineddata

OCRed text - psm 6: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer.txt

For comparison, results with tessdata_best and tessdata_fast are linked below.
Those traineddata files recognize only zero and one among the numerals in Arabic script: ١ ٠
They also do not recognize the Arabic punctuation marks, decorative parentheses, and commonly used ligature requested in tesseract-ocr/langdata#131 ("Add Indic numerals and missing punctuation to Arabic").

OCR with tessdata_best: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer-best.txt

OCR with tessdata_fast: https://github.com/Shreeshrii/tessdata_arabic/blob/master/ara.Amiri.exp0-ara-Amiri-layer-fast.txt

According to @amitdo the text files display correctly in RTL format in Firefox.

I have tested numerals in Arabic script at the beginning, middle, and end of lines. As seen in earlier results, a table of contents (TOC) with numbers at the end of lines was already displaying correctly. This sample shows correct recognition, with no reversal, for numbers at the end and in the middle of lines.

Where a line starts with a number, some digits are dropped. I had also noticed this in a different issue with a Hebrew example; see #2263 (comment).
In the second line of that Hebrew example, the line begins with 28 but only 8 is recognized. This should probably be opened as a separate issue.

Shreeshrii commented

FYI - see #2266 for earlier version of PR and related discussion.

@zdenop zdenop merged commit d7ddc4c into tesseract-ocr:master Feb 28, 2019
@Shreeshrii Shreeshrii deleted the U_ARABIC_NUMBER branch March 1, 2019 13:19
@amitdo amitdo added the RTL label Mar 18, 2021