Arabic training data has room for improvement #2047

mustafa0x · 2018-11-09T11:47:40Z

I just installed Tesseract 4.0.0 (see: Environment), OCR'ed a sample document using ara and arab-script, and found that the ara-amiri-3000 training data (created by @Shreeshrii) continues to be superior, which is consistent with my more rigorous testing in the beta 1 days (see: tesseract-ocr/tessdata_best#11 (comment)).

I also noticed that the Arabic comma is still not recognized (it is by ara-amiri-3000), and nor are the Indic numerals (most of the time). I created an issue to add these to the Arabic langdata, but haven't yet received a response (tesseract-ocr/langdata#131).
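One way to make this kind of report reproducible is to compare the OCR output against a known ground-truth string and list which of the characters in question never appear. The following is only a minimal sketch: the function names and the choice of watched characters (the Arabic comma U+060C and the Arabic-Indic digits U+0660 to U+0669) are assumptions made for illustration, not part of Tesseract.

```python
import unicodedata

# Characters this sketch watches for: the Arabic comma and the ten
# Arabic-Indic digits (U+0660..U+0669). This list is an assumption.
ARABIC_COMMA = "\u060C"
ARABIC_INDIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]


def missing_characters(ground_truth, ocr_output,
                       watched=(ARABIC_COMMA, *ARABIC_INDIC_DIGITS)):
    """Return watched characters present in the ground truth but absent from the OCR output."""
    return [ch for ch in watched if ch in ground_truth and ch not in ocr_output]


def describe(chars):
    """Label each character with its code point and Unicode name for a readable report."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in chars]
```

Running `describe(missing_characters(truth, output))` on the text from each traineddata file gives a short list that can be pasted into an issue.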

Related: It's possible that using the Scheherazade font for training will give better results, as it's very similar to Lotus Light — the font used by most printed Arabic books.

Thanks!

Environment

~# lsb_release -a
Description: Ubuntu 18.04.1 LTS
~# add-apt-repository ppa:alex-p/tesseract-ocr
~# apt update
~# apt install tesseract-ocr tesseract-ocr-ara tesseract-ocr-script-arab
~# tesseract --version
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0


AndreAhmed · 2018-11-22T16:13:28Z

@mustafa0x Can you upload the Amiri dataset?

amitdo · 2018-11-22T16:33:44Z

tesseract-ocr/tessdata_best#11
https://github.com/Shreeshrii/tessdata_shreetest/tree/72294b8476b9

Shreeshrii · 2018-11-22T18:05:47Z

It seems like I messed up and overwrote the files in that repository in a way that lost the old files. If someone has a copy of the traineddata, please attach to this thread and I will upload again. thanks.


mustafa0x · 2018-11-22T19:14:44Z

@Shreeshrii: @amitdo was able to very-impressively pull it up.

Shreeshrii · 2018-11-23T16:42:15Z

Thanks @amitdo. I have now re-uploaded (there might have been a git way to do so, but it was easier for me to download and then upload). Glad that the fine-tuned traineddata files are helpful for people.


AndreAhmed · 2018-11-23T16:51:27Z

Your dataset doesn't work when numbers and Arabic text are combined.

Shreeshrii · 2018-11-23T17:59:48Z

Please send a small sample Arabic text with proper usage of digits, text and other punctuation and I can try to fine-tune with it.


AndreAhmed · 2018-11-23T19:27:59Z

@amitdo Do you have a dataset with Arabic numbers and Arabic text that @Shreeshrii can use?
I need that urgently.

mustafa0x · 2018-11-23T19:41:14Z

@Shreeshrii This is nearly 8k characters, 100 of which are Arabic numerals.

sample-arabic.txt

amitdo · 2018-11-23T19:41:54Z

No, I don't have such a dataset.

AndreAhmed · 2018-11-24T16:36:03Z

@Shreeshrii Have you had any success? I think it's very complicated; Arabic text with numbers doesn't work in that engine.

AndreAhmed · 2018-11-26T08:31:55Z

@Shreeshrii any progress?

mustafa0x · 2018-11-26T08:50:24Z

@AndreAhmed It's the weekend, it's the holidays, and this is open source work. Please wait until Shreeshrii has some free time to spend on this.

jaddoughman · 2018-11-28T18:29:20Z

@AndreAhmed @Shreeshrii I happen to have a large dataset of Arabic text that combines Arabic numbers and text. I'm having issues fine-tuning the original _best model. I will attach the dataset below (~4000 lines). If you can attempt to fine-tune using it, that would be great.
train.zip

The generated .box files might be an issue.
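A quick sanity check can rule out malformed box files before fine-tuning. The sketch below assumes the plain character-level box format, one entry per line as `symbol left bottom right top page`. The function name is hypothetical, and lines in the `WordStr` variant would be flagged by this check even though they are valid for that format.

```python
def check_box_lines(lines):
    """Return (line_number, reason) pairs for lines that do not look like box entries."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split from the right so a symbol that is itself a space or tab survives.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            problems.append((number, "expected a symbol plus 5 numeric fields"))
            continue
        fields = parts[1:]
        try:
            left, bottom, right, top, _page = (int(f) for f in fields)
        except ValueError:
            problems.append((number, "non-integer coordinate or page field"))
            continue
        # Box coordinates use a bottom-left origin, so top should not be below bottom.
        if right < left or top < bottom:
            problems.append((number, "box has negative width or height"))
    return problems
```

For a file on disk, `check_box_lines(open(path, encoding="utf-8"))` returns an empty list when every line passes these checks.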

AndreAhmed · 2018-11-28T20:51:36Z

@Shreeshrii @jaddoughman I think there is a major problem with Arabic text that includes numbers. I'm not sure whether it's a fine-tuning issue or a recognition problem in the OCR engine itself.

Shreeshrii · 2018-11-28T21:40:45Z

That is my suspicion also. However we can not know for sure until @theraysmith responds.

From what I understand, while Arabic is written as RTL, Arabic numbers are written as LTR.

The training works by treating all text, regardless of language direction, as LTR. There is supposed to be a higher-level function which changes the language direction.
It is possible that there is some problem with that, especially when a number is written with punctuation.

The above is pure speculation on my part.
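To illustrate the direction issue described above, here is a deliberately naive sketch of converting a right-to-left line from logical order to left-to-right visual order while keeping digit runs in reading order. It is not the Unicode bidirectional algorithm and not Tesseract's implementation; the function name and the digit ranges are assumptions chosen for the example.

```python
import re

# Runs of ASCII, Arabic-Indic, or Extended Arabic-Indic digits.
DIGIT_RUN = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]+")


def logical_to_visual_rtl(line):
    """Reverse an RTL line for left-to-right display, keeping digit runs in reading order."""
    reversed_line = line[::-1]
    # The whole-line reversal also reversed every number, so flip each digit run back.
    return DIGIT_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)
```

Because only unbroken digit runs are flipped back, this naive version turns `12.5` into `5.12`. That is one concrete way a number written with punctuation can come out wrong when the reordering step is too simple.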

It is also possible that the problem exists only in github tesseract and not in Google's version.

It will help if someone can create a minimal 2-3 line test case and test it with all varieties of traineddata, with --oem 1 and --oem 0, and also with version 3.05 with cube, to see whether the problem exists in all cases.

Then it will be easier for @jbredien to check whether the same test case works in Google's tesseract.
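The test matrix suggested above could be generated with a small script like the following. This is a sketch under assumptions: the image name, the output naming scheme and the default `--psm 6` are hypothetical, and the language names have to match traineddata files that are actually installed. As far as I know, `--oem 0` only works with traineddata that still contains the legacy engine, so some combinations may simply fail.

```python
from itertools import product


def build_test_matrix(image, languages, oem_modes=(0, 1), psm=6):
    """Return one tesseract argument list per (language, OEM) combination."""
    commands = []
    for lang, oem in product(languages, oem_modes):
        # The output base encodes the combination so results can be compared side by side.
        out_base = f"out_{lang.replace('/', '_')}_oem{oem}"
        commands.append([
            "tesseract", image, out_base,
            "-l", lang, "--oem", str(oem), "--psm", str(psm),
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=False)`, for example with `build_test_matrix("test.png", ["ara", "ara-amiri-3000", "script/Arabic"])`, and the resulting text files compared against the expected lines.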

AndreAhmed · 2018-11-28T21:59:00Z

I tested a simple three-line text with a lot of traineddata files. It never works; the numbers are not recognized at all. @Shreeshrii

AndreAhmed · 2018-11-29T14:13:44Z

I debugged it. It can't detect numbers at all; that's one problem. Another problem is the combination of Arabic letters with numbers: the engine ignores the numbers and doesn't detect them, because it doesn't recognize the numbers in the first place. @Shreeshrii @theraysmith

MariamHijazi · 2018-12-02T06:16:40Z

I'm trying to generate a dataset of Arabic text with Arabic-Indic numbers using jTessBoxEditor, but I guess jTessBoxEditor does not support the Arabic-Indic numbers (۰ - ۱ - ۲ - ۳ - ٤ - ٥ - ٦ - ٧ - ۸ - ۹), Unicode U+0660 to U+0669. Is there any dataset that supports Arabic text with Arabic-Indic numbers? Can you add Arabic-Indic numbers to jTessBoxEditor?
Any suggestions for this issue?
@Shreeshrii
Kind Regards.
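When preparing such a dataset it can help to check which Unicode block each digit really comes from, because the Arabic-Indic digits (U+0660 to U+0669) and the Extended Arabic-Indic digits (U+06F0 to U+06F9) look alike for several values. The helper names below are hypothetical, and normalizing everything to U+0660 to U+0669 is an assumption about what the training text should contain.

```python
ARABIC_INDIC_ZERO = 0x0660  # U+0660..U+0669
EXTENDED_ZERO = 0x06F0      # U+06F0..U+06F9 (forms used for Persian and Urdu)


def digit_block(ch):
    """Name the digit block a single character belongs to, or 'other'."""
    cp = ord(ch)
    if 0x0030 <= cp <= 0x0039:
        return "ascii"
    if ARABIC_INDIC_ZERO <= cp <= ARABIC_INDIC_ZERO + 9:
        return "arabic-indic"
    if EXTENDED_ZERO <= cp <= EXTENDED_ZERO + 9:
        return "extended-arabic-indic"
    return "other"


def to_arabic_indic(text):
    """Rewrite Extended Arabic-Indic digits as their U+0660..U+0669 counterparts."""
    return "".join(
        chr(ord(ch) - EXTENDED_ZERO + ARABIC_INDIC_ZERO)
        if digit_block(ch) == "extended-arabic-indic" else ch
        for ch in text
    )
```

Applying `to_arabic_indic` to the training text before generating box files keeps the character set consistent with the range named above.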

jagdishgg · 2019-01-15T09:57:25Z

I have tried ara.traineddata, ara-amiri-3000.traineddata and Arabic.traineddata with tesseract v4.0.0.20181030 on Windows. My data contains Arabic plus English text and English numerals. All of the traineddata files give incorrect results. Also, when I OCR five images (ID cards) of the same format, I get the results in different layouts. For example, four images give the person's name on line 9, but the fifth gives it on line 10. How do we control that?
Commands used:
tesseract "F:\Temp\3\1.jpeg" "F:\Temp\3\out1.txt" -l eng -c tessedit_char_whitelist=01234ABCDE --oem 1 --psm 11
tesseract "F:\Temp\3\1.jpeg" "F:\Temp\3\out1.txt" -l script/arabic --psm 6
tesseract "F:\Temp\3\2.jpeg" "F:\Temp\3\out2.txt" -l script/arabic --psm 11

Shreeshrii · 2019-02-23T19:34:14Z

Please see tesseract-ocr/tesseract#2263 (comment)
and test if the traineddata files linked there add all the required characters.

stweil added the accuracy label Nov 16, 2018

Shreeshrii mentioned this issue Feb 25, 2019

DO NOT MERGE: Fix reversal of numerals in Arabic script #2266

Closed

amitdo added the RTL label Mar 18, 2021

avidseeker mentioned this issue Aug 6, 2024

Issues when using Arabic language tesseract-ocr/tessdata#110

Open

benoit-pierre mentioned this issue Sep 8, 2024

App Crash when highlight a word on Arabic pdf koreader/koreader#12478

Open
