
Add Wynn, Eth, and Ash to Middle English script so it can also be used for Old English (Latin) #298

Open
grantbarrett opened this issue Sep 2, 2022 · 1 comment

Comments

@grantbarrett

One of the holes in Tesseract's ability to do quality OCR on historical texts is that it's missing just three characters, which prevents it from reasonably handling Old English texts written in Latin characters.

If you compare the characters available in the Tesseract trained data for Middle English with the character set of Old English written in the Latin alphabet, you'll see that "Æ æ" (ash), "Ð ð" (eth), and "Ƿ ƿ" (wynn) are missing.

https://github.com/tesseract-ocr/langdata_lstm/blob/main/enm/enm.unicharset
https://en.wikipedia.org/wiki/Old_English_Latin_alphabet

https://en.wikipedia.org/wiki/Eth
https://en.wikipedia.org/wiki/Wynn
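The comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of Tesseract: the function names `parse_unicharset` and `missing_chars` are made up for this example, and it assumes the usual unicharset layout (an entry count on the first line, then one entry per line whose first whitespace-separated field is the character itself).

```python
# Characters this issue reports as missing from the Middle English model.
OLD_ENGLISH_EXTRAS = "ÆæÐðǷƿ"


def parse_unicharset(text):
    """Return the set of unichars listed in the text of a unicharset file.

    Assumption: the first line holds the entry count and every later
    non-empty line starts with the unichar, followed by its properties.
    """
    entries = set()
    for line in text.splitlines()[1:]:
        fields = line.split()
        if fields:
            entries.add(fields[0])
    return entries


def missing_chars(unicharset_text, wanted=OLD_ENGLISH_EXTRAS):
    """List the wanted characters that the unicharset does not contain."""
    present = parse_unicharset(unicharset_text)
    return [ch for ch in wanted if ch not in present]
```

To try it, download `enm.unicharset` from the langdata_lstm link above, read it as UTF-8, and pass its contents to `missing_chars`.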

Admittedly, these three will require quite a bit of training to distinguish them from the AE digraph, D d, and P p, respectively, but of course, that's what we do here!

As you can see on their respective Wikipedia pages, we may already have trained data for ash and eth in other languages (Danish and Norwegian, and Icelandic, Faroese, and Khmer, respectively), but there are other letter forms that may need to be accounted for, especially for wynn.

If we were able to make these changes, we could rename the Middle English trained data to indicate that it covers both Middle and Old English (Latin). That would differentiate it clearly from the "enm" three-letter code, especially for those who associate "Old English" primarily with blackletter script, which this trained data would not be suitable for. (Blackletter OCR can be handled by the tools at https://emop.tamu.edu/, although they are for older versions of Tesseract.)

@stweil
Contributor

stweil commented Nov 25, 2022

It is possible to enhance the existing model with those additional glyphs. The original training was done with artificial training data, but I think that you will get better results with transcribed scans from historic books.
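If transcribed scans are used as training data, it may be worth checking first that the transcriptions actually contain the new glyphs often enough to learn from. The following is a minimal sketch under that assumption; the helper name `glyph_counts` is hypothetical, and how many samples are enough is not something this thread specifies.

```python
# Glyphs requested in this issue: ash, eth, and wynn in both cases.
TARGET_GLYPHS = "ÆæÐðǷƿ"


def glyph_counts(lines, targets=TARGET_GLYPHS):
    """Count how often each target glyph occurs in the transcription lines."""
    counts = {ch: 0 for ch in targets}
    for line in lines:
        for ch in line:
            if ch in counts:
                counts[ch] += 1
    return counts
```

Glyphs that come back with a count of zero, or close to it, would point to lines that still need to be transcribed before training.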
