Repair boundingbox of individual characters of textangle 90 text #3599

Closed
wants to merge 0 commits into from

Conversation

@rmast rmast commented Oct 18, 2021

Solution to issue #3590 (makebox doesn't output horizontal coordinates of textangle 90 content).

I traced these lines back to 2010; no one has touched them since. They were the most likely cause of RIL_SYMBOL being excluded from the matrix transformation at textangle 90.

TBOX rotate operations don't seem expensive, so it is unclear why the exclusion for RIL_SYMBOL was ever introduced.

Contributor

@stweil stweil left a comment

Please remove the two unused code lines and fix the indentation of the remaining line.

Is that code only used when making boxes, or is it also used in recognition?

@rmast
Author

rmast commented Oct 28, 2021

Please remove the two unused code lines and fix the indentation of the remaining line.

Done

Is that code only used when making boxes, or is it also used in recognition?

I see this second question has been struck through. I don't know. I ran make check afterward, and once gigabytes of possibly dependent languages had been downloaded and put in place, all the checks passed. It might be helpful to explicitly require only the needed set of languages in advance of make check.

@rmast rmast requested a review from stweil October 28, 2021 15:46
@p12tic
Contributor

p12tic commented Apr 21, 2022

@stweil @zdenop Is there anything that could be done to get this PR merged? I can create a separate PR rebased on top of the latest main branch tip, add tests, etc.

I have looked into all call paths that could potentially be affected by the code change. There aren't too many of them in the first place and all of them are within the auxiliary functionality of getting the results of tesseract out, not the recognition itself. As a consequence, the risk of a regression is probably relatively small.

The following is the list of APIs affected:

  • PageIterator::BoundingBoxInternal when level == RIL_SYMBOL
  • PageIterator::BoundingBox when level == RIL_SYMBOL
  • PageIterator::GetImage when level == RIL_SYMBOL
  • PageIterator::GetBinaryImage when level == RIL_SYMBOL
  • TessBaseAPI::GetComponentImages when level == RIL_SYMBOL
  • TessBaseAPI::GetConnectedComponents
  • TessBaseAPI::GetBoxText
  • output of hOCR renderer when hocr_char_boxes is enabled
  • output of BoxText renderer

@zdenop
Contributor

zdenop commented Apr 28, 2022

I am sorry, but I have very little spare time for tesseract. The PR seems interesting, but as it affects the API, it should be well tested, including its effect on training.

@p12tic
Contributor

p12tic commented Apr 28, 2022

Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

@wollmers

Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

@p12tic I'm interested in solving the bounding box problem. I will try to write regression tests with automatic measures covering more scripts, languages and fonts. It will take a while, provided I can find the time in the coming weeks or months. The complicated part is creating ground truth with correct bounding boxes.

@p12tic
Contributor

p12tic commented May 2, 2022

@wollmers This is great to hear. Is there any way to help? I could translate very high-level directions into working code :-) For you, answering a small number of questions should take much less time than doing the implementation.

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

I'm assuming that you don't want to go the route of rendering text and OCRing the result images back, like when doing LSTM training in certain cases. In this case the character positions are essentially already known. Well, at least that's my understanding which could be completely wrong.

@wollmers

wollmers commented May 2, 2022

@p12tic

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

Sorry, mismatched this PR with PR 3787. For 3787 (normal text without rotation) I wrote an approach at the weekend. See ocr-bbox-gt in prototypish Perl (without the dependencies, not published yet). If you can read it you can port it to your favourite language, which is maybe Python.

For text angles other than 0 degrees, the text image can be rotated before OCR and the bounding boxes geometrically transformed back. For angles that are not a multiple of 90 degrees, a polygon notation is needed, something like x1,y1 x2,y2 x3,y3.
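
A minimal sketch of the back-transformation for multiples of 90 degrees. It assumes pixel coordinates with the origin at the top-left (as in hOCR; box files use a bottom-left origin, so their y values would have to be flipped first), and it assumes the image was rotated clockwise before OCR. The function names are hypothetical and not part of Tesseract.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), top-left origin


def unrotate_point(x: int, y: int, angle: int, orig_w: int, orig_h: int) -> Tuple[int, int]:
    """Map a point from the clockwise-rotated image back to the original image."""
    if angle == 0:
        return x, y
    if angle == 90:
        return y, orig_h - x
    if angle == 180:
        return orig_w - x, orig_h - y
    if angle == 270:
        return orig_w - y, x
    raise ValueError("only multiples of 90 degrees are handled here")


def unrotate_box(box: Box, angle: int, orig_w: int, orig_h: int) -> Box:
    """Transform a box found in the rotated image into original-image coordinates."""
    left, top, right, bottom = box
    # Two opposite corners are enough: the box stays axis-aligned
    # under rotations by multiples of 90 degrees.
    x_a, y_a = unrotate_point(left, top, angle, orig_w, orig_h)
    x_b, y_b = unrotate_point(right, bottom, angle, orig_w, orig_h)
    return (min(x_a, x_b), min(y_a, y_b), max(x_a, x_b), max(y_a, y_b))
```

For a 100x50 original rotated by 90 degrees, `unrotate_box((35, 10, 45, 20), 90, 100, 50)` gives `(10, 5, 20, 15)`. Arbitrary angles would need all four corners and a polygon result instead of a box.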

Just use a clean image of text, which has no recognition errors (CER 0.0). That's the case for the sample image in 3787. Then use a legacy model with --oem 0 which provides nearly perfect bounding boxes.

Now we can check the quality as follows:

  • count the width (and height) per character in a first scan of the bboxes
  • select the width with the highest frequency ("best width")
  • compare the best width against the actual width (abs(width1 - width2))
  • count the occurrences of each deviation value

Of course this works only with clean, generated images in one and the same font, style and size. But we want to isolate the problem and reduce it to bbox errors only, so we want to exclude all other reasons for errors.
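
The width check described above could be sketched roughly as follows. This is not the unpublished Perl prototype; the function name, the input shape (pairs of character and box width) and the default tolerance of 2 pixels (taken from the "within +/-2" class in the results further down) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def width_report(char_widths, tolerance=2):
    """Classify box widths against the most frequent width of each character.

    char_widths: iterable of (character, width_in_pixels) pairs.
    Returns (widths_by_char, deviations, totals).
    """
    # First scan: width frequency per character.
    widths_by_char = defaultdict(Counter)
    for char, width in char_widths:
        widths_by_char[char][width] += 1

    deviations = Counter()  # deviation in pixels -> number of boxes
    totals = Counter()      # "exact" / "in" / "out" -> number of boxes
    for freq in widths_by_char.values():
        # "Best width" is the width with the highest frequency.
        best_width = freq.most_common(1)[0][0]
        for width, count in freq.items():
            deviation = abs(width - best_width)
            deviations[deviation] += count
            if deviation == 0:
                totals["exact"] += count
            elif deviation <= tolerance:
                totals["in"] += count
            else:
                totals["out"] += count
    return widths_by_char, deviations, totals
```

Dividing each total by the number of character boxes gives ratios like the ones reported per class.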

As text, one page of the Human Rights Declaration (available in ~500 languages) can be used. Format it with a popular font, export as PDF, pdftoimage, tesseract. That's the work needed to get ground truth. Then measure the errors by comparing ground truth, before patch, and after patch.
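
A rough sketch of that comparison step, assuming both runs recognise exactly the same character sequence (CER 0.0) so that boxes can be paired by position. The function name and the box representation are hypothetical.

```python
def compare_to_ground_truth(truth, candidate, tolerance=2):
    """Count boxes that match the ground truth exactly, nearly, or not at all.

    truth, candidate: lists of (character, (left, top, right, bottom)).
    A box is classified by its worst edge deviation in pixels.
    """
    if len(truth) != len(candidate):
        raise ValueError("box lists differ in length; cannot pair by position")

    result = {"exact": 0, "in": 0, "out": 0}
    for (t_char, t_box), (c_char, c_box) in zip(truth, candidate):
        if t_char != c_char:
            raise ValueError("character sequences differ; CER is not 0.0")
        worst = max(abs(a - b) for a, b in zip(t_box, c_box))
        if worst == 0:
            result["exact"] += 1
        elif worst <= tolerance:
            result["in"] += 1
        else:
            result["out"] += 1
    return result
```

Running it once with the boxes from before the patch and once with the boxes from after the patch gives two sets of counts that can be compared directly.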

With legacy only a few characters have deviation:

$ tesseract pr_3787.png pr_3787.oem0.psm6.lat.png -l t5data/lat --oem 0 --psm 6 \
	--tessdata-dir /usr/local/share/tessdata makebox hocr txt pdf

$charboxes: 1706

width frequency per character
[...] # all other characters have only one width
r 15: 88,  16: 3,  17: 2, 
s 15: 156, 
t 13: 107,  15: 10,  14: 7,  12: 3,  16: 3, 
u 24: 138,  25: 5, 
v 24: 20,  25: 1,

width errors
    exact: 1653 (0.9689) # color green
    in   : 50   (0.0293) # within +/-2; color orange
    out  : 3    (0.0018) # 3 't' with width 16; color red

One of the 3 errors (deviation 3 pixels):

[Image: screenshot, 2022-05-02 11:02:03]

It would be easy to correct these few remaining errors in a website (import the bboxes as JSON and write the corrections back). Then the resulting bbox file is the ground truth.

With CTC/LSTM Tesseract release 5.1.0 it looks like this:

$ tesseract pr_3787.png pr_3787.psm6.png -l deu --psm 6 \
	--tessdata-dir /usr/local/share/tessdata makebox hocr txt pdf

width errors
    exact: 1304 (0.7644)
    in   : 89   (0.0522)
    out  : 314  (0.1841)

The same part of the image with CTC/LSTM:

[Image: screenshot, 2022-05-02 11:05:33]

@rmast
Author

rmast commented Aug 13, 2022

@wollmers wrote

Sorry, mismatched this PR with PR 3787.

Yes, at first I couldn't see what my rotation fix had to do with your response, but since I have now also run into a bounding-box issue, I'll be glad to try your PR to see whether it fixes it. I'll first see whether I can get it running satisfactorily with LSTM before reverting to OEM 0 for my bounding boxes. My fix doesn't fix the bounding boxes of upright (non-rotated) text.
