Repair boundingbox of individual characters of textangle 90 text #3599

rmast · 2021-10-18T20:34:14Z

Solution to issue #3590 (makebox doesn't output horizontal coordinates of textangle 90 content).

I traced these lines back to 2010, and no one has touched them since. They are the most likely cause of RIL_SYMBOL being excluded from the matrix transformation at textangle 90.

TBOX rotate operations don't seem expensive, so it is unclear why the exclusion for RIL_SYMBOL was ever introduced.

stweil

Please remove the two unused code lines and fix the indentation of the remaining line.

~~Is that code only used when making boxes, or is it also used in recognition?~~

rmast · 2021-10-28T15:45:06Z

> Please remove the two unused code lines and fix the indentation of the remaining line.

Done

> ~~Is that code only used when making boxes, or is it also used in recognition?~~

I see this second question has been struck through. I don't know the answer. I ran make check afterwards, and once gigabytes of possibly dependent languages had been downloaded and put in place, all checks ran fine. It might be helpful if make check explicitly required only the set of languages it actually needs.

p12tic · 2022-04-21T22:28:28Z

@stweil @zdenop Is there anything that could be done to get this PR merged? I can create a separate PR rebased on top of latest main branch tip, add tests, etc.

I have looked into all call paths that could potentially be affected by the code change. There aren't too many of them in the first place and all of them are within the auxiliary functionality of getting the results of tesseract out, not the recognition itself. As a consequence, the risk of a regression is probably relatively small.

The following is the list of APIs affected:

- PageIterator::BoundingBoxInternal when level == RIL_SYMBOL
- PageIterator::BoundingBox when level == RIL_SYMBOL
- PageIterator::GetImage when level == RIL_SYMBOL
- PageIterator::GetBinaryImage when level == RIL_SYMBOL
- TessBaseAPI::GetComponentImages when level == RIL_SYMBOL
- TessBaseAPI::GetConnectedComponents
- TessBaseAPI::GetBoxText
- output of hOCR renderer when hocr_char_boxes is enabled
- output of BoxText renderer

zdenop · 2022-04-28T12:10:26Z

I am sorry, but I have very little spare time for tesseract. The PR seems interesting, but as it affects the API, it should be well tested, including its effect on training.

p12tic · 2022-04-28T13:35:22Z

Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

wollmers · 2022-04-29T12:03:07Z

> Thanks for the response. I will do extensive testing and present the results in a way that requires as little time as possible to review.

@p12tic I'm interested in solving the bounding box problem. I will try to write regression tests with automatic measures covering more scripts, languages and fonts. It will take a while, provided I can find time in the coming weeks or months. The complicated part is creating ground truth with correct bounding boxes.

p12tic · 2022-05-02T00:32:48Z

@wollmers This is great to hear. Is there any way to help? I could translate very high-level directions into working code :-) For you, answering a small number of questions should take much less time than doing the implementation yourself.

To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

I'm assuming that you don't want to go the route of rendering text and OCRing the result images back, like when doing LSTM training in certain cases. In this case the character positions are essentially already known. Well, at least that's my understanding which could be completely wrong.

wollmers · 2022-05-02T09:45:32Z

@p12tic

> To me it seems that annotating ground truth images with correct bounding boxes is work that is not complicated in principle, but just needs a lot of effort for automation and reviewing. This would be a perfect task for an external developer like me to accomplish.

Sorry, I confused this PR with PR 3787. For 3787 (normal text without rotation) I wrote an approach over the weekend. See ocr-bbox-gt, written in prototype-quality Perl (without the dependencies, not published yet). If you can read it, you can port it to your favourite language, which is maybe Python.

For text angles other than 0 degrees, the text image can be rotated before OCR and the bounding boxes geometrically transformed back. For degrees other than a multiple of 90 a polygon notation is needed, something like x1,y1 x2,y2 x3,y3.
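For the 90-degree case the back-transformation is plain coordinate arithmetic. The following is only a minimal sketch, under assumptions that are not part of this PR: the page image was rotated 90 degrees clockwise before OCR, coordinates have a top-left origin with y pointing down, and boxes are (left, top, right, bottom) tuples. The function names are hypothetical. Tesseract box files use a bottom-left origin, so those coordinates would need a vertical flip first.

```python
def rotate_box_cw90(box, orig_height):
    """Map a box from the original image into the image rotated 90 degrees clockwise.

    Assumes a top-left origin; box is (left, top, right, bottom).
    A point (x, y) in the original lands at (orig_height - y, x).
    """
    left, top, right, bottom = box
    return (orig_height - bottom, left, orig_height - top, right)


def unrotate_box_cw90(box, orig_height):
    """Inverse of rotate_box_cw90: map a box found in the rotated image
    back into the coordinate system of the original image."""
    left, top, right, bottom = box
    return (top, orig_height - right, bottom, orig_height - left)


if __name__ == "__main__":
    original = (10, 20, 30, 50)  # box in a 100 x 200 (width x height) image
    rotated = rotate_box_cw90(original, orig_height=200)
    print(rotated)
    print(unrotate_box_cw90(rotated, orig_height=200))
```

For angles that are not a multiple of 90, the same idea applies to each of the four corners separately, and the result is a polygon rather than an axis-aligned box.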

Just use a clean image of text, which has no recognition errors (CER 0.0). That's the case for the sample image in 3787. Then use a legacy model with --oem 0 which provides nearly perfect bounding boxes.

Now we can check the quality as follows:

1. count the widths (and heights) per character in a first scan of the bboxes
2. select the width with the highest frequency for each character ("best width")
3. compare the best width against the actual width of each box (abs(width1 - width2))
4. count the number of boxes per deviation value

Of course this works only with clean, generated images in a single font, style and size. But we want to isolate the problem and reduce it to bbox errors only, and thus exclude all other reasons for errors.
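The check described above can be sketched in Python roughly as follows. This is not the ocr-bbox-gt code, only an illustration; the function names are hypothetical, and the parser assumes makebox lines of the form `symbol left bottom right top page`. The tolerance of 2 pixels mirrors the "within +/-2" class used in the results below.

```python
from collections import Counter, defaultdict


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top).

    Assumes the layout 'symbol left bottom right top page'.
    """
    symbol, left, bottom, right, top, _page = line.rsplit(maxsplit=5)
    return symbol, int(left), int(bottom), int(right), int(top)


def width_report(boxes, tolerance=2):
    """Return (width frequencies per symbol, best width per symbol, summary)."""
    boxes = list(boxes)  # the boxes are scanned twice

    # First scan: count how often each width occurs per symbol.
    per_symbol = defaultdict(Counter)
    for symbol, left, _bottom, right, _top in boxes:
        per_symbol[symbol][right - left] += 1

    # The most frequent width of a symbol is taken as its "best width".
    best = {s: counts.most_common(1)[0][0] for s, counts in per_symbol.items()}

    # Second scan: classify each box by its deviation from the best width.
    summary = {"exact": 0, "in": 0, "out": 0}
    for symbol, left, _bottom, right, _top in boxes:
        deviation = abs((right - left) - best[symbol])
        if deviation == 0:
            summary["exact"] += 1
        elif deviation <= tolerance:
            summary["in"] += 1
        else:
            summary["out"] += 1
    return per_symbol, best, summary


if __name__ == "__main__":
    lines = ["t 100 200 113 230 0", "t 150 200 163 230 0", "t 180 200 196 230 0"]
    _, best, summary = width_report(parse_box_line(l) for l in lines)
    print(best, summary)
```

Heights can be checked the same way by replacing `right - left` with `top - bottom`.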

As text, one page of the Universal Declaration of Human Rights (available in ~500 languages) can be used. Format it with a popular font, export it as PDF, convert the PDF to an image (pdftoimage), and run tesseract. That is the work needed to get ground truth. Then measure the errors against the ground truth before and after the patch.

With legacy only a few characters have deviation:

$ tesseract pr_3787.png pr_3787.oem0.psm6.lat.png -l t5data/lat --oem 0 --psm 6 \
	--tessdata-dir /usr/local/share/tessdata makebox hocr txt pdf

$charboxes: 1706

width frequency per character
[...] # all other characters have only one width
r 15: 88,  16: 3,  17: 2, 
s 15: 156, 
t 13: 107,  15: 10,  14: 7,  12: 3,  16: 3, 
u 24: 138,  25: 5, 
v 24: 20,  25: 1,

width errors
    exact: 1653 (0.9689) # color green
    in   : 50   (0.0293) # within +/-2; color orange
    out  : 3    (0.0018) # 3 't' with width 16; color red

One of the 3 errors (deviation 3 pixels):

It would be easy to correct these few remaining errors in a web page (import the bboxes as JSON and write the corrections back). The resulting bbox file is then the ground truth.

With CTC/LSTM in Tesseract release 5.1.0 it looks like this:

$ tesseract pr_3787.png pr_3787.psm6.png -l deu --psm 6 \
	--tessdata-dir /usr/local/share/tessdata makebox hocr txt pdf

width errors
    exact: 1304 (0.7644)
    in   : 89   (0.0522)
    out  : 314  (0.1841)

The same part of the image with CTC/LSTM:

rmast · 2022-08-13T17:21:25Z

@wollmers wrote

> Sorry, mismatched this PR with PR 3787.

Yes, at first I couldn't see what my rotation fix had to do with your response, but now that I have also run into a bounding-box issue, I'll be glad to try your PR and see whether it fixes it. I'll first check whether I can get satisfactory results with LSTM before falling back to OEM 0 for my bounding boxes. My fix doesn't address bounding boxes of upright (non-rotated) text.

rmast mentioned this pull request Oct 18, 2021

makebox doesn't output horizontal coordinates of textangle 90 content #3590

Open

stweil requested changes Oct 27, 2021

rmast requested a review from stweil October 28, 2021 15:46

wollmers mentioned this pull request May 10, 2022

LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477

Open

rmast closed this Aug 15, 2022

rmast force-pushed the main branch from bbc9a05 to 0daf18c Compare August 15, 2022 16:24

rmast mentioned this pull request Aug 15, 2022

Repair boundingbox of individual characters of textangle 90 text PR#3599 on separate branch #3900

Open
