empty filename in ALTO xml file #2700

renarios · 2019-10-08T19:26:43Z

Environment

Tesseract Version: tesseract 5.0.0-alpha-469-g6b35
Commit Number:
Platform: Linux 4.15.0-65-generic #74-Ubuntu x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Running tesseract <tif file> <basename> -l nld --dpi 300 --oem 2 --psm 1 alto gives an xml output file.
In the xml output file the filename is empty:

                <sourceImageInformation>
                        <fileName>                      </fileName>
                </sourceImageInformation>

Expected Behavior:

            <sourceImageInformation>
                    <fileName><tif file></fileName>
            </sourceImageInformation>

Suggested Fix:

insert filename


stweil · 2019-10-09T20:14:16Z

Thank you for reporting this. The error already existed with commit d7cee03.

The title can be set for hOCR and PDF output. Currently it is also used for ALTO, so setting the title can be used as a workaround for issue tesseract-ocr#2700. The constant unknown_title_ is no longer needed and therefore removed. Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2019-10-10T13:47:20Z

Pull request #2705 implements a workaround to set the missing filename.

The final fix needs more effort because the image filename is currently not available
in the function TessAltoRenderer::BeginDocumentHandler, which writes the filename.

stweil · 2019-10-14T17:36:28Z

I just had a look at our ALTO files (created by ABBYY FineReader). None of them contains <sourceImageInformation>.


stweil · 2019-11-08T16:21:08Z

@renarios, is the XML element sourceImageInformation needed, or can we simply remove it?

renarios · 2019-11-11T08:18:34Z

@stweil, I found the ALTO standard on this website, and it says that the element is not required, but that it is preferable to add it.

marma · 2019-11-15T12:39:53Z

Alto-files are often (or at least sometimes) stored alongside the images used for OCR. There is definitely a point in referencing the image in the Alto as that relationship would otherwise have to be described or deduced some other way.

Example: https://data.kb.se/datasets/2014/10/aftonbladet/1862/01/urn%253Anbn%253Ase%253Akb%253Adark-29967/

bertsky · 2019-12-16T10:10:09Z

I don't think we should be satisfied with the workaround in #2705 yet. For the user this means calling something like tesseract -c document_title=myimg.tif myimg.tif myimg alto (and similar contortions via API). And this does not even work at all for multi-input (see below).

We know the input image file name, so that's exactly what we should be referencing in /alto/description/sourceImageInformation/fileName. It's just a matter of proper refactoring. And that would be trivial if we did not have the two multi-input options:

1. text file with path names
2. multi-page TIFF image

But those two make the existing structure inadequate. Currently, the constructor of each renderer already sets up the output file from the output basename and the renderer's extension, and each page's results are then merely appended to that file. For ALTO, we actually have to use different output files for different pages!

And each output file must refer to:

for 1: the respective line in the input text file,
for 2: the same (multi-page) TIFF, perhaps with the page number in /alto/Layout/Page/@PHYSICAL_IMG_NR to differentiate them.
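The two cases above can be sketched as a small mapping from the input to what each per-page ALTO file would refer to. This is only an illustration: PageSource, SourcesForFileList and SourcesForMultiPageTiff are hypothetical names, not Tesseract code, and the 0-based page index is an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical record of what one per-page ALTO file would point back to.
struct PageSource {
  std::string file_name;  // value for sourceImageInformation/fileName
  int physical_img_nr;    // value for Page/@PHYSICAL_IMG_NR (0-based here, an assumption)
};

// Case 1: a text file with one image path per line.
// Each output refers to its own line; every listed file is assumed to hold one page.
std::vector<PageSource> SourcesForFileList(const std::vector<std::string>& paths) {
  std::vector<PageSource> sources;
  sources.reserve(paths.size());
  for (const std::string& path : paths) {
    sources.push_back({path, 0});
  }
  return sources;
}

// Case 2: one multi-page TIFF.
// Every output refers to the same file; only the page index differs.
std::vector<PageSource> SourcesForMultiPageTiff(const std::string& tiff, int page_count) {
  std::vector<PageSource> sources;
  for (int page = 0; page < page_count; ++page) {
    sources.push_back({tiff, page});
  }
  return sources;
}
```

In both cases the result has one entry per page, which is the information a single-page renderer would need at the moment it writes its preamble.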

It's probably not just ALTO – there might be other (current or future) single-page renderers, too. But we definitely also have output options that need multi-page rendering behaviour, e.g. PDF.

So, it's not enough to just call BeginDocument(filename) and EndDocument() (on the chain of renderers) around each individual ProcessPage(...filename...) instead of around the whole sequence: That would instead break multi-page renderers!

One solution might be to allow renderers to switch their underlying title_ and fout_ during AddImage() by adding filename to the arguments of that method. Then a single-page renderer like ALTO could override ...

bool TessResultRenderer::AddImage(TessBaseAPI* api, const char* filename) {
  if (!happy_) return false;
  ++imagenum_;
  bool ok = AddImageHandler(api);
  if (next_) {
    ok = next_->AddImage(api, filename) && ok;
  }
  return ok;
}

... with something like ...

bool TessAltoRenderer::AddImage(TessBaseAPI* api, const char* filename) {
  if (!happy_) return false;
  ++imagenum_;  // assumed to start at -1, so it is 0 for the first page
  // begin: single-page behaviour
  // outputbase_ is assumed to be a member that keeps the constructor argument
  const bool to_file =
      strcmp(outputbase_, "-") != 0 && strcmp(outputbase_, "stdout") != 0;
  if (imagenum_ > 0) {
    happy_ = EndDocumentHandler(); // append postamble to the previous page
    if (to_file) fclose(fout_);
  }
  if (to_file) {
    // one output file per page: <outputbase>_<imagenum>.<extension>
    STRING outfile = STRING(outputbase_);
    outfile.add_str_int("_", imagenum_);
    outfile += STRING(".") + STRING(file_extension_);
    fout_ = fopen(outfile.c_str(), "wb");
    if (fout_ == nullptr) {
      happy_ = false;
      return false; // never write the preamble to a file that failed to open
    }
  }
  title_ = filename;
  happy_ = happy_ && BeginDocumentHandler(); // append preamble
  if (!happy_) return false;
  // end: single-page behaviour
  bool ok = AddImageHandler(api); // append results
  if (next_) {
    ok = next_->AddImage(api, filename) && ok;
  }
  return ok;
}

Of course, one might even sub-class the old behaviour into TessSinglePageResultRenderer and the new into TessMultiPageResultRenderer (both still with abstract constructors) to make this distinction systematic and avoid code duplication.
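To make the distinction between the two behaviours concrete, here is a minimal, self-contained sketch. It does not use Tesseract's classes: RendererSketch, AppendingRenderer and PerPageRenderer are hypothetical stand-ins, the ".out" extension and the doc/page markup are placeholders, and outputs are kept in memory instead of files so that the behaviour is easy to inspect.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a renderer base class (not Tesseract's API).
class RendererSketch {
 public:
  virtual ~RendererSketch() = default;
  // Called once per page with the page's source filename and its recognized text.
  virtual void AddImage(const std::string& filename, const std::string& text) = 0;
  // Called once after the last page.
  virtual void Finish() = 0;
  // Output "files" keyed by name, held in memory for this sketch.
  const std::map<std::string, std::string>& outputs() const { return outputs_; }

 protected:
  std::map<std::string, std::string> outputs_;
};

// Appending behaviour (as needed for PDF): one output, one preamble,
// every page appended to it. Only the first filename reaches the preamble.
class AppendingRenderer : public RendererSketch {
 public:
  explicit AppendingRenderer(std::string base) : name_(std::move(base) + ".out") {}
  void AddImage(const std::string& filename, const std::string& text) override {
    std::string& out = outputs_[name_];
    if (out.empty()) out = "<doc title=\"" + filename + "\">";
    out += "<page>" + text + "</page>";
  }
  void Finish() override {
    auto it = outputs_.find(name_);
    if (it != outputs_.end()) it->second += "</doc>";
  }

 private:
  std::string name_;
};

// Per-page behaviour (as proposed for ALTO): a fresh output for every page,
// each with its own preamble naming that page's source image.
class PerPageRenderer : public RendererSketch {
 public:
  explicit PerPageRenderer(std::string base) : base_(std::move(base)) {}
  void AddImage(const std::string& filename, const std::string& text) override {
    const std::string name = base_ + "_" + std::to_string(count_++) + ".out";
    outputs_[name] =
        "<doc title=\"" + filename + "\"><page>" + text + "</page></doc>";
  }
  void Finish() override {}  // every output was already closed in AddImage

 private:
  std::string base_;
  int count_ = 0;
};
```

Feeding the same two pages to both classes yields one output from AppendingRenderer and two from PerPageRenderer, and only the latter records the second page's filename anywhere.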

An alternative, much simpler solution could be to just return with an error when ALTO output is requested in the multi-input case. (But some structural changes are required even in the single-input case, because the input filename still needs to enter the preamble.)

@stweil what do you think?

M3ssman · 2020-05-04T08:20:33Z

Is there actually any progress regarding this issue?

Please keep in mind that this issue is about unexpected behavior that should be turned off: Tesseract version 4.1.1 writes three tabs as text where a filename should appear. I'd prefer a pragmatic solution.

stweil · 2020-05-05T18:26:15Z

@bertsky wrote a good summary of the problem, which prevents an easy fix. Basically, the current API needs changes to provide the filename at the right place.

Changing the API would be possible, as we are talking about Tesseract 5, which may be API-incompatible with Tesseract 4. But we also have to consider third-party software like tesserocr, which must work with Tesseract 4 and 5. That possibly makes it difficult.

Writing no sourceImageInformation tag if there is no filename available would be a pragmatic solution (and compatible with ABBYY FineReader). Should that be implemented as an intermediate solution?
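As an illustration of that intermediate solution, the following sketch builds the element only when a usable filename exists. It is not the code from any Tesseract commit: SourceImageInformation and EscapeXml are hypothetical helpers, and treating a whitespace-only name as missing is an assumption of this sketch.

```cpp
#include <string>

// Escapes the five XML special characters so a filename is safe as element text.
std::string EscapeXml(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      case '\'': out += "&apos;"; break;
      default: out += c;
    }
  }
  return out;
}

// Returns the <sourceImageInformation> block for the ALTO description,
// or an empty string when the filename is empty or only whitespace.
std::string SourceImageInformation(const std::string& filename) {
  if (filename.find_first_not_of(" \t\r\n") == std::string::npos) {
    return "";  // nothing useful to reference, so omit the whole element
  }
  return "<sourceImageInformation>\n"
         "\t<fileName>" + EscapeXml(filename) + "</fileName>\n"
         "</sourceImageInformation>\n";
}
```

A caller would append the returned string to the preamble unconditionally; an empty result simply leaves the element out.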

M3ssman · 2020-05-05T18:59:39Z

If these are the options, it's preferable to skip the output of data lacking any value.

bertsky · 2020-05-05T19:05:51Z

Tesseract version 4.1.1 writes three tabs as text where a filename should appear.

@M3ssman I don't understand where these three tabs come from in the current implementation. As I understand it, sourceImageInformation/fileName should already be empty unless Tesseract is explicitly called with -c document_title=somefile.tif etc. (Whatever is in there gets passed to title via BeginDocument() in all renderers.)

Writing no sourceImageInformation tag if there is no filename available would be a pragmatic solution

@stweil did you mean we have to omit the element sourceImageInformation altogether when fileName would be empty? (This would be a simple change I can make as part of #2815.)

Changing the API would be possible, as we are talking about Tesseract 5, which may be API-incompatible with Tesseract 4. But we also have to consider third-party software like tesserocr, which must work with Tesseract 4 and 5. That possibly makes it difficult.

Difficult, yes, but worthwhile: We have a structural problem (multi-input with single- vs multi-output renderers) related to API and CLI which will not go away. Avoiding API breaks because there are already early adopters effectively means locking down the API forever. Instead we should actively support the transition in modules like tesserocr. (It also still contains many workarounds from the 3-4 transition.)

M3ssman · 2020-05-05T19:20:23Z

@bertsky Sorry, I messed up. Tesseract 4.1.1 produces empty elements like

<sourceImageInformation>
	<fileName></fileName>
</sourceImageInformation>

Tesseract 4.1.0 produced the tab output

<sourceImageInformation>
	<fileName>			</fileName>
</sourceImageInformation>

stweil · 2021-08-08T11:28:49Z

Pull request #3517 is merged now in Git master, so this issue can be closed.
Fixing the issue requires API changes, so there won't be a solution for the 4.1 branch.

bertsky · 2021-08-16T23:20:01Z

Pull request #3517 is merged now in Git master, so this issue can be closed.
Fixing the issue requires API changes, so there won't be a solution for the 4.1 branch.

@stweil, okay so in #3517 you went for the "simpler solution" sketched above, which does not generalize to the multi-input case:

An alternative, much simpler solution could be to just return with an error when ALTO output is requested in the multi-input case. (But some structural changes are required even in the single-input case, because the input filename still needs to enter the preamble.)

Thus, IMO you still need to abort with an error if the ALTO renderer is requested for multi-input (multi-page TIFF or multi-line text file). In the current implementation, only the first page will have the correct sourceImageInformation, the others will silently get the wrong reference.


stweil · 2021-08-18T20:02:27Z

If these are the options, it's preferable to skip the output of data lacking any value.

This is now implemented for 4.1 in commit b2eb72b.

stweil · 2021-08-18T20:16:35Z

Thus, IMO you still need to abort with an error if the ALTO renderer is requested for multi-input (multi-page TIFF or multi-line text file). In the current implementation, only the first page will have the correct sourceImageInformation, the others will silently get the wrong reference.

For multi-page TIFF the current solution works perfectly: there is only a single image file, and it is correctly named in sourceImageInformation.

If Tesseract processes a list of image files, sourceImageInformation gets the name of the first image file. The ALTO standard does not support a better solution as far as I see. As the rest of the ALTO output is fine, I don't think that we should abort for this use case.

bertsky · 2021-08-18T20:35:26Z

For multi-page TIFF the current solution works perfectly: there is only a single image file, and it is correctly named in sourceImageInformation.

Good point!

If Tesseract processes a list of image files, sourceImageInformation gets the name of the first image file. The ALTO standard does not support a better solution as far as I see. As the rest of the ALTO output is fine, I don't think that we should abort for this use case.

Sorry, I thought I had seen a multi-output case before (using outputbase with different suffixes), which would have warranted an expectation on the user's side that the ALTO renderer works. But since that is not the case – CLI output is always single-file – I fully agree aborting would be wrong.

A warning for each single-page renderer (ALTO, hOCR, ...) active in a multi-input run would still be nice, though.
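Such a warning could be sketched as follows. This is a hypothetical helper, not existing Tesseract code: RendererInfoSketch, MultiInputWarnings and the message wording are made up for illustration, and which renderers count as single-page is left to the caller.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of an active renderer.
struct RendererInfoSketch {
  std::string name;  // e.g. "alto"
  bool single_page;  // true if the output format describes one page only
};

// Returns one warning per single-page renderer when more than one input file
// is processed in the same run; returns nothing for a single input.
std::vector<std::string> MultiInputWarnings(
    const std::vector<RendererInfoSketch>& active, std::size_t input_file_count) {
  std::vector<std::string> warnings;
  if (input_file_count <= 1) return warnings;
  for (const RendererInfoSketch& renderer : active) {
    if (renderer.single_page) {
      warnings.push_back("Warning: " + renderer.name +
                         " output refers to a single image, but " +
                         std::to_string(input_file_count) +
                         " input files were given");
    }
  }
  return warnings;
}
```

The caller would print the returned lines to stderr before processing starts, so the run continues and only the reference to the first image is flagged.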

stweil added the bug label Oct 9, 2019

stweil added this to the 5.0.0 milestone Oct 9, 2019

stweil mentioned this issue Oct 10, 2019

Add new parameter "document_title" to set the title in OCR output files #2705

Merged

stweil mentioned this issue Nov 15, 2019

Retain all bounding box levels in Alto output #2766

Closed

stweil mentioned this issue Dec 13, 2019

ALTO renderer: move to v4, add Glyphs #2815

Open

tesseract-ocr deleted a comment from GSATHYANARAYANA May 5, 2020

M3ssman mentioned this issue Jun 17, 2021

Unclear: localsave proper configuration OCR4all/LAREX#260

Closed

stweil mentioned this issue Aug 7, 2021

Write image filename in ALTO output and reduce size of renderer classes #3517

Merged

stweil closed this as completed Aug 8, 2021

stweil added a commit that referenced this issue Aug 18, 2021

Don't output empty ALTO sourceImageInformation (issue #2700)

b2eb72b

Signed-off-by: Stefan Weil <sw@weilnetz.de>

sunoru mentioned this issue Apr 11, 2022

[API] ALTO renderer does not work if no input name is set. #3788

Closed
