empty filename in ALTO xml file #2700

Closed
renarios opened this issue Oct 8, 2019 · 17 comments

@renarios

renarios commented Oct 8, 2019

Environment

Current Behavior:

Running tesseract <tif file> <basename> -l nld --dpi 300 --oem 2 --psm 1 alto gives an xml output file.
In the xml output file the filename is empty:

                <sourceImageInformation>
                        <fileName>                      </fileName>
                </sourceImageInformation>

Expected Behavior:

            <sourceImageInformation>
                    <fileName><tif file></fileName>
            </sourceImageInformation>

Suggested Fix:

insert filename

@stweil stweil added the bug label Oct 9, 2019
@stweil stweil added this to the 5.0.0 milestone Oct 9, 2019
@stweil
Contributor

stweil commented Oct 9, 2019

Thank you for reporting this. The error already existed with commit d7cee03.

stweil added a commit to stweil/tesseract that referenced this issue Oct 10, 2019
The title can be set for hOCR and PDF output.

Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue tesseract-ocr#2700.

The constant unknown_title_ is no longer needed and therefore removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Oct 10, 2019

Pull request #2705 implements a workaround to set the missing filename.

The final fix needs more effort because the image filename is currently not available
in the function TessAltoRenderer::BeginDocumentHandler, which writes the filename.

@stweil
Contributor

stweil commented Oct 14, 2019

I just had a look at our ALTO files (created by ABBYY FineReader). None of them contains <sourceImageInformation>.

zdenop pushed a commit that referenced this issue Nov 1, 2019
The title can be set for hOCR and PDF output.

Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue #2700.

The constant unknown_title_ is no longer needed and therefore removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Nov 8, 2019

@renarios, is the XML element sourceImageInformation needed, or can we simply remove it?

@renarios
Author

@stweil, I found the ALTO standard on this website and it says that the element is not required, but it is preferable to add it.

@marma
Contributor

marma commented Nov 15, 2019

ALTO files are often (or at least sometimes) stored alongside the images used for OCR. There is definitely a point in referencing the image in the ALTO file, as that relationship would otherwise have to be described or deduced some other way.

Example: https://data.kb.se/datasets/2014/10/aftonbladet/1862/01/urn%253Anbn%253Ase%253Akb%253Adark-29967/

@bertsky
Contributor

bertsky commented Dec 16, 2019

I don't think we should be satisfied with the workaround in #2705 yet. For the user this means calling something like tesseract -c document_title=myimg.tif myimg.tif myimg alto (and similar contortions via API). And this does not even work at all for multi-input (see below).

We know the input image file name, so that's exactly what we should be referencing in /alto/description/sourceImageInformation/fileName. It's just a matter of proper refactoring. And that would be trivial if we did not have the two multi-input options:

  1. text file with path names
  2. multi-page TIFF image

But those two make the existing structure inadequate. Currently, the renderers set up the output file from the output basename and the renderer extension in their constructors, and then merely append each page's results to that file. For ALTO, we actually have to use different output files for different pages!

And each output file must refer to:

  • for 1: the respective line in the input text file,
  • for 2: the same (multi-page) TIFF, perhaps with the page number in /alto/Layout/Page/@PHYSICAL_IMG_NR to differentiate them.
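
The two cases above can be sketched as a small helper that decides what the ALTO file for a given page would reference. This is only an illustration: `InputKind`, `PageSource` and `SourceForPage` are hypothetical names that do not exist in Tesseract, and counting `PHYSICAL_IMG_NR` from one is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of this exists in Tesseract.
enum class InputKind { kFileList, kMultiPageTiff };

struct PageSource {
  std::string file_name;    // value for sourceImageInformation/fileName
  int physical_img_nr = 0;  // value for Page/@PHYSICAL_IMG_NR, 0 = omit
};

// Returns what the ALTO file for page `page_index` (zero-based) should
// reference. For a file list, `inputs` holds one path per line of the list;
// for a multi-page TIFF, `inputs` holds just the path of the TIFF.
PageSource SourceForPage(InputKind kind,
                         const std::vector<std::string>& inputs,
                         std::size_t page_index) {
  PageSource src;
  if (kind == InputKind::kFileList) {
    assert(page_index < inputs.size());
    src.file_name = inputs[page_index];  // the respective line of the list
  } else {
    assert(!inputs.empty());
    src.file_name = inputs.front();  // always the same TIFF
    // Assumption: one-based page numbers distinguish the pages of the TIFF.
    src.physical_img_nr = static_cast<int>(page_index) + 1;
  }
  return src;
}
```

For a list containing a.tif and b.tif, page index 1 would reference b.tif; for a multi-page TIFF, every page references the same file and differs only in the page number.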

It's probably not just ALTO – there might be other (current or future) single-page renderers, too. But we definitely also have output options that need multi-page rendering behaviour, e.g. PDF.

So, it's not enough to just call BeginDocument(filename) and EndDocument() (on the chain of renderers) around each individual ProcessPage(...filename...) instead of around the whole sequence: That would instead break multi-page renderers!

One solution might be to allow renderers to switch their underlying title_ and fout_ during AddImage() – by adding filename to the arguments of that method. Then a single-page renderer like ALTO could override ...

bool TessResultRenderer::AddImage(TessBaseAPI* api, const char* filename) {
  if (!happy_) return false;
  ++imagenum_;
  bool ok = AddImageHandler(api);
  if (next_) {
    ok = next_->AddImage(api, filename) && ok;
  }
  return ok;
}

... with something like ...

bool TessAltoRenderer::AddImage(TessBaseAPI* api, const char* filename) {
  if (!happy_) return false;
  ++imagenum_;
  // begin: single-page behaviour
  if (imagenum_ > 0)
    happy_ = EndDocumentHandler(); // append postamble of the previous page
  if (strcmp(outputbase_, "-") && strcmp(outputbase_, "stdout")) {
    // writing to files: start a new output file for each page
    if (imagenum_ > 0)
      fclose(fout_);
    STRING outfile = STRING(outputbase_);
    outfile.add_str_int("_", imagenum_);
    outfile += STRING(".") + STRING(file_extension_);
    fout_ = fopen(outfile.c_str(), "wb");
    if (fout_ == nullptr) {
      happy_ = false;
      return false; // do not write the preamble to a missing file
    }
  }
  title_ = filename;
  happy_ = BeginDocumentHandler() && happy_; // append preamble
  if (!happy_) return false;
  // end: single-page behaviour
  bool ok = AddImageHandler(api); // append results
  if (next_) {
    ok = next_->AddImage(api, filename) && ok;
  }
  return ok;
}

Of course, one might even sub-class the old behaviour into TessMultiPageResultRenderer and the new into TessSinglePageResultRenderer (both still with abstract constructors) to make this distinction systematic and avoid code duplication.
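
The distinction between the two kinds of renderer could be sketched as follows. This is a minimal, self-contained illustration with stand-in classes, not the real TessResultRenderer: the names `Renderer`, `MultiPageRenderer` and `SinglePageRenderer`, the placeholder markup, and the in-memory `outputs()` map (used instead of `fout_` so the sketch can run without files) are all assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in base class. Output goes to an in-memory map (file name -> content)
// instead of real files.
class Renderer {
 public:
  Renderer(std::string outputbase, std::string extension)
      : outputbase_(std::move(outputbase)), extension_(std::move(extension)) {}
  virtual ~Renderer() = default;

  virtual void BeginDocument(const std::string& title) = 0;
  virtual void AddImage(const std::string& filename,
                        const std::string& page_text) = 0;
  virtual void EndDocument() = 0;

  const std::map<std::string, std::string>& outputs() const { return outputs_; }

 protected:
  std::string outputbase_;
  std::string extension_;
  int imagenum_ = -1;
  std::map<std::string, std::string> outputs_;
};

// All pages are appended to one output file (as needed for PDF, for example).
class MultiPageRenderer : public Renderer {
 public:
  using Renderer::Renderer;
  void BeginDocument(const std::string& title) override {
    current_ = outputbase_ + "." + extension_;
    outputs_[current_] = "<doc title='" + title + "'>";  // preamble
  }
  void AddImage(const std::string&, const std::string& page_text) override {
    assert(!current_.empty());  // BeginDocument must have been called
    ++imagenum_;
    outputs_[current_] += "<page>" + page_text + "</page>";
  }
  void EndDocument() override { outputs_[current_] += "</doc>"; }  // postamble

 private:
  std::string current_;
};

// Every page gets its own complete output file that names its source image.
class SinglePageRenderer : public Renderer {
 public:
  using Renderer::Renderer;
  void BeginDocument(const std::string&) override {}  // preamble is per page
  void AddImage(const std::string& filename,
                const std::string& page_text) override {
    ++imagenum_;
    const std::string name =
        outputbase_ + "_" + std::to_string(imagenum_) + "." + extension_;
    outputs_[name] =
        "<doc source='" + filename + "'><page>" + page_text + "</page></doc>";
  }
  void EndDocument() override {}  // postamble was already written per page
};
```

The caller drives both classes with the same BeginDocument / AddImage / EndDocument sequence; only the single-page variant uses the filename passed to AddImage to start a new output.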

An alternative, much simpler solution could be to just return with an error when ALTO output is requested in the multi-input case. (But some structural changes are required even in the single-input case, because the input filename still needs to enter the preamble.)

@stweil what do you think?

@M3ssman
Contributor

M3ssman commented May 4, 2020

Is there actually any progress regarding this issue?

Please keep in mind that this issue is about unexpected behavior that should be turned off: Tesseract version 4.1.1 writes three tabs as text where a filename should appear. I'd prefer a pragmatic solution.

@tesseract-ocr tesseract-ocr deleted a comment from GSATHYANARAYANA May 5, 2020
@stweil
Contributor

stweil commented May 5, 2020

@bertsky wrote a good summary of the problem, which prevents an easy fix. Basically, the current API needs changes to provide the filename at the right place.

Changing the API would be possible, as we are talking about Tesseract 5, which may be API-incompatible with Tesseract 4. But we also have to consider third-party software like tesserocr, which must work with Tesseract 4 and 5. That possibly makes it difficult.

Writing no sourceImageInformation tag if there is no filename available would be a pragmatic solution (and compatible with ABBYY FineReader). Should that be implemented as an intermediate solution?
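
As a rough sketch of that intermediate solution (not the code that was later committed), the preamble could emit the element only when a usable filename exists. The function names `EscapeXml`, `IsBlank` and `SourceImageInformation` are hypothetical, and treating a whitespace-only name as missing is an assumption.

```cpp
#include <cctype>
#include <string>

// Escapes the characters that must not appear verbatim in XML text content.
std::string EscapeXml(const std::string& text) {
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += c;
    }
  }
  return out;
}

// True if `text` is empty or consists only of whitespace such as tabs.
bool IsBlank(const std::string& text) {
  for (unsigned char c : text) {
    if (!std::isspace(c)) return false;
  }
  return true;
}

// Returns the sourceImageInformation fragment for the ALTO preamble, or an
// empty string if no usable filename is available (the element is optional).
std::string SourceImageInformation(const std::string& file_name) {
  if (IsBlank(file_name)) return "";
  return "<sourceImageInformation>\n"
         "\t<fileName>" + EscapeXml(file_name) + "</fileName>\n"
         "</sourceImageInformation>\n";
}
```

With this approach an empty or tab-only name yields no element at all, while a real filename is written with XML special characters escaped.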

@M3ssman
Contributor

M3ssman commented May 5, 2020

If these are the options, it's preferable to skip the output of data lacking any value.

@bertsky
Contributor

bertsky commented May 5, 2020

> Tesseract version 4.1.1 writes three tabs as text where a filename should appear.

@M3ssman I don't understand where these three tabs come from in the current implementation. In my understanding, sourceImageInformation/fileName should already be empty unless Tesseract is explicitly called with -c document_title=somefile.tif etc. (Whatever is in there gets passed to title via BeginDocument() in all renderers.)

> Writing no sourceImageInformation tag if there is no filename available would be a pragmatic solution

@stweil did you mean we have to omit the element sourceImageInformation altogether when fileName would be empty? (This would be a simple change I can make as part of #2815.)

> Changing the API would be possible, as we are talking about Tesseract 5, which may be API-incompatible with Tesseract 4. But we also have to consider third-party software like tesserocr, which must work with Tesseract 4 and 5. That possibly makes it difficult.

Difficult yes, but worthwhile: We have a structural problem (multi-input with single- vs multi-output renderers) related to API and CLI which will not go away. Avoiding API breakage because there are already early adopters effectively means locking down the API forever. Instead we should actively support the transition in modules like tesserocr. (It also still contains many workarounds from the 3-4 transition.)

@M3ssman
Contributor

M3ssman commented May 5, 2020

@bertsky Sorry, I messed up. Tesseract 4.1.1 produces empty elements like

<sourceImageInformation>
	<fileName></fileName>
</sourceImageInformation>

Tesseract 4.1.0 produced the tab output

<sourceImageInformation>
	<fileName>			</fileName>
</sourceImageInformation>

@stweil
Contributor

stweil commented Aug 8, 2021

Pull request #3517 is merged now in Git master, so this issue can be closed.
Fixing the issue requires API changes, so there won't be a solution for the 4.1 branch.

@stweil stweil closed this as completed Aug 8, 2021
@bertsky
Contributor

bertsky commented Aug 16, 2021

> Pull request #3517 is merged now in Git master, so this issue can be closed.
> Fixing the issue requires API changes, so there won't be a solution for the 4.1 branch.

@stweil, okay so in #3517 you went for the "simpler solution" sketched above, which does not generalize to the multi-input case:

> An alternative, much simpler solution could be to just return with an error when ALTO output is requested in the multi-input case. (But some structural changes are required even in the single-input case, because the input filename still needs to enter the preamble.)

Thus, IMO you still need to abort with an error if the ALTO renderer is requested for multi-input (multi-page TIFF or multi-line text file). In the current implementation, only the first page will have the correct sourceImageInformation, the others will silently get the wrong reference.

stweil added a commit that referenced this issue Aug 18, 2021
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Aug 18, 2021

> If these are the options, it's preferable to skip the output of data lacking any value.

This is now implemented for 4.1 in commit b2eb72b.

@stweil
Contributor

stweil commented Aug 18, 2021

> Thus, IMO you still need to abort with an error if the ALTO renderer is requested for multi-input (multi-page TIFF or multi-line text file). In the current implementation, only the first page will have the correct sourceImageInformation, the others will silently get the wrong reference.

For multi-page TIFF the current solution works perfectly: there is only a single image file, and it is correctly named in sourceImageInformation.

If Tesseract processes a list of image files, sourceImageInformation gets the name of the first image file. The ALTO standard does not support a better solution as far as I see. As the rest of the ALTO output is fine, I don't think that we should abort for this use case.

@bertsky
Contributor

bertsky commented Aug 18, 2021

> For multi-page TIFF the current solution works perfectly: there is only a single image file, and it is correctly named in sourceImageInformation.

Good point!

> If Tesseract processes a list of image files, sourceImageInformation gets the name of the first image file. The ALTO standard does not support a better solution as far as I see. As the rest of the ALTO output is fine, I don't think that we should abort for this use case.

Sorry, I thought I had seen a multi-output case before (using outputbase with different suffixes), which would have warranted an expectation on the user's side that the ALTO renderer works. But since that is not the case – CLI output is always single-file – I fully agree aborting would be wrong.

A warning for each single-page renderer (ALTO, hOCR, ...) active in a multi-input run would still be nice, though.
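
Such a warning could look roughly like the sketch below. It is only an illustration under stated assumptions: `RendererInfo` and `MultiInputWarnings` are hypothetical names, the `single_page` flag would have to be provided by each renderer, and the wording of the message is made up.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of an active renderer.
struct RendererInfo {
  std::string name;  // e.g. "alto"
  bool single_page;  // true if the output format describes a single page
};

// Returns one warning per active single-page renderer when the run has more
// than one input page; returns nothing for single-input runs.
std::vector<std::string> MultiInputWarnings(
    const std::vector<RendererInfo>& renderers, std::size_t input_count) {
  std::vector<std::string> warnings;
  if (input_count <= 1) return warnings;
  for (const RendererInfo& r : renderers) {
    if (r.single_page) {
      warnings.push_back("Warning: renderer '" + r.name +
                         "' describes a single page, but " +
                         std::to_string(input_count) + " inputs were given");
    }
  }
  return warnings;
}
```

The run itself would continue as before; the caller would only print the returned lines, for example to stderr.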
