tesseract ocr big size pic dump #3885

Open
wuyang-dl opened this issue Aug 2, 2022 · 2 comments

Hi,
The API function void TessBaseAPI::SetImage(Pix *pix) crashes with a core dump when it handles a very large image and the system does not have enough memory.

void TessBaseAPI::SetImage(Pix *pix) {
  if (InternalSetImage()) {
    if (pixGetSpp(pix) == 4 && pixGetInputFormat(pix) == IFF_PNG) {
      // remove alpha channel from png
      Pix *p1 = pixRemoveAlpha(pix);
      pixSetSpp(p1, 3);
      (void)pixCopy(pix, p1); // <---- bug: return value is ignored
      pixDestroy(&p1);
    }
    thresholder_->SetImage(pix);
    SetInputImage(thresholder_->GetPixRect());
  }
}

The Leptonica function pixCopy(pix, p1) returns pixd, or NULL on error,
so its return value needs to be checked.

Environment

  • Tesseract Version: 5.2.0
  • Commit Number:
  • Platform: Windows 10 32-bit; I think other platforms have the same problem

Current Behavior:

Tesseract crashes (core dump).

Expected Behavior:

Tesseract OCR completes without crashing.

Suggested Fix:

Possible fix:
void TessBaseAPI::SetImage(Pix *pix) {
  if (InternalSetImage()) {
    if (pixGetSpp(pix) == 4 && pixGetInputFormat(pix) == IFF_PNG) {
      // remove alpha channel from png
      Pix *p1 = pixRemoveAlpha(pix);
      pixSetSpp(p1, 3);

      // fix-begin
      if (pixCopy(pix, p1) == NULL) {
        pixDestroy(&p1);
        recognition_done_ = false; // maybe
        return;
      }
      // fix-end

      pixDestroy(&p1);
    }
    thresholder_->SetImage(pix);
    SetInputImage(thresholder_->GetPixRect());
  }
}

Thanks

GerHobbelt added a commit to GerHobbelt/tesseract that referenced this issue Feb 27, 2023
… input images. Available to both userland and tesseract internal code, these can be used to report & early fail images which are too large to fit in memory.

Some very lenient defaults are used for the memory pressure allowance (1.5 GByte for 32bit builds, 64GByte for 64bit builds) but this can be tweaked to your liking and local machine shop via Tesseract Global Variable `allowed_image_memory_capacity` (DOUBLE type).

NOTE: the allowance limit can be effectively removed by setting this variable to an 'insane' value, e.g. `1.0e30`.
HOWEVER, the CheckAndReportIfImageTooLarge() API will still fire for images with either width or height dimension >= TDIMENSION_MAX, which in the default build is the classic INT16_MAX (32767px); when compiled with defined(LARGE_IMAGES), the width/height limit is raised to 24 bits, i.e. ~16.7 million px per dimension, which would then tolerate images smaller than 16777216 x 16777216px. (This latter part is a work-in-progress.)

Related:

- tesseract-ocr#3184
- tesseract-ocr#3885
- tesseract-ocr#3435 (pullreq by @stweil -- WIP)

# Conflicts:
#	src/api/baseapi.cpp
#	src/ccmain/tesseractclass.h
#	src/ccmain/thresholder.cpp
#	src/ccutil/params.h
#	src/textord/tordmain.cpp
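
The kind of early size check described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the fork: `ImageLimits`, `DefaultLimits` and `ImageTooLarge` are hypothetical names, "GByte" is assumed to mean 2^30 bytes, and the memory estimate is simply width x height x bytes per pixel, whereas the real check may account for additional working buffers.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper types, NOT the fork's actual implementation.
struct ImageLimits {
  double allowed_bytes;        // memory allowance in bytes
  std::int64_t max_dimension;  // width or height >= this value is rejected
};

// Defaults mirroring the numbers quoted in the commit message:
// 1.5 GByte for 32-bit builds, 64 GByte for 64-bit builds, and
// INT16_MAX as the per-dimension limit of a default build.
// Assumption: 1 GByte is taken to be 2^30 bytes.
inline ImageLimits DefaultLimits() {
  const double gbyte = 1024.0 * 1024.0 * 1024.0;
  const bool is_64bit = sizeof(void *) >= 8;
  return {is_64bit ? 64.0 * gbyte : 1.5 * gbyte,
          std::numeric_limits<std::int16_t>::max()};
}

// Returns true when the image should be rejected before any allocation.
// Non-positive dimensions or pixel sizes are rejected as well.
inline bool ImageTooLarge(std::int64_t width, std::int64_t height,
                          int bytes_per_pixel, const ImageLimits &limits) {
  if (width <= 0 || height <= 0 || bytes_per_pixel <= 0) {
    return true;
  }
  // The dimension limit applies even if the memory allowance is huge.
  if (width >= limits.max_dimension || height >= limits.max_dimension) {
    return true;
  }
  // Estimate in double precision so the product cannot overflow.
  const double needed = static_cast<double>(width) *
                        static_cast<double>(height) * bytes_per_pixel;
  return needed > limits.allowed_bytes;
}
```

A caller would run such a check before handing the image to SetImage() and report an error instead of proceeding. Setting `allowed_bytes` to something like `1.0e30` effectively disables the memory test while leaving the dimension test in place, which matches the behaviour the NOTE describes.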

amitdo (Collaborator) commented Apr 23, 2023

@stweil,

IMO, we should undo 57b79742920c

stweil (Contributor) commented Apr 23, 2023

That would not fix the issue here, which is caused by missing error handling.
