PIL Image size limits #168

Open

RichardScottOZ opened this issue Aug 12, 2022 · 2 comments

@RichardScottOZ (Contributor)

Probably not an issue in a current journal-based workflow.

However, with older material this tends to happen:

worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | ERROR :: 2022-08-12 10:18:02,769 :: Image size (305490136 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
worker_1         | Traceback (most recent call last):
worker_1         |   File "/ingestion/ingest/ingest.py", line 297, in pdf_to_images
worker_1         |     img = Image.open(bytesio).convert('RGB')
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3009, in open
worker_1         |     im = _open_core(fp, filename, prefix, formats)
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 2996, in _open_core
worker_1         |     _decompression_bomb_check(im.size)
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 2905, in _decompression_bomb_check
worker_1         |     raise DecompressionBombError(
worker_1         | PIL.Image.DecompressionBombError: Image size (305490136 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
worker_1         | ERROR :: 2022-08-12 10:18:02,770 :: Image opening error pdf: Rec1951_067.pdf
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - INFO - full garbage collection released 31.17 MiB from 9565 reference cycles (threshold: 9.54 MiB)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)

Is there any reason not to have the limit be None [other than not having come across this]? The only downside is that RAM could blow up, but on machines of the size used here that seems pretty unlikely, given the document sizes are known too.
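For reference, a minimal sketch of what lifting the limit could look like around the `Image.open` call. The `open_page_image` helper and the 400 M figure are illustrative, not part of the ingestion code; `MAX_IMAGE_PIXELS` is a module-level Pillow setting, so changing it affects every open in the process:

```python
import io
from PIL import Image

# Option 1: disable the decompression-bomb check entirely. The default limit is
# ~178 million pixels; anything over twice that raises DecompressionBombError.
Image.MAX_IMAGE_PIXELS = None

# Option 2: keep a finite but larger limit (the value here is only illustrative).
# Image.MAX_IMAGE_PIXELS = 400_000_000

def open_page_image(page_bytes: bytes) -> Image.Image:
    """Open a rasterised PDF page from raw bytes, as pdf_to_images does."""
    return Image.open(io.BytesIO(page_bytes)).convert("RGB")
```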

@iross (Contributor) commented Aug 15, 2022

Thanks for the report! There's no reason we can't adjust the limit up a bit, but I suspect this is the result of an out-of-distribution document not being handled very well by the segmentation algorithm, meaning it's trying to classify the full page rather than individual sections of it. We'll increase that limit next go-around, but I suspect the results won't be meaningful unless updates to the segmentation process improve the way we break up the sections on the page.

@RichardScottOZ (Contributor, Author)

I have seen this before in older, not very well produced or non-standard documents; for some reason it happens.

E.g. doing generic ad hoc extraction of a few things here and there, I have come across the problem.

I think in this case you would have to have no limit, as you just get arbitrarily large pixel counts.

So these documents would possibly have to be reprocessed somehow to work with your pipeline?
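One possible way to reprocess such documents, as a sketch: lift the limit only while reading the oversized page, downscale it below Pillow's default threshold, and write it back out for the unmodified pipeline to consume. The `MAX_PIXELS` value and the helper name are assumptions for illustration:

```python
from PIL import Image

# Illustrative cap: keep pages below Pillow's default ~178 MP threshold so a
# pipeline running with the stock limit can still open them.
MAX_PIXELS = 150_000_000

def shrink_oversized_page(src_path: str, dst_path: str) -> None:
    """Downscale a page image if it exceeds MAX_PIXELS and save it to dst_path."""
    old_limit = Image.MAX_IMAGE_PIXELS
    Image.MAX_IMAGE_PIXELS = None  # lift the check only while reading this file
    try:
        with Image.open(src_path) as img:
            w, h = img.size
            if w * h > MAX_PIXELS:
                scale = (MAX_PIXELS / (w * h)) ** 0.5
                img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
            img.convert("RGB").save(dst_path)
    finally:
        Image.MAX_IMAGE_PIXELS = old_limit
```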
