PIL Image size limits #168

Open

RichardScottOZ opened this issue Aug 12, 2022 · 2 comments

@RichardScottOZ (Contributor)

Probably not an issue in a current journal-based workflow.

However, with older material this tends to happen:

worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | ERROR :: 2022-08-12 10:18:02,769 :: Image size (305490136 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
worker_1         | Traceback (most recent call last):
worker_1         |   File "/ingestion/ingest/ingest.py", line 297, in pdf_to_images
worker_1         |     img = Image.open(bytesio).convert('RGB')
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3009, in open
worker_1         |     im = _open_core(fp, filename, prefix, formats)
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 2996, in _open_core
worker_1         |     _decompression_bomb_check(im.size)
worker_1         |   File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 2905, in _decompression_bomb_check
worker_1         |     raise DecompressionBombError(
worker_1         | PIL.Image.DecompressionBombError: Image size (305490136 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
worker_1         | ERROR :: 2022-08-12 10:18:02,770 :: Image opening error pdf: Rec1951_067.pdf
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 32% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - INFO - full garbage collection released 31.17 MiB from 9565 reference cycles (threshold: 9.54 MiB)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
worker_1         | distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)

Is there any reason not to have the limit be None [other than not having come across this]? The only downside is that RAM could blow up, but on machines of the size used here that seems pretty unlikely, given the document sizes are known too.
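For reference, a minimal sketch of what lifting the limit could look like around the `Image.open` call. The `open_page_image` helper and the 400 M figure are illustrative, not part of the ingestion code; `MAX_IMAGE_PIXELS` is a module-level Pillow setting, so changing it affects every open in the process:

```python
import io
from PIL import Image

# Option 1: disable the decompression-bomb check entirely. The default limit is
# ~178 million pixels; anything over twice that raises DecompressionBombError.
Image.MAX_IMAGE_PIXELS = None

# Option 2: keep a finite but larger limit (the value here is only illustrative).
# Image.MAX_IMAGE_PIXELS = 400_000_000

def open_page_image(page_bytes: bytes) -> Image.Image:
    """Open a rasterised PDF page from raw bytes, as pdf_to_images does."""
    return Image.open(io.BytesIO(page_bytes)).convert("RGB")
```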

@iross (Contributor) commented Aug 15, 2022

Thanks for the report! There's no reason we can't adjust the limit up a bit, but I suspect this is the result of an out-of-distribution document not being handled very well by the segmentation algorithm, meaning it's trying to classify the full page rather than individual sections of it. We'll increase that limit next go-around, but I suspect the results won't be meaningful unless updates to the segmentation process improve the way we break up the sections on the page.

@RichardScottOZ (Contributor, Author)

I have seen this before in older, not very well produced or non-standard documents; for some reason it happens.

E.g. doing generic ad hoc extraction of a few things here and there, I have come across the problem.

I think in this case you would have to have no limit, as you just get arbitrarily large pixel counts.

So these documents would possibly have to be reprocessed somehow to work with your pipeline?
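One possible way to reprocess such documents, as a sketch: lift the limit only while reading the oversized page, downscale it below Pillow's default threshold, and write it back out for the unmodified pipeline to consume. The `MAX_PIXELS` value and the helper name are assumptions for illustration:

```python
from PIL import Image

# Illustrative cap: keep pages below Pillow's default ~178 MP threshold so a
# pipeline running with the stock limit can still open them.
MAX_PIXELS = 150_000_000

def shrink_oversized_page(src_path: str, dst_path: str) -> None:
    """Downscale a page image if it exceeds MAX_PIXELS and save it to dst_path."""
    old_limit = Image.MAX_IMAGE_PIXELS
    Image.MAX_IMAGE_PIXELS = None  # lift the check only while reading this file
    try:
        with Image.open(src_path) as img:
            w, h = img.size
            if w * h > MAX_PIXELS:
                scale = (MAX_PIXELS / (w * h)) ** 0.5
                img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
            img.convert("RGB").save(dst_path)
    finally:
        Image.MAX_IMAGE_PIXELS = old_limit
```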
