
Watermark detection (but not removal) #29

Open
kanzure opened this issue Jul 8, 2013 · 3 comments
kanzure commented Jul 8, 2013

Make a way to detect whether a document is likely to have a watermark. There are a few detection approaches I can imagine:

  • analyzing a PDF for text that looks like a watermark
  • rendering the PDF to PNG, then analyzing the margins for blocks that probably contain IP addresses, especially if those blocks repeat on each page
  • when given a PDF and its source URL, consulting a pre-seeded table of information about whether that specific publisher tends to add watermarks
  • given a PDF with no URL, running some routines to detect whether the paper was published by Elsevier, Springer, IEEE, or whoever, then looking that publisher up in a table to decide whether the PDF probably has a watermark

Knowing that a watermark is present is really helpful, because it means you can track what percentage of your collection is watermarked. Other tools can then make informed decisions about what to do with a paper that has a known watermark.

Unknown watermarks are the worst, but there's no way to detect an unknown unknown.
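The first bullet above (scanning extracted text for watermark-like strings) can be sketched in a few lines. This is an illustrative Python sketch, not part of pdfparanoia; the IP-address regex and the "same token on multiple pages" rule are just one heuristic, and the per-page text would come from something like pdftotext:

```python
import re

# IPv4-like tokens, e.g. "Downloaded from ... by 128.42.1.7".
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def looks_watermarked(page_texts, min_pages=2):
    """Guess whether a document carries a per-download watermark.

    page_texts: list of strings, one per page.
    Returns True when the same IP-like token recurs on at least
    min_pages pages, which is typical of download stamps.
    """
    seen = {}
    for text in page_texts:
        for ip in set(IP_RE.findall(text)):
            seen[ip] = seen.get(ip, 0) + 1
    return any(count >= min_pages for count in seen.values())
```

The repeat-count requirement is what separates a download stamp from an IP address that happens to appear once in the paper's body text.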

ghost commented Jul 8, 2013

Another, half-baked thought: image-ify all pages, discard whitespace, and XOR against the front page (or some other reference standard) to identify pixels that do not vary across pages. Anything invariant is either margin decoration or a common watermark.
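The XOR idea above might look something like this sketch, assuming the pages have already been rasterized and binarized (0 = blank, 1 = ink) by some external step; the function name and representation are illustrative, not anything pdfparanoia actually ships:

```python
def invariant_ink(pages):
    """Given same-sized binarized page rasters (lists of rows of
    0/1 ints), return a mask of pixels that carry ink on the first
    page and are identical on every other page -- candidate margin
    decoration or a shared watermark."""
    first = pages[0]
    h, w = len(first), len(first[0])
    # Start from the first page's ink, then knock out anything that
    # varies on a later page (i.e. real, page-specific content).
    mask = [[first[y][x] for x in range(w)] for y in range(h)]
    for page in pages[1:]:
        for y in range(h):
            for x in range(w):
                if page[y][x] != first[y][x]:
                    mask[y][x] = 0
    return mask
```

In practice a real raster comparison would use an image library rather than nested lists, but the logic is the same: equality across pages is the signal.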


kanzure commented Jul 8, 2013

Cool, but how do you get rid of those elements? You would have to randomly delete PDF elements until the resulting PNGs no longer contain those images. Might work. This technique would also accidentally remove journal titles in the margins, which is bad, but acceptable if there is JSON metadata attached to the PDF somehow.
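The delete-and-re-render loop described above can be sketched as a leave-one-out search. Everything here is a placeholder: `render` and `has_watermark` stand in for a real PDF rasterizer and whatever detector is available (e.g. the XOR filter discussed in this thread), and `elements` is an opaque list of PDF content objects:

```python
def find_watermark_elements(elements, render, has_watermark):
    """Brute-force sketch: drop one element at a time, re-render,
    and see whether the watermark disappears.  Any element whose
    removal makes the detector go quiet is a suspect."""
    suspects = []
    for i in range(len(elements)):
        trimmed = elements[:i] + elements[i + 1:]
        if not has_watermark(render(trimmed)):
            suspects.append(elements[i])
    return suspects
```

This is O(n) renders per document, which is slow, but as noted below it would only need to run once per publisher to build a profile.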

ghost commented Jul 8, 2013

Unless you brute-force attempted to delete each individual element, I figure it's just a rapid filter to help detect watermarks. Of course, brute-force deletion might assist in creating a pdfparanoia profile for a new publisher, so perhaps the one-off inefficiency would prove worthwhile.

A straight XOR would only work if each watermark instance were binary-identical to the next. With any image compression this would likely fail, so perhaps a less stringent comparison: seek bytes/pixels that vary by less than a certain threshold, discarding X outliers based on page count?
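The fuzzy variant suggested above might be sketched like this, assuming grayscale page rasters with values in [0, 1]; the threshold, the outlier-trimming parameter, and the function name are all illustrative choices, not anything from the actual codebase:

```python
def stable_pixels(pages, threshold=0.1, trim=0):
    """Fuzzy replacement for a strict XOR: a pixel counts as
    'stable' (watermark/decoration candidate) when its grayscale
    value varies across pages by less than `threshold`, after
    discarding the `trim` lowest and highest values as outliers.
    This tolerates the small per-page differences that image
    compression introduces."""
    h, w = len(pages[0]), len(pages[0][0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = sorted(page[y][x] for page in pages)
            if trim:
                vals = vals[trim:-trim] or vals
            if vals[-1] - vals[0] < threshold:
                mask[y][x] = 1
    return mask
```

Setting `trim` proportional to page count would implement the "discard X outliers" idea, so a single atypical page (a full-bleed figure, say) doesn't break the mask.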


@joepie91 joepie91 self-assigned this May 26, 2014