
Watermark detection (but not removal) #29

Open
kanzure opened this issue Jul 8, 2013 · 3 comments
kanzure commented Jul 8, 2013

Make a way to detect whether a document is likely to have a watermark. There are a few detection approaches I can imagine:

  • analyzing a PDF for text that looks like a watermark
  • rendering the PDF to PNG, then analyzing the margins for blocks that probably contain IP addresses, especially if those blocks repeat on each page
  • when given a PDF and its source URL, consulting a pre-seeded table of information about whether that specific publisher tends to add watermarks
  • given a PDF with no URL, running some routines to detect whether the paper was published by Elsevier, Springer, IEEE, or whoever, then looking that publisher up in a table to decide whether the PDF probably has a watermark

Knowing that a watermark is present is really helpful, because it means you can track what percentage of your collection is watermarked. Other tools can then make informed decisions about what to do with a paper that has a known watermark.

Unknown watermarks are the worst, but there's no way to detect an unknown unknown.
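The first bullet above (scanning extracted text for watermark-like strings) can be sketched in a few lines. This is an illustrative Python sketch, not part of pdfparanoia; the IP-address regex and the "same token on multiple pages" rule are just one heuristic, and the per-page text would come from something like pdftotext:

```python
import re

# IPv4-like tokens, e.g. "Downloaded from ... by 128.42.1.7".
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def looks_watermarked(page_texts, min_pages=2):
    """Guess whether a document carries a per-download watermark.

    page_texts: list of strings, one per page.
    Returns True when the same IP-like token recurs on at least
    min_pages pages, which is typical of download stamps.
    """
    seen = {}
    for text in page_texts:
        for ip in set(IP_RE.findall(text)):
            seen[ip] = seen.get(ip, 0) + 1
    return any(count >= min_pages for count in seen.values())
```

The repeat-count requirement is what separates a download stamp from an IP address that happens to appear once in the paper's body text.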

ghost commented Jul 8, 2013

Another, half-baked thought: image-ify all pages, discard whitespace, and XOR against the front page (or some other reference standard) to identify pixels that do not vary across pages. Anything invariant is either margin decoration or a common watermark.
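The XOR idea above might look something like this sketch, assuming the pages have already been rasterized and binarized (0 = blank, 1 = ink) by some external step; the function name and representation are illustrative, not anything pdfparanoia actually ships:

```python
def invariant_ink(pages):
    """Given same-sized binarized page rasters (lists of rows of
    0/1 ints), return a mask of pixels that carry ink on the first
    page and are identical on every other page -- candidate margin
    decoration or a shared watermark."""
    first = pages[0]
    h, w = len(first), len(first[0])
    # Start from the first page's ink, then knock out anything that
    # varies on a later page (i.e. real, page-specific content).
    mask = [[first[y][x] for x in range(w)] for y in range(h)]
    for page in pages[1:]:
        for y in range(h):
            for x in range(w):
                if page[y][x] != first[y][x]:
                    mask[y][x] = 0
    return mask
```

In practice a real raster comparison would use an image library rather than nested lists, but the logic is the same: equality across pages is the signal.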


kanzure commented Jul 8, 2013

Cool, but how do you get rid of those elements? You would have to randomly delete PDF elements until the resulting PNGs no longer contain those images. Might work. This technique would also accidentally remove journal titles in the margins, which is bad, but acceptable if there is JSON metadata attached to the PDF somehow.
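The delete-and-re-render loop described above can be sketched as a leave-one-out search. Everything here is a placeholder: `render` and `has_watermark` stand in for a real PDF rasterizer and whatever detector is available (e.g. the XOR filter discussed in this thread), and `elements` is an opaque list of PDF content objects:

```python
def find_watermark_elements(elements, render, has_watermark):
    """Brute-force sketch: drop one element at a time, re-render,
    and see whether the watermark disappears.  Any element whose
    removal makes the detector go quiet is a suspect."""
    suspects = []
    for i in range(len(elements)):
        trimmed = elements[:i] + elements[i + 1:]
        if not has_watermark(render(trimmed)):
            suspects.append(elements[i])
    return suspects
```

This is O(n) renders per document, which is slow, but as noted below it would only need to run once per publisher to build a profile.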

ghost commented Jul 8, 2013

Unless you brute-force attempted to delete each individual element, I figure it's just a rapid filter to help detect watermarks. Of course, brute-force deletion might assist in creating a pdfparanoia profile for a new publisher, so perhaps the one-off inefficiency would prove worthwhile.

A straight XOR would only work if each watermark instance were binary-identical to the next. With any image compression this would likely fail, so perhaps a less stringent comparison: seek bytes/pixels that vary by less than a certain threshold, discarding X outliers based on page count?
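The fuzzy variant suggested above might be sketched like this, assuming grayscale page rasters with values in [0, 1]; the threshold, the outlier-trimming parameter, and the function name are all illustrative choices, not anything from the actual codebase:

```python
def stable_pixels(pages, threshold=0.1, trim=0):
    """Fuzzy replacement for a strict XOR: a pixel counts as
    'stable' (watermark/decoration candidate) when its grayscale
    value varies across pages by less than `threshold`, after
    discarding the `trim` lowest and highest values as outliers.
    This tolerates the small per-page differences that image
    compression introduces."""
    h, w = len(pages[0]), len(pages[0][0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = sorted(page[y][x] for page in pages)
            if trim:
                vals = vals[trim:-trim] or vals
            if vals[-1] - vals[0] < threshold:
                mask[y][x] = 1
    return mask
```

Setting `trim` proportional to page count would implement the "discard X outliers" idea, so a single atypical page (a full-bleed figure, say) doesn't break the mask.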


@joepie91 joepie91 self-assigned this May 26, 2014