Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fields with periods are truncated #324

Open
terrafrost opened this issue Dec 28, 2023 · 2 comments
Open

fields with periods are truncated #324

terrafrost opened this issue Dec 28, 2023 · 2 comments

Comments

@terrafrost
Copy link

terrafrost commented Dec 28, 2023

So I have a PDF with just one field on it - a field named "xxx.yyy". When I run pdf2json 3.0.5 on the PDF I'm told that the only field on that PDF is "yyy".

test.pdf demonstrates the problem.

Here's what Adobe Acrobat Pro 2020 shows:

image

pdftk 2.02 also finds "xxx.yyy" when I run pdftk test.pdf dump_data_fields:

FieldType: Text
FieldName: xxx.yyy
FieldFlags: 0
FieldJustification: Left

Unfortunately, pdftk doesn't return the coordinates whereas pdf2json does.

According to qpdf test.pdf --json the field's alternativename, fullname and mappingname are "xxx.yyy" whereas the partialname is "yyy" so maybe that's the issue?

@terrafrost
Copy link
Author

terrafrost commented Jan 3, 2024

So I used qpdf's QDF mode (qpdf test.pdf --qdf test.qdf) to further dig into this and I guess the issue is that when there are dots the dots are treated as parent objects.

%% Object stream: object 7, index 2; original object ID: 24
<<
  /DA (/Helv 12 Tf 0 g)
  /F 4
  /FT /Tx
  /MK <<
  >>
  /P 21 0 R
  /Parent 17 0 R
  /Rect [
    190.784
    658.903
    340.784
    680.903
  ]
  /Subtype /Widget
  /T (yyy)
  /Type /Annot
>>

So if you look at the /T tag in isolation you get yyy. The xxx is due to the /Parent 17 0 R bit:

%% Object stream: object 17, index 0; original object ID: 10
<<
  /Kids [
    7 0 R
  ]
  /T (xxx)
>>

So I guess what pdf2json needs to do is to recursively go back and find each parent until there is no parent and it needs to prepend each parent to the /T tag with dots separating each part.

@terrafrost
Copy link
Author

After I encountered this issue I tried parse the PDF demonstrating this issue with other PDF parsers and https://github.com/smalot/pdfparser was able to parse fields with one dot in them but when the field had two or more dots in it it broke. I filed a bug report against that package and they fixed it. Quoting their response:

The test.pdf document doesn't contain an encoding value, so a default must be assumed.

According to the PDF Reference 1.7, the default encoding should be 'StandardEncoding'. PdfParser currently does not supply any default, so when it queries for the BaseEncoding, it returns an empty string.

I mean, maybe this is more of an issue with https://github.com/mozilla/pdf.js than it is with pdf2json but if that were the case I should think that the devs of this package - with their superior knowledge of the pdf.js API - should be able to create a reproduceable example of the issue using their API and then file a bug report with them...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant