fields with periods are truncated #324

terrafrost · 2023-12-28T19:09:43Z

So I have a PDF with just one field on it - a field named "xxx.yyy". When I run pdf2json 3.0.5 on the PDF I'm told that the only field on that PDF is "yyy".

test.pdf demonstrates the problem.

Here's what Adobe Acrobat Pro 2020 shows:

pdftk 2.02 also finds "xxx.yyy" when I run pdftk test.pdf dump_data_fields:

FieldType: Text
FieldName: xxx.yyy
FieldFlags: 0
FieldJustification: Left

Unfortunately, pdftk doesn't return the coordinates whereas pdf2json does.

According to qpdf test.pdf --json the field's alternativename, fullname and mappingname are "xxx.yyy" whereas the partialname is "yyy" so maybe that's the issue?

terrafrost · 2024-01-03T22:26:45Z

So I used qpdf's QDF mode (qpdf test.pdf --qdf test.qdf) to dig into this further, and I guess the issue is that the dots in a field name separate the names of nested objects: each part before a dot comes from a parent object.

%% Object stream: object 7, index 2; original object ID: 24
<<
  /DA (/Helv 12 Tf 0 g)
  /F 4
  /FT /Tx
  /MK <<
  >>
  /P 21 0 R
  /Parent 17 0 R
  /Rect [
    190.784
    658.903
    340.784
    680.903
  ]
  /Subtype /Widget
  /T (yyy)
  /Type /Annot
>>

So if you look at the /T entry in isolation you get yyy. The xxx comes from the object that /Parent 17 0 R points to:

%% Object stream: object 17, index 0; original object ID: 10
<<
  /Kids [
    7 0 R
  ]
  /T (xxx)
>>

So I guess what pdf2json needs to do is walk up the /Parent chain until it reaches an object with no parent, prepending each parent's /T value to the field's own /T value, with dots separating the parts.
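To make the idea concrete, here is a minimal sketch of that parent walk. It is not pdf2json or pdf.js code: the `objects` dictionary and the `full_field_name` helper are hypothetical, and the two entries just mirror the QDF objects 7 and 17 quoted above, with indirect references such as 17 0 R reduced to plain object numbers.

```python
# Hypothetical in-memory model of the two objects from the QDF dump:
# object number -> dictionary entries. "T" is the partial field name and
# "Parent" is the object number of the parent field (17 0 R -> 17).
objects = {
    7: {"T": "yyy", "Parent": 17},
    17: {"T": "xxx", "Kids": [7]},
}


def full_field_name(objects, obj_num):
    """Join the /T values from the root ancestor down to obj_num with dots."""
    parts = []
    seen = set()
    current = obj_num
    while current is not None:
        # Guard against a malformed file whose /Parent chain loops.
        if current in seen:
            raise ValueError("cycle in /Parent chain")
        seen.add(current)
        node = objects[current]
        # Objects without a /T entry contribute nothing to the name.
        partial = node.get("T")
        if partial is not None:
            parts.append(partial)
        current = node.get("Parent")
    # Parts were collected child-first, so reverse before joining.
    return ".".join(reversed(parts))


print(full_field_name(objects, 7))
```

The walk is written as a loop rather than recursion so that a deep or cyclic /Parent chain in a damaged PDF cannot exhaust the call stack.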

terrafrost · 2024-10-02T13:06:56Z

After I encountered this issue I tried parsing the PDF that demonstrates it with other PDF parsers. https://github.com/smalot/pdfparser was able to parse fields with one dot in them, but it broke when the field had two or more dots in it. I filed a bug report against that package and they fixed it. Quoting their response:

The test.pdf document doesn't contain an encoding value, so a default must be assumed.

According to the PDF Reference 1.7, the default encoding should be 'StandardEncoding'. PdfParser currently does not supply any default, so when it queries for the BaseEncoding, it returns an empty string.

I mean, maybe this is more of an issue with https://github.com/mozilla/pdf.js than it is with pdf2json. But if that were the case, I should think that the devs of this package, with their superior knowledge of the pdf.js API, should be able to create a reproducible example of the issue using that API and then file a bug report with them...
