[PDF] Fix searching and copy-pasting underscore characters #287

Merged: 1 commit, Jan 15, 2024

atrosinenko
Contributor

In the existing setup, the *.md source files are converted to PDF by pandoc, which invokes pdflatex internally. With the default font encoding, underscore characters inside paragraphs of text behave like whitespace (or are absent altogether) in the produced PDF documents when text is copied from a PDF viewer or searched. This may confuse users, because strings such as `__ARM_FEATURE_name` and `long_function_name` become invisible to the "Search ..." function of a viewer unless they are inside a standalone block of code.

One solution is to use the T1 font encoding and ensure that Type 1 fonts are available (i.e. that pdflatex does not have to fall back to rasterized Type 3 fonts).
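
To try the effect outside the full build, a minimal sketch along these lines may help. It only generates a tiny LaTeX test document, with or without the T1 font encoding; the helper name `make_test_document` and the sample sentence are illustrative assumptions and are not part of this PR.

```python
# Hypothetical helper (not part of this PR): builds a minimal LaTeX source
# for checking how underscores end up in the text layer of the PDF.

def make_test_document(use_t1: bool) -> str:
    """Return a minimal LaTeX document, optionally using T1 font encoding."""
    lines = [r"\documentclass{article}"]
    if use_t1:
        # The font encoding change discussed in this PR.
        lines.append(r"\usepackage[T1]{fontenc}")
    lines += [
        r"\begin{document}",
        # Underscores must be escaped in LaTeX paragraph text.
        r"Identifiers such as \_\_ARM\_FEATURE\_name and long\_function\_name.",
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"
```

Writing `make_test_document(False)` and `make_test_document(True)` to two .tex files, compiling both with pdflatex, and searching the resulting PDFs for `long_function_name` should show whether the encoding change alone affects the text layer on a given TeX installation. Results may differ depending on which fonts are installed.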

Fixes #282.


name: Pull request
about: Technical issues, document format problems, bugs in scripts or feature proposal.


Thank you for submitting a pull request!

If this PR is about a bugfix:

Please use the bugfix label and make sure to go through the checklist below.

If this PR is about a proposal:

We look forward to evaluating your proposal and, if possible,
making it part of the Arm C Language Extension (ACLE) specifications.

We encourage you to read through the contribution guidelines, in
particular the section on submitting a proposal.

Please use the proposal label.

As for any pull request, please make sure to go through the
checklist below.

Checklist: (mark with X those which apply)

  • If an issue reporting the bug exists, I have mentioned it in the
    PR (do not bother creating the issue if all you want to do is
    fix the bug yourself).
  • I have added/updated the SPDX-FileCopyrightText lines at the top
    of any file I have edited. The format is SPDX-FileCopyrightText: Copyright {year} {entity or name} <{contact informations}>
    (Please update existing copyright lines if applicable. You can
    specify year ranges with a hyphen, as in 2017-2019, and use
    commas to separate gaps, as in 2018-2020, 2022).
  • I have updated the Copyright section of the sources of the
    specification I have edited (this will show up in the text
    rendered in the PDF and other supported output formats). The
    format is the same as described in the previous item.
  • I have run the CI scripts (if applicable, as they might be
    tricky to set up on non-*nix machines). The sequence can be
    found in the contribution guidelines. Don't worry if you cannot
    run these scripts on your machine; your patch will be
    automatically checked in the Actions of the pull request.
  • I have added an item that describes the changes I have
    introduced in this PR in the section Changes for next release
    of the section Change Control/Document history of the document.
    Create Changes for next release if it does not exist. Note that
    changes that do not modify the content or rendering of the
    specifications (both HTML and PDF) do not need to be listed.
  • When modifying content and/or its rendering, I have checked the
    correctness of the result in the PDF output (please refer to the
    instructions on how to build the PDFs locally).
  • The variable draftversion is set to true in the YAML header
    of the sources of the specifications I have modified.
  • Please DO NOT add my GitHub profile to the list of contributors
    in the README page of the project.

@atrosinenko
Contributor Author

atrosinenko commented Jan 11, 2024

This turned out to be a known issue with how underscores are represented in PDFs produced by pdflatex (see, for example, this question).

Regarding possible regressions, this unofficial documentation for the fontenc package mentions that switching the font encoding may force pdflatex to use raster fonts (are they always Type 3 fonts?). Additionally, in this PR fontenc is loaded as early as possible, but it may be necessary to load some font-related packages before it. I am not sure whether this is the case for the inconsolata package: I visually compared a few pages of the "old" and "new" versions, and it looks like the monospaced font has changed (at least the underscore characters), but moving \usepackage{inconsolata} to just before the \usepackage[T1]{fontenc} line (in addition to this patch) seems to change nothing.

The PDF documents generated with this patch applied look visually correct and can be searched for identifier names containing underscore characters (though the layout changed a bit). No new Type 3 fonts are listed in the output of pdffonts, but fewer fonts are listed for the "new" version.
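
The searchability check described above can also be approximated without a viewer. The following is a minimal sketch, assuming the pdftotext utility from poppler-utils (the suite that also provides pdffonts) is installed; the function names are illustrative and not part of this PR.

```python
import subprocess
from typing import List


def extract_text(pdf_path: str) -> str:
    """Extract the text layer of a PDF using pdftotext (poppler-utils).

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout


def missing_identifiers(text: str, identifiers: List[str]) -> List[str]:
    """Return the identifiers that do not occur verbatim in the extracted text."""
    return [name for name in identifiers if name not in text]
```

For example, `missing_identifiers(extract_text("acle.pdf"), names)` returns the entries of `names` that a plain-text search would not find. Text extraction by pdftotext is only an approximation of what a particular viewer does, so a manual check in a viewer is still worthwhile.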

@atrosinenko
Contributor Author

Just in case, here is the output of the pdffonts utility for the generated documents:

Without this patch
=== acle.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HCPDYJ+Lato-Regular                  Type 1            Custom           yes yes no    1782  0
NQPXVC+Lato-Bold                     Type 1            Custom           yes yes no    1784  0
KIYOWP+Lato-Italic                   Type 1            Custom           yes yes no    1786  0
SYFPBV+CMMI10                        Type 1            Builtin          yes yes no    1842  0
YPHFQB+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    1843  0
HCPDYJ+Lato-Regular                  Type 1            Custom           yes yes no    2320  0
UABGXL+CMSY10                        Type 1            Builtin          yes yes no    2539  0
ADRRSK+Inconsolatazi4-Bold           Type 1            Custom           yes yes no    2540  0
KIYOWP+Lato-Italic                   Type 1            Custom           yes yes no    2550  0
ZGGNQH+CMMI12                        Type 1            Builtin          yes yes no    2625  0
YPHFQB+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    2733  0
F69                                  Type 3            Custom           yes no  no    3076  0

=== advsimd.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LXZGMB+Lato-Regular                  Type 1            Custom           yes yes no    1280  0
KQVPHX+Lato-Bold                     Type 1            Custom           yes yes no    1282  0
DMJYBT+Lato-Italic                   Type 1            Custom           yes yes no    1284  0
SYFPBV+CMMI10                        Type 1            Builtin          yes yes no    1490  0
LXZGMB+Lato-Regular                  Type 1            Custom           yes yes no    1683  0
LHGFCN+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    1695  0
ZGGNQH+CMMI12                        Type 1            Builtin          yes yes no    5233  0

=== cmse.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
BLIBLO+Lato-Regular                  Type 1            Custom           yes yes no     271  0
BLIBLO+Lato-Regular                  Type 1            Custom           yes yes no     272  0
OBTHMW+Lato-Bold                     Type 1            Custom           yes yes no     274  0
OBTHMW+Lato-Bold                     Type 1            Custom           yes yes no     275  0
VEHMTU+Lato-Italic                   Type 1            Custom           yes yes no     277  0
SNQNWU+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     401  0
LOWLTO+CMSY10                        Type 1            Builtin          yes yes no     489  0
YSBLIL+Inconsolatazi4-Bold           Type 1            Custom           yes yes no     490  0
SYFPBV+CMMI10                        Type 1            Builtin          yes yes no     571  0
EURBLZ+DejaVuSans                    TrueType          WinAnsi          yes yes yes    695  0
SPRGBG+DejaVuSans                    TrueType          WinAnsi          yes yes yes    760  0

=== morello.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
QKJZDY+Lato-Regular                  Type 1            Custom           yes yes no     118  0
AIUTIA+Lato-Bold                     Type 1            Custom           yes yes no     120  0
IKKTZB+Lato-Italic                   Type 1            Custom           yes yes no     122  0
UVILRV+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     157  0
QKJZDY+Lato-Regular                  Type 1            Custom           yes yes no     205  0
CKAXET+Inconsolatazi4-Bold           Type 1            Custom           yes yes no     230  0

=== mve.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TMPYZL+Lato-Regular                  Type 1            Custom           yes yes no     407  0
RKHJOK+Lato-Bold                     Type 1            Custom           yes yes no     409  0
FTKNFF+Lato-Italic                   Type 1            Custom           yes yes no     411  0
TMPYZL+Lato-Regular                  Type 1            Custom           yes yes no     568  0
URWTTH+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     569  0

With this patch applied
=== acle.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HGEBLM+Lato-Regular                  Type 1            Custom           yes yes no    1782  0
WKINJQ+Lato-Bold                     Type 1            Custom           yes yes no    1784  0
KIYOWP+Lato-Italic                   Type 1            Custom           yes yes no    1786  0
UPVBQN+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    1842  0
HGEBLM+Lato-Regular                  Type 1            Custom           yes yes no    2319  0
BMTUMN+Inconsolatazi4-Bold           Type 1            Custom           yes yes no    2538  0
KIYOWP+Lato-Italic                   Type 1            Custom           yes yes no    2548  0
UPVBQN+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    2729  0
F69                                  Type 3            Custom           yes no  no    3072  0

=== advsimd.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
CBTKCX+Lato-Regular                  Type 1            Custom           yes yes no    1280  0
GKMNFC+Lato-Bold                     Type 1            Custom           yes yes no    1282  0
DMJYBT+Lato-Italic                   Type 1            Custom           yes yes no    1284  0
CBTKCX+Lato-Regular                  Type 1            Custom           yes yes no    1682  0
LHGFCN+Inconsolatazi4-Regular        Type 1            Custom           yes yes no    1694  0

=== cmse.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IOXFSS+Lato-Regular                  Type 1            Custom           yes yes no     271  0
IOXFSS+Lato-Regular                  Type 1            Custom           yes yes no     272  0
OBTHMW+Lato-Bold                     Type 1            Custom           yes yes no     274  0
OBTHMW+Lato-Bold                     Type 1            Custom           yes yes no     275  0
VEHMTU+Lato-Italic                   Type 1            Custom           yes yes no     277  0
QVEWWD+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     401  0
YSBLIL+Inconsolatazi4-Bold           Type 1            Custom           yes yes no     489  0
EURBLZ+DejaVuSans                    TrueType          WinAnsi          yes yes yes    693  0
SPRGBG+DejaVuSans                    TrueType          WinAnsi          yes yes yes    758  0

=== morello.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
QKJZDY+Lato-Regular                  Type 1            Custom           yes yes no     118  0
AIUTIA+Lato-Bold                     Type 1            Custom           yes yes no     120  0
IKKTZB+Lato-Italic                   Type 1            Custom           yes yes no     122  0
UVILRV+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     157  0
QKJZDY+Lato-Regular                  Type 1            Custom           yes yes no     205  0
GQBMYK+Inconsolatazi4-Bold           Type 1            Custom           yes yes no     230  0

=== mve.pdf ===
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
MPFLDN+Lato-Regular                  Type 1            Custom           yes yes no     407  0
RKHJOK+Lato-Bold                     Type 1            Custom           yes yes no     409  0
FTKNFF+Lato-Italic                   Type 1            Custom           yes yes no     411  0
MPFLDN+Lato-Regular                  Type 1            Custom           yes yes no     568  0
URWTTH+Inconsolatazi4-Regular        Type 1            Custom           yes yes no     569  0
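
The comparison of the two listings above can be automated. The following is a minimal sketch that parses the text printed by pdffonts and reports font types and base font names; it assumes the column layout shown above, and the function names are illustrative. The six-letter subset tag (e.g. `HCPDYJ+`) is stripped so that fonts can be compared across builds.

```python
import re
from typing import List, Set, Tuple

# A font row ends with: emb sub uni (yes/no), object number, generation.
_ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+(?P<enc>\S+)\s+"
    r"(?:yes|no)\s+(?:yes|no)\s+(?:yes|no)\s+\d+\s+\d+\s*$"
)


def parse_pdffonts(output: str) -> List[Tuple[str, str]]:
    """Return (base font name, font type) pairs from pdffonts output."""
    fonts = []
    for line in output.splitlines():
        match = _ROW.match(line.strip())
        if match is None:
            continue  # header, separator, file title, or blank line
        # Drop the subset tag, e.g. "HCPDYJ+Lato-Regular" -> "Lato-Regular".
        base_name = re.sub(r"^[A-Z]{6}\+", "", match.group("name"))
        fonts.append((base_name, match.group("type")))
    return fonts


def base_fonts(output: str) -> Set[str]:
    """Return the set of base font names listed in pdffonts output."""
    return {name for name, _ in parse_pdffonts(output)}


def type3_fonts(output: str) -> List[str]:
    """Return the names of all fonts reported as Type 3."""
    return [name for name, kind in parse_pdffonts(output) if kind == "Type 3"]
```

Computing `base_fonts(before) - base_fonts(after)` for each document shows which fonts are no longer embedded. In the listings above these are the Computer Modern fonts CMMI10, CMMI12 and CMSY10, while the single Type 3 font F69 in acle.pdf is present both before and after the patch.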

Another relevant link: https://tex.stackexchange.com/questions/345866/when-should-package-fontenc-be-used-with-pdflatex

@atrosinenko atrosinenko marked this pull request as ready for review January 12, 2024 12:41
@vhscampos
Member

Hi, thanks for your Pull Request. We aim to review it as soon as possible.

@vhscampos
Member

LGTM. Thanks for the analysis and the fix!

@vhscampos vhscampos merged commit 0de08fd into ARM-software:main Jan 15, 2024
4 checks passed
@vhscampos
Member

@all-contributors please add @atrosinenko for code.

Contributor

@vhscampos

I've put up a pull request to add @atrosinenko! 🎉

Successfully merging this pull request may close these issues.

[BUG] [PDF] Identifiers with underscores are not searchable inside paragraphs
2 participants