Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build with pdf2htmlEX #380

Merged
merged 106 commits into from
Sep 14, 2024
Merged

Build with pdf2htmlEX #380

merged 106 commits into from
Sep 14, 2024

Conversation

ViliusSutkus89
Copy link
Contributor

WIP. Currently just linking. Need to add actual usage

@ViliusSutkus89
Copy link
Contributor Author

I'm gonna go with the conclusion that conan with msvc-1939 is broken:

freetype/2.13.2: Meson configure cmd: meson setup --native-file "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release\conan\conan_meson_native.ini" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\src" --prefix=/
freetype/2.13.2: RUN: meson setup --native-file "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release\conan\conan_meson_native.ini" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\build-release" "C:\Users\runneradmin\.conan2\p\b\freet6a5c8a94fbf5c\b\src" --prefix=/
conanvcvars.bat: Activating environment Visual Studio 17 - amd64 - winsdk_version=None - vcvars_ver=14.3
[ERROR:vcvars.bat] Toolset directory for version '14.3' was not found.
[ERROR:VsDevCmd.bat] *** VsDevCmd.bat encountered errors. Environment may be incomplete and/or incorrect. ***
[ERROR:VsDevCmd.bat] In an uninitialized command prompt, please 'set VSCMD_DEBUG=[value]' and then re-run
[ERROR:VsDevCmd.bat] vsdevcmd.bat [args] for additional details.
[ERROR:VsDevCmd.bat] Where [value] is:
[ERROR:VsDevCmd.bat]    1 : basic debug logging
[ERROR:VsDevCmd.bat]    2 : detailed debug logging
[ERROR:VsDevCmd.bat]    3 : trace level logging. Redirection of output to a file when using this level is recommended.
[ERROR:VsDevCmd.bat] Example: set VSCMD_DEBUG=3
[ERROR:VsDevCmd.bat]          vsdevcmd.bat > vsdevcmd.trace.txt 2>&1

Do we need msvc-1939 support for odr.core? Freetype is not our recipe, so this may eventually be solved without our intervention. We could set up a nice test case and report the issue to conan though

@ViliusSutkus89
Copy link
Contributor Author

Fontforge fails to compile with msvc-19.40. I've checked Fontforge and they do actually ship Windows binaries, although I'm unsure if they compile with msvc. Not sure if this is worth pursuing, conan-odr-index should be the first place where we add extra compilers or windows support. Disabling msvc compiler here.

@andiwand
Copy link
Member

andiwand commented Aug 8, 2024

We don't really use the msvc build but I like to keep it because it picks on stuff the other compilers don't. Generally I think pdf2htmlex should be a feature we can toggle at compile time so I would just turn it off for windows.

@ViliusSutkus89
Copy link
Contributor Author

Brought back the msvc compilers, 19.39 and 19.40, just for good measure.

Anyway, I would need some help integrating on the c++ side. With the current code on this PR, #include <pdf2htmlEX.h> works.

I assume the integration should happen somewhere in src/odr/html.cpp:

Html html::translate(const PdfFile &pdf_file, const std::string &output_path,
                     const HtmlConfig &config) {
  fs::create_directories(output_path);
#if WITH_PDF2HTMLEX
  return pdf2htmlex_wrapper_which_returns_html_object(pdf_file, output_path, config);
#else
  return internal::html::translate_pdf_file(pdf_file, output_path, config);
#endif
}

Any other ideas?

@andiwand
Copy link
Member

andiwand commented Aug 9, 2024

I think this makes a lot of sense yes. I don't think we need this runtime switchable. Even if we can change this in the future.

conanfile.py Outdated Show resolved Hide resolved
Copy link
Member

@andiwand andiwand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know what we can do about the resource files @ViliusSutkus89 ? I guess conan can shit these to the platform and we can figure out something there?

I have some mechanism to transform resources into code so we don't need extra files. But I think that would need a rework to make it also work for pdf2htmlex

@ViliusSutkus89
Copy link
Contributor Author

Resource files are ugly, I gather them in odr.droid/app/CMakeLists.txt and then tell gradle that asset merging has a dependency on CMake - https://github.com/opendocument-app/OpenDocument.droid/blob/d26811deef04dbba6d70b97c01967ec72557d641/app/build.gradle#L120 . I doubt it's possible to make the gathering somehow nicer.

Runtime side could be nicer though. Currently we extract them right before calling pdf2htmlEX. Would be possible to move that to app init, but it would make app init extra slow, so I don't really want to go that route.

Proper implementation would be to read assets in C++ directly from APK, using AssetManager, but I can't recall right now what was wrong with that route. Maybe it requires a high min API level. It definitely requires having access to the JNI Env object and AssetManager, obtained from Android's Context. These objects are bound to the specific Java thread, which does JNI, pdf2htmlEX doesn't do threading, I'm unsure about poppler. Anyway, this implementation requires some code.

@ViliusSutkus89
Copy link
Contributor Author

I've finally figured out why the tests were failing, turns out I've forgot to feed test VMs with ~/.conan2/p, which contains assets. Reference outputs are now generated and committed into appropriate repos. Naturally, I've had to run into another problem.

Why is actions/checkout getting old submodule revisions? Do I need to manually update git revisions somewhere?

Checkout with submodules should just work, right?

- uses: actions/checkout@v4
  with:
    submodules: true

.gitmodules does not list any revisions:

[submodule "test/data/input/odr-public"]
	path = test/data/input/odr-public
	url = https://github.com/opendocument-app/OpenDocument.test.git
[submodule "test/data/reference-output/odr-public"]
	path = test/data/reference-output/odr-public
	url = https://github.com/opendocument-app/OpenDocument.test.output.git
[submodule "test/data/input/odr-private"]
	path = test/data/input/odr-private
	url = git@github.com:opendocument-app/OpenDocument.test-private.git
[submodule "test/data/reference-output/odr-private"]
	path = test/data/reference-output/odr-private
	url = git@github.com:opendocument-app/OpenDocument.test-private.output.git

OpenDocument.test-private

From https://github.com/opendocument-app/OpenDocument.test-private
 * branch            508.... -> FETCH_HEAD
Submodule path 'test/data/input/odr-private': checked out '508...'

508.... is the third most recent commit in that repo. Commited on Dec 30, 2021. There are two more recents commits, made in summer of 2022.

OpenDocument.test

opendocument-app/OpenDocument.test checks the correct revision, c2cc81ba91b6145ff51801644169f4f01878556b is the HEAD

Submodule path 'test/data/input/odr-public': checked out 'c2cc81ba91b6145ff51801644169f4f01878556b'

OpenDocument.test-private.output

From https://github.com/opendocument-app/OpenDocument.test-private.output
 * branch            1b5... -> FETCH_HEAD
Submodule path 'test/data/reference-output/odr-private': checked out '1b5...'

1b5... is second most recent commit, checkout skipped my recently added commit.

OpenDocument.test.output

From https://github.com/opendocument-app/OpenDocument.test.output
 * branch            6138deea822cc17e940181fe3d99b6b6aef64551 -> FETCH_HEAD
Submodule path 'test/data/reference-output/odr-public': checked out '6138deea822cc17e940181fe3d99b6b6aef64551'

6138deea822cc17e940181fe3d99b6b6aef64551 is third most recent commit.

@andiwand
Copy link
Member

andiwand commented Sep 3, 2024

I think for now it would be fine to smoke test the two libraries. Just picking one input file for each of them and translate it to a random directly checking that it does not fail. For the ref comparison I would try to integrate it into the existing translation mechanism first

@ViliusSutkus89
Copy link
Contributor Author

We are way past the smoke test now. Ref testing is already implemented. It's just that a fresh git clone with recursive submodules clones wrong reference outputs and that's why the CI fails.

Just checked and realized that a local git clone also has this behavior, so it's not GitHub actions issue, more of a git issue

@ViliusSutkus89
Copy link
Contributor Author

Turns out I have to update git submodules in the root project too

@ViliusSutkus89
Copy link
Contributor Author

Another issue:

 ├── pdf2htmlEX
│   ├── pdf
│   │   ├── empty.pdf
│   │   │   ├── document.html ✓
│   │   ├── style-various-1.pdf
Traceback (most recent call last):
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 233, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 223, in main
    result = print_results(results, args.a, args.b)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 184, in print_results
    sub_result = print_results(sub_results,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 160, in print_results
    cmp = future.result()
          ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 85, in compare
    return compare_files(path_a,
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 49, in compare_files
    return compare_html(a, b, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/compare_output.py", line 33, in compare_html
    diff, (image_a, image_b) = html_render_diff(a, b, browser=browser)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 43, in html_render_diff
    image_a = screenshot(browser, to_url(a))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 23, in screenshot
    png = body.screenshot_as_png
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 325, in screenshot_as_png
    return b64decode(self.screenshot_as_base64.encode("ascii"))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 314, in screenshot_as_base64
    return self._execute(Command.ELEMENT_SCREENSHOT)["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
    self.error_handler.check_response(response)
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: [Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: chrome://remote/content/shared/Capture.sys.mjs :: capture.canvas :: line 137"  data: no]
Stacktrace:
capture.canvas@chrome://remote/content/shared/Capture.sys.mjs:137:62
takeScreenshot@chrome://remote/content/marionette/actors/MarionetteCommandsParent.sys.mjs:292:37

Still no idea what that means

test/CMakeLists.txt Outdated Show resolved Hide resolved
@andiwand
Copy link
Member

andiwand commented Sep 4, 2024

Not sure if mixing the test file output between integrated and not integrated parts is a good idea. I see that adding the new API is the easiest solution for now but the new tests couple quite tightly to the new api and the existing test repos.

@ViliusSutkus89
Copy link
Contributor Author

I could put pdf2htmlEX/wvWare reference outputs in
test/data/reference-output/odr-public/output-pdf2htmlEX instead of
test/data/reference-output/odr-public/output/pdf2htmlEX and test outputs in
build/test/output/odr-public/output-pdf2htmlEX instead of
build/test/output/odr-public/output/pdf2htmlEX.

loaded_page_settling_time = 0

# Selenium doesn't like when we try to screenshot <body> element of documents generated by pdf2htmlEX
if 'output-pdf2htmlEX' in url:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this looks ugly, but do we have a way to do this properly? I could add --pdf2htmlEX-workaround argument, but it would also need to be added to scripts/compare_output.py and pass through a bunch of functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is not great - do you know if the screenshot will somehow fail or give some deterministic broken result? I wondered if we could have a fallback instead of trying to detect who generated the PDF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selenium fails to screenshot the element of pdf2htmlEX's htmls. An exception is thrown.

   File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 23, in screenshot
    png = body.screenshot_as_png
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 325, in screenshot_as_png
    return b64decode(self.screenshot_as_base64.encode("ascii"))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 314, in screenshot_as_base64
    return self._execute(Command.ELEMENT_SCREENSHOT)["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
    self.error_handler.check_response(response)
  File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: [Exception... "Failure"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: chrome://remote/content/shared/Capture.sys.mjs :: capture.canvas :: line 137"  data: no]
Stacktrace:
capture.canvas@chrome://remote/content/shared/Capture.sys.mjs:137:62
takeScreenshot@chrome://remote/content/marionette/actors/MarionetteCommandsParent.sys.mjs:292:37

I've tried to solve it properly, but couldn't figure it out. Maybe because the generated body has too big dimensions? It works when I screenshot the inner <div id="page-container"> element

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we catch the exception and then fall back to your solution?

@ViliusSutkus89 ViliusSutkus89 marked this pull request as ready for review September 5, 2024 01:36
src/odr/internal/project_info.hpp.in Show resolved Hide resolved
test/scripts/html_render_diff.py Show resolved Hide resolved
loaded_page_settling_time = 0

# Selenium doesn't like when we try to screenshot <body> element of documents generated by pdf2htmlEX
if 'output-pdf2htmlEX' in url:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is not great - do you know if the screenshot will somehow fail or give some deterministic broken result? I wondered if we could have a fallback instead of trying to detect who generated the PDF

@andiwand
Copy link
Member

Let's get this in! I will try to follow up on some of my suggestions here #387 and will revisit the screenshotting there as well

@andiwand andiwand merged commit f1e1007 into main Sep 14, 2024
18 checks passed
@andiwand andiwand deleted the pdf2htmlex-conan branch September 14, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants