SILE and XML - A Return on Experience #2111

Open
Omikhleia opened this issue Sep 19, 2024 · 9 comments

Omikhleia commented Sep 19, 2024

So Saith Simon:

SILE is fundamentally a typesetting system for arbitrary XML files.

-- (#508 (comment))

And we obviously mention similar things in the User Manual and other places (such as our Wiki, etc.)

But is that actually true?

And what sort of challenges might a package designer face when trying to use SILE as a typesetter for "arbitrary" XML files? I have some experience with this,1 and I'd like to share it with you.

SILE ships with an XML streaming parser (Luaexpat a.k.a. lxp), and merely converts XML elements and their attributes to SILE syntax tree nodes, as a straightforward "exact" mapping (tags become commands, attributes become options, text nodes become content).
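To make that "exact" mapping concrete, here is a minimal sketch in Python rather than Lua. It is not SILE's inputter: the `to_node` helper and the dict layout are assumptions made for illustration, and only the standard `xml.etree.ElementTree` module is used. Note how the whitespace-only text node survives untouched.

```python
import xml.etree.ElementTree as ET

def to_node(elem):
    """Map an element to a plain dict: tag -> command, attributes -> options,
    text and child elements -> content (hypothetical layout, not SILE's)."""
    content = []
    if elem.text:
        content.append(elem.text)
    for child in elem:
        content.append(to_node(child))
        if child.tail:
            # Text following a child element belongs to the parent's content.
            content.append(child.tail)
    return {"command": elem.tag, "options": dict(elem.attrib), "content": content}

tree = to_node(ET.fromstring('<para lang="en">Hello <em>world</em>\n</para>'))
```

Nothing in this mapping knows whether the trailing newline matters; that decision is left entirely to whatever consumes the tree.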

It doesn't do anything special with the XML, and notably:

  1. It has no knowledge of the document model (old-school DTDs, or more modern XML schemas; both go beyond the question of the "validity" of the input, they also define default/fixed values for attributes, the interpretation of spaces in some elements, etc.)
  2. It doesn't care about namespacing, and potential conflicts between XML elements from different types of documents (preventing their use in the same document), and even with the core SILE commands. Note that I am not referring specifically to XML namespaces in a document (xmlns, xmlns:xxx), though that is also a part of the equation; but rather to conflicts between different input documents. Let's call the former "internal namespacing" and the latter "external namespacing".
  3. Paragraphing is done by interpreting empty lines (precisely, double newlines) in the SILE typesetter, not in the inputter(s), which may be a problem for some XML documents.

Thoughts on document models

The problem in this space is that XML is a very broad format, with different applications.
But in practice, there are two different types of XML content (possibly, and actually often, mixed in a given schema/model, depending on context):

  • Presentational: XML that is meant to be rendered as-is, with minimal processing (i.e. more or less consisting of styled text, in the same order)
  • Structural: XML that is meant to be processed and rendered in a specific way (i.e. semantic information, where the order of elements is not necessarily the order of rendering)

This is a first-order approximation, but it's a useful distinction.2

So beyond its naive parser, what does SILE do for us here?

Not much, so:

  • Either one has to walk the syntax tree, and ignore space nodes. This is somewhat mundane, and inefficient. Right, we do have SU.ast.walkContent() (original, deeply recursive, not seen in any code base), and SU.ast.processAsStructure() (which I introduced recently and used in my modules), but they fail on some non-trivial cases, where one has to loop on the content manually, anyway...3
  • Or: One can skip SILE's default XML parser, and use a modified version of it, with some of the semantics known for the target schema. This is what I used in CSL support #2082 (see csl/core/utils/xmlparser in the current PR), with a few customizable "rules", albeit restrained to a minimal set of features (spaces and namespaces). We could go further, and generalize the idea, importing it in the XML inputter, with a few add-ons (e.g. better content appropriation strategies.)
  • Or: We eventually ditch the current parser (AFAIK, there's no schema/DTD-compliant library for Lua), and use an entirely different approach for a more clever parsing. (I.e. other libraries, whether via Rust or not, etc.)...
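As an illustration of the second option (a parser that knows a little about the target schema), here is a minimal Python sketch. It is not the code from #2082: the `STRUCTURAL` rule set and the `strip_structural_space` helper are hypothetical names, and the only "rule" modelled is dropping whitespace-only text inside elements declared as structural.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: elements whose direct content is purely structural,
# so whitespace between their children carries no meaning.
STRUCTURAL = {"cites", "form"}

def strip_structural_space(elem):
    """Remove whitespace-only text between the children of structural
    elements, in place; presentational elements are left untouched."""
    if elem.tag in STRUCTURAL:
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    for child in elem:
        strip_structural_space(child)
    return elem

doc = ET.fromstring(
    "<p>See <cites>\n  <cite>ref1</cite>\n\n  <cite>ref2</cite>\n</cites> <em>now</em></p>"
)
strip_structural_space(doc)
serialized = ET.tostring(doc, encoding="unicode")
```

The space between `</cites>` and `<em>` is kept because `<p>` is not in the rule set, while the newlines and indentation inside `<cites>` are gone before any later processing has to deal with them.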

Thoughts on external namespacing

It should be obvious that complex enough schemas will have conflicts with other schemas, including SILE's own.
Say a document encodes chapters as <chapter><title> ... </title><body> ... </body></chapter>. We'd like to use our existing book class, but wait... We need to save our chapter command, which has a wholly different structure, and use that saved version in our re-implementation. Ah. But then other documents are in trouble. And it's a lot of potential command saving/restoring, notwithstanding 3rd-party package expectations...

In my above-mentioned approach (#2082), I've used a simple "namespace" mechanism (= read a "prefix"). It's not really used for CSL (which is processed differently), but heh! I just imported the idea from my other in-progress projects in SILE.
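The general idea of prefixing commands per schema can be sketched as follows. This is a toy Python model, not SILE's command registry and not the mechanism from #2082 (whose details are not given here); `CommandRegistry`, `register` and `dispatch` are hypothetical names.

```python
class CommandRegistry:
    """Toy command table keyed by "prefix:name", so that two schemas can
    both define a `chapter` command without overwriting each other."""

    def __init__(self):
        self._commands = {}

    def register(self, prefix, name, handler):
        self._commands[f"{prefix}:{name}"] = handler

    def dispatch(self, prefix, name, content):
        key = f"{prefix}:{name}"
        if key not in self._commands:
            raise KeyError(f"no command registered for {key}")
        return self._commands[key](content)

registry = CommandRegistry()
# The class's own command and the schema's element coexist under different prefixes.
registry.register("book", "chapter", lambda content: f"[book chapter] {content}")
registry.register("mydoc", "chapter", lambda content: f"[mydoc chapter] {content}")
```

With such a table, the handler for a document's `<chapter>` can call the `book` version explicitly instead of saving and restoring a global command.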

Thoughts on internal namespacing

A document could include, say, an SVG not wrapped in a CDATA (uh-oh), but simply with a namespace declared on the root element, or actually any element.

<root xmlns:svg="http://www.w3.org/2000/svg">
    ...
    <svg:svg>
        <svg:circle cx="50" cy="50" r="40" stroke="black" stroke-width="3" fill="red" />
    </svg:svg>
    ...
</root>

That's the most usual way, but note the namespace prefix doesn't even have to be svg, it could be anything...
So we'd need a special provision for explicitly namespaced elements (Luaexpat has some options, but I am not sure they are what we'd want here)...
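For comparison, here is how a namespace-aware parser sidesteps the prefix problem by identifying elements by namespace URI. The sketch uses Python's standard `xml.etree.ElementTree`, not Luaexpat, and deliberately binds the SVG namespace to an arbitrary prefix `v`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# The prefix is "v" here, not "svg": only the URI identifies the vocabulary.
source = """<root xmlns:v="http://www.w3.org/2000/svg">
  <v:svg><v:circle cx="50" cy="50" r="40"/></v:svg>
</root>"""

root = ET.fromstring(source)
# ElementTree expands prefixed names to "{uri}local", whatever the prefix was.
svg_tags = [el.tag for el in root.iter() if el.tag.startswith("{" + SVG_NS + "}")]
local_names = [tag.split("}", 1)[1] for tag in svg_tags]
```

A dispatcher keyed on the URI could then hand the whole `svg` subtree to a dedicated handler, regardless of the prefix the document author chose.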

Notes on the root element

By the way, currently, the XML inputter wraps the parse tree in a document command node if it does not start with a <document> tag,4 with no class (plain applies) and no clean strategy for loading the necessary tag support (enforcing a class from the command line? Dubious at best...; using a wrapper document is better, but not straightforward; a preamble is possible too...)5

Thoughts on paragraphing

I'm less advanced in my thoughts here, but I have a deep feeling that paragraphing done at the typesetter's level (typesetter.parseppattern) is inherently wrong, and that it should be done at the inputter's level.
To be honest, I even feel the newPar/endPar typesetter hooks into the class are not that great, and that we should have a different general approach to this. Our syntax tree is not even really an AST. The latter would have explicit paragraph nodes, where appropriate. (I have a few ideas on this, but I'll keep them for another time.)
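To show what "explicit paragraph nodes" produced at the input stage could look like, here is a small Python sketch. It is only one possible reading of the idea, not a proposal for SILE's actual design: the `to_paragraphs` helper and the `paragraph` node name are assumptions, and unlike a strict double-newline rule it also treats whitespace-only lines as blank.

```python
import re

def to_paragraphs(text):
    """Split raw text on blank lines and return explicit paragraph nodes,
    so that later stages never have to look for double newlines."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [
        {"command": "paragraph", "content": " ".join(chunk.split())}
        for chunk in chunks
        if chunk.strip()
    ]

nodes = to_paragraphs("First line\nstill first.\n\n\nSecond.")
```

With paragraphs materialized as nodes, an XML inputter could simply never emit them for structural content, instead of relying on the absence of blank lines in the source.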

Concluding remarks

Thanks for reading this far, if you made it. I hope I've given you some food for thought, and that you'll consider these points in the future. I'm happy to discuss any of these points further.

Footnotes

  1. In my now >3-year involvement with SILE, I've implemented parsers and/or processors for 2 subsets of TEI XML (dictionaries, critical apparatus), a substantial portion of 2 biblical scripture XML schemas (USX, and a prior attempt at USFX), CSL (locales, styles), and an attempt at reviewing SILE's DocBook support. In the same time frame, I have not seen any other SILE developer or user attempt to use SILE as a typesetter for other arbitrary XML documents.

  2. For instance, <strong>text</strong> is <em>good,</em> is always presentational, and all spaces are significant.
    On the other hand,

    <cites>
      <cite>ref1</cite>
    
        <cite>ref2</cite>
    </cites>
    

    ... is structural, and spaces or linebreaks are not significant. Order might be...

    <form>
      <orth>elementary</orth>
      <hyph>el|emen|tary</hyph>
      <pron>%El@"mEn%t@ri</pron>
    </form>
    

    ... is also structural at the "form" level, but the order of elements is not significant, and the rendering might reshuffle them as, say: "elementary /ˌɛləˈmɛntəri/ (el-emen-tary)". Just to show that the developer will already have some complex code strategies to implement for this, and anything that would help alleviate the burden would be welcome. One will already have a lot of SU.findInTree()/SU.removeFromTree() calls to make, several inputfilter or string-parsing steps to involve, and that's just the beginning...

    EDIT: feat: DocBook class overhaul #1789 for docbook support is far from complete... #1338 is full of such syntax tree ad-hoc operations, far beyond decency.

  3. See also lists package: enforceListType precludes XML handling #2073, with attempts to use a "schema" (SILE's lists) for another (HTML lists), and the difficulties encountered in the process. I promised a discussion in my comments: Well, this is it.

  4. Or <sile> tag, see also Top level tags differ between XML/TeX flavors #508, an old discussion that also points towards a sensible resolution...

  5. So we have many "workarounds". But in reality, we might be more frequently in the "wrapper" scenario for most real-world cases. Personally, I'd use a "master document" (cool re·sil·ient stuff), for metadata, book covers, etc. so the XML would end up just as an included fragment. (My approach even here https://github.com/Freely-Given-org/BibleTypesetter/pull/3)
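The reshuffling described for the <form> entry in footnote 2 can be sketched as below. This is a Python illustration using the standard `xml.etree.ElementTree`, not SILE code; `render_form` is a hypothetical helper, and the SAMPA-to-IPA conversion of the pronunciation is left out, so the raw <pron> string is printed as-is.

```python
import xml.etree.ElementTree as ET

source = """<form>
  <orth>elementary</orth>
  <hyph>el|emen|tary</hyph>
  <pron>%El@"mEn%t@ri</pron>
</form>"""

def render_form(form):
    """Pick children by name, not by position, and emit them in rendering
    order: orthography, pronunciation, then hyphenation."""
    orth = form.findtext("orth", default="")
    pron = form.findtext("pron", default="")
    hyph = form.findtext("hyph", default="").replace("|", "-")
    return f"{orth} /{pron}/ ({hyph})"

rendered = render_form(ET.fromstring(source))
```

Looking children up by name makes the source order and the whitespace between them irrelevant, which is exactly what a structural element calls for.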

Omikhleia commented Sep 19, 2024

Point 1 above, I forgot to say, tangentially relates to #1957. I might be wrong, but one seldom finds xml:space in XML documents. Why? AFAIK, because nowadays it's often implied by the schemas, when it wasn't already by the earlier DTDs (see #FIXED).
