sch:visit element #51

Open
rjelliffe opened this issue Aug 29, 2022 · 0 comments

I would like to propose a new element for Schematron, intended to allow a stylesheet to declare its required behaviour better, to improve the power of phases and roles, to increase clarity in the schema, and potentially to substantially improve efficiency by reducing unnecessary processing.

The element is optional and may appear as /sch:schema/sch:visit, sch:phase/sch:visit, sch:phase/sch:active/sch:visit and sch:pattern/sch:visit. The sch:visit element would be standard for any Schematron schema, but the attributes used depend on the query language binding (QLB). The default is suited to any QLB where the document is XML (or viewed as XML).

The sch:visit element declares

  1. What kind of infoset is assumed/required
  2. What type of nodes should be visited (to pattern granularity)
  3. Whether validation should be restricted to some branch (e.g. in the current phase)
  4. A priority and declaration for @ROLE values

An example is this:

<sch:schema ... queryBinding="xslt2" >
    <sch:visit
           elements="yes"
           attributes="no"
           text="yes"
           comment="no"
           processing-instruction="no"
           infoset="xml entities dtd valid"
           branch="/"
           role-priority="fatal error warn info tip"
    />

This declares that the engine needs to visit and validate elements and text nodes, but not attributes, comments or processing instructions. Roughly, if an engine does not support visiting a requested node type (for example, it sees attributes="yes" but cannot visit attributes), it should generate a warning when it starts.

This also allows an engine to select a visiting strategy that is optimal for the document. An implementation may override these on the command line (or a user could edit the file) to switch off or prioritize validating certain items, or limit the start-point of the validation to a certain branch.

Lexical

The effective pseudo-DTD would be

<!ATTLIST sch:visit
           -- "yes"/"true": that node type is required to be visited; "no"/"false": it is required not to be visited;
              "auto" allows detection of the needed node types from the @context XPaths.
              By default, "auto" inherits from the next higher sch:visit. --
           elements               ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "yes" if auto is not implemented --
           attributes             ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto is not implemented --
           text                   ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto is not implemented --
           comment                ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto is not implemented --
           processing-instruction ( "yes" | "no" | "true" | "false" | "auto" ) "auto"    -- "auto" defaults to "no" if auto is not implemented --

           -- Which nodes need to be visited? See later. --
           infoset  NMTOKENS  "xml"
           -- The branch to start validation from --
           branch CDATA "/"
           -- Significant values used in @role (multiple tokens need to be allowed in @role), and their priority --
           role-priority "fatal BASIC error warning DETAIL info tip"
>

Node Visiting

  • sch:pattern/sch:visit declares what nodes the pattern needs to visit in its document
  • sch:phase/sch:visit declares what nodes need to be looked at in that phase in any document
  • sch:phase/sch:active/sch:visit declares what nodes need to be looked at in that phase in any document
  • sch:schema/sch:visit declares what nodes need to be looked at in the default document.

For the node visiting, a lower-level element restricts a higher-level one, in the priority order defaults/implementation-override/schema/phase/active/pattern. In effect, whether a pattern visits a certain kind of node is the AND of all the in-scope sch:visit attributes.

  • If a node type is not specified, visiting defaults to the ISO Schematron rules.
  • If an implementation decides to restrict application of validation to certain node types, it can do so.
  • If the sch:schema/sch:visit has attributes="no" and comment="yes", then a sch:pattern inherits them by default.

An implementation can override these (limit further).
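
The AND-combination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `effective_visit` and the dictionary representation of each sch:visit are hypothetical, a missing attribute is treated as "auto", and the fallbacks are the ones given in the pseudo-DTD for engines that do not implement auto-detection.

```python
NODE_TYPES = ("elements", "attributes", "text", "comment", "processing-instruction")

# Fallbacks from the pseudo-DTD, used when no level says anything explicit
# and "auto" detection is not implemented.
AUTO_FALLBACK = {
    "elements": True,
    "attributes": False,
    "text": False,
    "comment": False,
    "processing-instruction": False,
}

ALLOWED_VALUES = {"yes": True, "true": True, "no": False, "false": False}


def effective_visit(levels):
    """Combine sch:visit settings, outermost scope first.

    `levels` is a list of dicts (e.g. schema, phase, active, pattern), each
    mapping a node type to "yes", "no", "true", "false" or "auto".
    A node type is visited only if every explicit setting allows it.
    """
    result = {}
    for node_type in NODE_TYPES:
        explicit = []
        for level in levels:
            value = level.get(node_type, "auto")
            if value == "auto":
                continue  # inherit from the level above
            if value not in ALLOWED_VALUES:
                raise ValueError(f"bad value for {node_type}: {value!r}")
            explicit.append(ALLOWED_VALUES[value])
        result[node_type] = all(explicit) if explicit else AUTO_FALLBACK[node_type]
    return result
```

With a schema-level `{"attributes": "no", "comment": "yes"}` and a pattern-level `{"comment": "no"}`, the pattern visits elements only: attributes stay switched off by inheritance, and the pattern restricts comments further.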

Alternatives (visiting by node type)

The problem that this solves is that XPath is very complicated to parse, so it is non-trivial to look through each @context to see whether it selects text or attributes. You can get a pretty good hint in some cases (does the XPath contain "processing-instruction(" or "comment("?).

For example, an implementer might decide "if none of the @contexts in rules in active patterns contain 'attribute::' or '@' or 'attribute' then I don't need to visit attributes". But that would result in lots of unnecessary visits, so it would need to be coupled with some simpler parse of the @context XPath: for example, one that produces an XPath with all predicates removed, so that the only place left for @ or attribute:: is in the location steps.

However, who has produced this simpler parser/stripper?

So having an ability to declare explicitly what kind of visiting is required does open up the door for phases (and particular patterns within a phase) to have an optimal visiting strategy.
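
For comparison, the predicate-stripping heuristic discussed above might look roughly like the following Python sketch. It is deliberately naive and the function names are hypothetical: it tracks bracket nesting and string literals, but it does not understand XPath 2.0 comments, and a string literal outside a predicate that happens to contain "@" would give a false positive.

```python
def strip_predicates(xpath):
    """Return the XPath with every [...] predicate removed.

    Nested predicates and brackets inside string literals are handled;
    nothing else about XPath syntax is understood.
    """
    out = []
    depth = 0      # how many predicates we are currently inside
    quote = None   # quote character of the string literal we are inside, if any
    for ch in xpath:
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "[":
            depth += 1
            continue
        elif ch == "]":
            depth -= 1
            continue
        if depth == 0:
            out.append(ch)
    return "".join(out)


def needs_attribute_visits(contexts):
    """Guess whether any @context selects attribute nodes in its location steps."""
    for context in contexts:
        steps = strip_predicates(context)
        if "@" in steps or "attribute::" in steps or "attribute(" in steps:
            return True
    return False
```

A rule with context `table[@cols]` only tests an attribute inside a predicate, so it does not force attribute visits, whereas `table/@cols` does. An explicit sch:visit declaration avoids the guesswork altogether.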

@ branch

The branch attribute takes subsets of XPath that specify one or more elements: validation starts at (and under) those elements.

  1. a simplified XPath to an element: absolute, with wildcarded namespace, wildcardable name and a position predicate (optional on the last step), which locates a single element node that is the branch to be validated, with no "//" anywhere. E.g. /*:book[1]/*:appendix[3] or /*[1]/purchase-order[5]/item
  2. an absolute descendant search (starts with "//") with a single optional position predicate, e.g. //footnote or //chapter[1]
  • The first form limits the scope of the schema to a certain branch of the document only. This can help reduce unnecessary content matching, where the pattern contents are only in some branch.
  • The second form is for a complete traversal of the document, selecting either all the elements with that name, or the nth one.

The XPath subset is simple enough to be trivially parsed in XPath using a tokeniser. Supporting more of XPath would be an implementation problem (e.g. for people in the MS ecosystem who still have to use XSLT 1.0). But there are all sorts of possibilities.
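
To give a feel for how small that subset is, here is a rough Python sketch of a parser for the two forms. The function name `parse_branch` and the returned tuples are hypothetical, the name characters accepted are an approximation rather than the real XML Name production, and position predicates are treated as optional on every step for simplicity.

```python
import re

# One step: "/" followed by "*:local", "*" or "local", then an optional [n].
STEP = re.compile(r"/(\*:[\w.-]+|\*|[\w.-]+)(?:\[(\d+)\])?")


def parse_branch(branch):
    """Parse a @branch value.

    Returns ("root", []) for "/", ("path", [(name, position), ...]) for an
    absolute path, or ("descendant", (name, position)) for a "//name" search.
    A missing position predicate is reported as None.
    """
    if branch == "/":
        return ("root", [])
    descendant = branch.startswith("//")
    text = branch[1:] if descendant else branch
    steps = []
    pos = 0
    while pos < len(text):
        match = STEP.match(text, pos)
        if match is None:
            raise ValueError(f"not in the @branch subset: {branch!r}")
        position = int(match.group(2)) if match.group(2) else None
        steps.append((match.group(1), position))
        pos = match.end()
    if not steps:
        raise ValueError(f"empty @branch value: {branch!r}")
    if descendant:
        if len(steps) != 1:
            raise ValueError(f"'//' form takes a single step: {branch!r}")
        return ("descendant", steps[0])
    return ("path", steps)
```

Anything outside the subset, such as a "//" in the middle of a path, is rejected rather than guessed at.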

In particular, @ branch accepts the XPaths found in svrl:failed-assert/@location attributes, which means that in a loosely interactive application the editor can go quickly from an error report to a re-validation, or just validate the current element.

A parallelizer could run validation of each branch-start-element in a separate thread.

(At worst, an implementation could implement this by filtering out SVRL elements: that would not have performance benefits for the validation, but could still reduce processing cost or complexity on the user side.)
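
That worst-case fallback could be as simple as the following Python sketch, which drops svrl:failed-assert and svrl:successful-report elements whose @location lies outside the branch. It is an assumption-laden illustration: the function names are hypothetical, only the absolute-path form of @ branch is handled, steps are compared as exact strings, and locations whose predicates contain "/" (for example namespace-uri() tests) would need a real tokeniser.

```python
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
REPORT_TAGS = (f"{{{SVRL_NS}}}failed-assert", f"{{{SVRL_NS}}}successful-report")


def in_branch(location, branch):
    """True if `location` is the branch element or lies beneath it."""
    if branch == "/":
        return True
    location_steps = location.strip("/").split("/")
    branch_steps = branch.strip("/").split("/")
    # Compare whole steps, so that appendix[3] does not match appendix[30].
    return location_steps[: len(branch_steps)] == branch_steps


def filter_svrl(svrl_text, branch):
    """Parse an SVRL report and remove results located outside the branch."""
    root = ET.fromstring(svrl_text)
    for child in list(root):
        if child.tag in REPORT_TAGS and not in_branch(child.get("location", ""), branch):
            root.remove(child)
    return root
```

This saves nothing during validation itself, but it gives users the same report they would get from an engine that implemented @ branch natively.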

@ infoset - making feather-dusting practical

The @ infoset attribute has a little Domain Specific Language, just keywords. It specifies what kind of infoset the schema requires. As far as I know, there is no standard method for doing this (even in DSDL, and the schema PI is not powerful enough), which means even as simple a question as ''can my Schematron schema assume that DTD or XSD defaults have been put in?'' has no way to be specified.

                  " xml",  ( ("xinclude" | "dtd" | "xsd" | "rng" | * ) ("expand" | "type" | "validate" )+)*   |   "psvi"  | (*)*  
  •         e.g. "xml" is plain old standalone XML, no XInclude, no entity inclusion, no DTD processing for IDs and defaults, no validation
    
  •         e.g. "psvi" is an alias for Post Schema Validation Infoset, and equivalent to "xml xsd type validate".
    

@ infoset can be used to check or advise or fail or even perform, depending on the implementation. The markup says what the assumption of the schema is, either for humans (to know how they need to configure their system) or for implementations (to configure automatically or to check as much as they can). The (*) is for extensibility, but an implementation would fail if there was something it didn't understand here.

An implementation that does not handle something should raise an ERROR.

  • xinclude takes only "expand"
  • dtd takes "expand" (the schema advises that entity dereferencing is needed), "type" (the DTD provides default values and IDs) and "validate" (fails on invalidity)
  • xsd takes "type" (PSVI) and "validate" (fails on invalidity); maybe something else is needed for streaming?
  • rng takes "validate" (validation). I expect we want the other parts of DSDL here too.
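
The keyword language above is small enough that checking it takes only a few lines. The following Python sketch is one possible reading of it, not a definitive grammar: the function name `parse_infoset` is hypothetical, it assumes the value must start with "xml" (or an alias that expands to it), and it treats every unknown keyword as an error, as suggested for implementations that do not understand an extension.

```python
# Which actions each processor keyword accepts, per the list above.
ALLOWED_ACTIONS = {
    "xinclude": {"expand"},
    "dtd": {"expand", "type", "validate"},
    "xsd": {"type", "validate"},
    "rng": {"validate"},
}

ALIASES = {"psvi": ["xml", "xsd", "type", "validate"]}


def parse_infoset(value):
    """Turn an @infoset value into {processor: [actions]}, or raise ValueError."""
    tokens = []
    for token in value.split():
        tokens.extend(ALIASES.get(token, [token]))
    if not tokens or tokens[0] != "xml":
        raise ValueError('@infoset must start with "xml" or an alias such as "psvi"')
    requirements = {}
    current = None
    for token in tokens[1:]:
        if token in ALLOWED_ACTIONS:
            current = token
            requirements.setdefault(current, [])
        elif current is not None and token in ALLOWED_ACTIONS[current]:
            requirements[current].append(token)
        else:
            raise ValueError(f"unsupported @infoset keyword: {token!r}")
    for processor, actions in requirements.items():
        if not actions:
            raise ValueError(f"{processor!r} needs at least one action")
    return requirements
```

An engine could use the result to configure its parser automatically, or simply print it so that a human knows how the pipeline in front of Schematron has to be set up.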
    

Example:

 <sch:phase id="validate-chapters">
    <sch:visit  elements="yes" attributes="no"  branch="//chapter" />
    <sch:active pattern="validate-tables"/>
    <sch:active pattern="validate-titles"/>
    <sch:active pattern="validate-text"/>
    <sch:active pattern="validate-figures"/>
 </sch:phase>

 <sch:phase id="validate-appendixes">
    <sch:visit  elements="yes" attributes="no"  branch="//appendix" />
    <sch:active pattern="validate-tables"/>
    <sch:active pattern="validate-titles"/>
    <sch:active pattern="validate-text"/>
    <sch:active pattern="validate-figures"/>
 </sch:phase>

 <sch:phase id="validate-technical-appendix">
    <sch:visit  elements="yes" attributes="yes"  branch="//appendix[1]" />
    <sch:active pattern="validate-technical-attributes"/>
 </sch:phase>

In this example, we have phases to validate chapters only (all of them, visiting only elements), appendixes only (all of them, visiting only elements) and particular attribute constraints of the first appendix.

People seem to grok Schematron as a feather-duster for the places other schema languages cannot reach: i.e. more power, but used in an ancillary fashion to another Schema language. So it seems reasonable that a Schematron schema have a way to specify its infoset, because validation (with DTDs and XSD, not RNG) can change the information in the instance being validated: it is not a parallel process but a serial one.

@ role-priority

This specifies which role values are expected by the schema. This is information for document and schema writers and implementers, and it allows the schema itself to be checked to confirm that its @ROLE values are limited to these tokens.

For example

   role-priority="fatal BASIC error warning DETAIL info tip"

has two sets of roles, one is fatal/error/warning/info/tip and the other is BASIC/DETAIL. The highest priority token in an @ROLE is the one (probably) used for that @ROLE by some application.

Moreover, the priority allows an implementation to adopt different strategies:

  • Divide/select/sort the patterns/rules/asserts to fit in with fail-fast behaviour, where the user wants to fail as fast as possible on some fatal error and not continue validating. This might use roles like FATAL, ERROR etc.
  • Divide/select/sort the patterns/rules/asserts so that the SVRL has some pre-sort. E.g. an implementation might have an operational mode where, when a rule fires, the assertions are tested in priority order, with the assertion testing for each rule stopping at the first failure, but validation continuing. For example we might have
<sch:schema>
   <sch:visit  ...
         role-priority="PREREQUISITE DETAIL" />
   <sch:rule context="table">
     <sch:assert role="warning" test="@cols">A table should have a cols attribute</sch:assert>
     <sch:assert role="PREREQUISITE warning" test="row">A table should have at least one row</sch:assert>
     <sch:assert role="DETAIL error" test="count(row) = count(*)">A table should only have rows</sch:assert>
   </sch:rule>
   ...

In this example, when we find a table element, the rule first checks that there is at least one row (because that assertion's @ROLE has "PREREQUISITE").
If there is not, it reports the issue (the role also says it is a warning) and does not test any more assertions on that node.
If there is, then it checks the next-lower priority, which here has one assert that tables should only have rows. If that fails it reports the failure and does not test any more assertions on that node.
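
The ordering behaviour just described can be sketched in Python as follows. This is a minimal illustration rather than a prescribed algorithm: `rank` and `first_failure` are hypothetical names, `evaluate` stands in for whatever XPath engine tests an assertion against the current node, and assertions whose @ROLE contains none of the role-priority tokens are assumed to be tested last.

```python
def rank(role, priority):
    """Position of the highest-priority token in @role (0 is highest).

    Roles with no token listed in role-priority rank after all listed ones.
    """
    positions = [priority.index(token) for token in role.split() if token in priority]
    return min(positions, default=len(priority))


def first_failure(asserts, priority, evaluate):
    """Test a rule's assertions in priority order, stopping at the first failure.

    `asserts` is a list of (role, test, message) tuples and `evaluate(test)`
    returns True when the assertion holds. Returns (role, message) for the
    first failed assertion, or None if every assertion passes.
    """
    # sorted() is stable, so assertions of equal rank keep their schema order.
    for role, test, message in sorted(asserts, key=lambda a: rank(a[0], priority)):
        if not evaluate(test):
            return (role, message)
    return None
```

Applied to the table rule with role-priority="PREREQUISITE DETAIL", the "row" assertion is tested first, then the "only rows" assertion, and the unranked cols warning last. Validation of other nodes would carry on regardless of what this function returns.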

Conclusion

A main criticism of Schematron is that it is too slow. This is an issue that comes up on high-volume servers, for example in firewalling: sometimes Schematron is used to extract the requirements, prototype and debug a filter successfully, and is then replaced by e.g. a SAX-based program. That is fair enough.

However, there are many implementation strategies that could be provided to users that allow fast-fail, parallelization, reduced latency, or more targeted or sorted output. Many of these require a little extra information about the schema: information which belongs in the schema, as it adheres to particular phases, patterns, rules and assertions.

So I think the trick is how to leverage (as they used to say, to our snooty disdain) existing structures (such as phases and roles) to provide markup that implementers can readily support (i.e. which, by and large are optional to implement or have some trivial fallback.)

For example, @branch can be implemented by various methods: 1) failing with "not implemented" if it is found, 2) using it to skip visiting certain nodes, 3) visiting everywhere but not testing nodes outside the branch, 4) filtering the incoming SVRL so that nodes outside the branch are not added to the report, or 5) filtering the report after it is generated.

Similarly, @role-priority could be completely ignored by an implementation. Or it could just be used to validate the values of @ROLE in the same schema. Or it could be used to automatically prioritize testing assertions or patterns. Or it could be used in combination with a "skip tests in rule after first assertion fail" or "skip testing pattern on other nodes after first pattern with failed assert" or fail-fast ("skip testing any more nodes after first failure") mode. Or it could be used to test only the highest-priority tests.

These choices are for implementers to implement, but the priority of roles (and the infoset needed, and what kind of visiting is needed) is a schema concern (and the developer can decide whether and how to make use of it.) So all the Schematron schema needs to do is to provide suitable declarations that make explicit that hard-to-derive metadata about the schema.

@AndrewSales AndrewSales added the enhancement Adds new capabilities label Apr 23, 2023