Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add at least rudimentary DTD handling #50

Open
frp opened this issue Sep 16, 2017 · 6 comments
Open

Add at least rudimentary DTD handling #50

frp opened this issue Sep 16, 2017 · 6 comments

Comments

@frp
Copy link

frp commented Sep 16, 2017

sxd-document does not work if the document has internal DTD.

Trying to parse the example document (taken from here):

<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>

I get the following error: (37, [Expected("SYSTEM")]). 37 is the position of the first [.
Adding full parsing of internal DTD might be quite a big deal, but, for my purposes (parsing JMdict) just ignoring the DTD content would be fine.

@shepmaster
Copy link
Owner

Mentoring Instructions

The DTD grammar and tokens are defined in the XML spec.

Work for this will probably largely live in parse_document_type_declaration:

fn parse_document_type_declaration<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>) -> XmlProgress<'a, Token<'a>> {

For now, we aren't really worried about amazing performance or even collecting the data, all we need to make sure of is that the parser can accept the input data.

A small example to add to the tests:

<?xml version="1.0"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd">
<cXML />

Strangely, it kind of seems like this much of it should already be supported; I'm guessing that there's some small mistake in what the parser accepts.

@onelson
Copy link
Contributor

onelson commented Mar 10, 2018

OK, so I've been tinkering and learning. So far, I have found that the presence of a / character is enough to cause breakage.

I traced the failure back to parse_one_quoted_value

sxd-document/src/parser.rs

Lines 385 to 387 in 5e2253d

let (xml, _) = try_parse!(xml.consume_literal(quote).map_err(|_| Error::ExpectedOpeningQuote(quote)));
let (xml, value) = try_parse!(f(xml));
let (xml, _) = try_parse!(xml.consume_literal(quote).map_err(|_| Error::ExpectedClosingQuote(quote)));

I'm assuming that the 2nd try_parse!() here is meant to consume all characters up until the token passed in (in this case either a ' or a "), yet it's terminating prematurely at the /.

I don't really see a reason for this, but at this superficial level I suspect the bug is actually in the peresil crate which provides the try_parse!() macro. I may pivot over to write some tests for peresil and see if I can reproduce with a simple case.

@shepmaster
Copy link
Owner

It's actually because the grammar for parse_external_id got mistranslated:

commit 7e6b99340b40a06ec938c9d12b96d103c4e4a5ac
Author: Jake Goulding <jake.goulding@gmail.com>
Date:   Wed Mar 7 19:31:17 2018 -0500

    DTD SystemLiterals can contain any character

diff --git a/src/parser.rs b/src/parser.rs
index 995ba0d..b645f9c 100644
--- a/src/parser.rs
+++ b/src/parser.rs
@@ -481,8 +481,10 @@ fn parse_external_id<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>)
     let (xml, _) = try_parse!(xml.expect_literal("SYSTEM"));
     let (xml, _) = try_parse!(xml.expect_space());
     let (xml, external_id) = try_parse!(
-        parse_quoted_value(pm, xml, |_, xml, _| xml.consume_name().map_err(|_| SpecificError::ExpectedSystemLiteral))
-        );
+        parse_quoted_value(pm, xml, |_, xml, quote| {
+            xml.consume_attribute_value(quote).map_err(|_| SpecificError::ExpectedSystemLiteral)
+        })
+    );
 
     success(external_id, xml)
 }
@@ -1311,11 +1313,11 @@ mod test {
 
     #[test]
     fn a_prolog_with_a_document_type_declaration() {
-        let package = quick_parse("<?xml version='1.0'?><!DOCTYPE doc SYSTEM \"doc.dtd\"><hello/>");
+        let package = quick_parse(r#"<?xml version="1.0"?><!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd"><cXML />"#);
         let doc = package.as_document();
         let top = top(&doc);
 
-        assert_qname_eq!(top.name(), "hello");
+        assert_qname_eq!(top.name(), "cXML");
     }
 
     #[test]

@onelson
Copy link
Contributor

onelson commented Mar 10, 2018

Oh, mm. 😊 I guess I was about to go on a wild goose chase.

@onelson
Copy link
Contributor

onelson commented Mar 10, 2018

Taking almost verbatim what was shown here, the test is passing.

In order to not be a total plagiarist, I'm going to try and work up a solution for the OP's case. Worth a shot at least!

@onelson
Copy link
Contributor

onelson commented Mar 11, 2018

I've added the example from @frp as a test (which is now passing). Not sure what else I can do outside of addressing any nitpicks.

onelson added a commit to LaikaStudios/sxd-document that referenced this issue Mar 12, 2018
shepmaster pushed a commit to LaikaStudios/sxd-document that referenced this issue Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants