Bug #4704
closedparse-xml fails in browser on string that has XML declaration in CDATA section
100%
Description
With Saxon JS 2 in the browser (tested with Chrome 84), the JavaScript code
SaxonJS.XPath.evaluate(`parse-xml($xml)`, [], { params : { 'xml' : `<root><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<a/>
<a/>]]></root>` }})
fails for me with
SaxonJS2.rt.js:548 Uncaught q {message: "Misplaced or malformed XML", stack: "Error↵ at new q (http://saxonica.com/saxon-js/d…S/SaxonJS2.rt.js:789:494)↵ at <anonymous>:1:15", name: "XError", code: "FODC0006"}
I think the parsing should work without giving any error.
Updated by Martin Honnen over 4 years ago
With Node the code runs fine and doesn't give a parse error.
Updated by Michael Kay over 4 years ago
In the browser, Saxon uses the browser-supplied XML parser and these have many limitations. We don't know exactly what the limitations are, but we have no way of fixing them other than switching to use our own XML parser.
On Node.js, Saxon-JS uses a modified version of the SAX2 open-source parser. Some of the modifications we've made are to fix non-conformances like this.
Looking at the code, NodeJSPlatform.js linie 232 has
if (/^.+<\?xml/i.test(str)) {
// throw new Error();
}
which is pretty useless; more intelligently, line 115 has
parser["onprocessinginstruction"] = function (tag) {
if (tag.name !== "xml") {
appendElement(document.createProcessingInstruction(tag.name, tag.body));
}
};
which is Saxon adding a check at application level that should have been done by the parser. But it's still not right, because (a) the check should be case-blind, and (b) we should error the construct rather than ignoring it. There are many little details like this where the XML parsing technology on the Javascript platform is grossly inadequate.
Updated by Michael Kay over 4 years ago
In fact, BrowserPlatform.js line 443 has the same check, without the commenting out:
if (/^.+<\?xml/i.test(str)) {
throw new Error();
}
and this is leading to the error you observe. It's a half-hearted and incorrect attempt to fix a bug in the browser that we're not in a position to fix properly.
If we had access to a fully-functional and conformant XML parser written in JS, then I think we'd be offering a choice so you can decide whether to use that in place of the browser-vendor's parser. But we don't, sadly.
Updated by Martin Honnen over 4 years ago
Michael Kay wrote:
In the browser, Saxon uses the browser-supplied XML parser and these have many limitations. We don't know exactly what the limitations are, but we have no way of fixing them other than switching to use our own XML parser.
I understand but
new DOMParser().parseFromString(`<root><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<a/>
<a/>]]></root>`, 'application/xml')
with the browser's XML parser does work.
I guess the check if (/^.+<\?xml/i.test(str))
is in there to fix some lack of strictness of the parsers in browsers allowing content before an XML declaration.
I can't really judge which issue is more important so feel free to close this as not fixable.
Updated by Community Admin almost 4 years ago
- Fix Committed on JS Branch 2 added
- Fix Committed on JS Branch deleted (
2.0)
Updated by Debbie Lockett almost 4 years ago
- Applies to JS Branch 2 added
- Fix Committed on JS Branch deleted (
2)
Updated by Norm Tovey-Walsh over 2 years ago
- Sprint/Milestone set to SaxonJS 3.0
Updated by Debbie Lockett over 2 years ago
Returning to this after it has been forgotten about for some time. It does look like it would be worthwhile tidying up this code.
I assumed the /^.+<\?xml/i.test(str)
check in BrowserPlatform (which produces the observed error) was there to ensure other tests passed, but it looks like we can simply remove this check without causing test failures. Being liberal here is probably the better way to go.
We suspect that the check actually originated from trying to handle XML parsing on the Node.js side; but was replaced there once the sax2parse
function handled this in a better way, with the parser["onprocessinginstruction"]
code:
parser["onprocessinginstruction"] = function (tag) {
if (tag.name !== "xml") {
appendElement(document.createProcessingInstruction(tag.name, tag.body));
}
};
Above, Mike had suggested improving this code:
(a) the check should be case-blind, and (b) we should error the construct rather than ignoring it
For (a), it seems that tag.name
is always lower-case, so I don't think there's anything to do.
For (b), I propose adding:
else if (document.hasChildNodes()) {
// Check for misplaced XML/text declarations
throw new XError("XML parse error - misplaced XML declaration", "FODC0006");
}
We only want to produce an error for misplaced XML declarations; I believe we should continue to ignore them otherwise.
Updated by Debbie Lockett over 2 years ago
- Status changed from New to Resolved
- Sprint/Milestone changed from SaxonJS 3.0 to SaxonJS 2.5
- Applies to JS Branch Trunk added
- Fix Committed on JS Branch 2, Trunk added
Code changes (in BrowserPlatform.js and NodeJSPlatform.js) committed on saxonjs2 and main branches. Browser and nodejs unit tests (iss4704) added.
Updated by Norm Tovey-Walsh over 2 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in JS Release set to SaxonJS 2.5
Fixed in SaxonJS 2.5.
Please register to edit this issue
Also available in: Atom PDF Tracking page