Bug #5144
SaxonJS.XPath.evaluate method fails to select CDATA section node as text() node
Description
I have run into an odd problem trying to use the Saxon-JS 2.3 browser-side SaxonJS.XPath.evaluate method to select text nodes which are in the original parsed XML marked up as a CDATA section. The native XPath 1.0 implementation of the browser finds them, while Saxon-JS 2 doesn't seem to select them.
Test case is:
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset=UTF-8>
<title>Saxon-JS 2 CDATA section as text node test</title>
<script src="/Saxon-JS-2.3/SaxonJS2.rt.js"></script>
</head>
<body>
<h1>Saxon-JS 2 CDATA section as text node test</h1>
<section>
<h2>XPath tests</h2>
<script>
const xmlSample1 = `<root>
<description>description 1<\/description>
<description><![CDATA[<p>description 2]]><\/description>
<\/root>`;
SaxonJS.getResource({ type: 'xml', text: xmlSample1 })
  .then(doc => {
    const textNodes = SaxonJS.XPath.evaluate(`root/description/text()`, doc, { resultForm: 'array' });
    console.log(textNodes);
    console.log(textNodes.length);
    const textNodeCount = SaxonJS.XPath.evaluate(`count(root/description/text())`, doc);
    console.log(textNodeCount);
    const xpath1TextNodeCount = doc.evaluate(`count(root/description/text())`, doc, null, XPathResult.NUMBER_TYPE).numberValue;
    console.log(xpath1TextNodeCount);
  });
</script>
</section>
</body>
</html>
Output in the console with Chrome is:
[text]
1
1
2
Online at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node1.html.
On Node.js this doesn't seem to occur.
Updated by Martin Honnen about 3 years ago
I have changed the test case to add some more elements with text nodes and with CDATA section nodes at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node2.html, but it really seems that CDATA section nodes in the browser DOM are not selected by foo/text() when using XPath.evaluate.
Updated by Martin Honnen about 3 years ago
Now I tried the same XPath expressions with description/text() from XSLT called through fn:transform from XPath.evaluate at https://martin-honnen.github.io/js/2021/cdata-section-test-from-xslt1.html, but then the CDATA sections are selected and processed.
Updated by Michael Kay about 3 years ago
Have you tried whether there's a difference between HTML and XML?
The code in ItemType.matcher() reads
matcher() {
    switch (this.kind) {
        case K.DOCUMENT_NODE:
            return item => DomUtils.isNode(item) && (item.nodeType === K.DOCUMENT_NODE || item.nodeType === K.DOCUMENT_FRAGMENT_NODE);
        default:
            return item => DomUtils.isNode(item) && item.nodeType === this.kind;
    }
}
So I think if item.nodeType is CDATA_SECTION_NODE then the item type text() isn't going to match.
Note also that in Saxon-JS, we don't make any attempt to merge adjacent text/cdata nodes in the DOM - if adjacent nodes are present, they will be matched separately.
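To illustrate the point about the matcher, here is a minimal sketch (not the SaxonJS source). `strictMatcher` mimics the default branch quoted above, and `textMatcher` is a hypothetical variant that also accepts CDATA section nodes. The node objects are plain stand-ins for the two `description` children in the test case; only the numeric `nodeType` values (3 for text, 4 for CDATA section) come from the DOM.

```javascript
// nodeType values as defined by the DOM.
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;

// Mimics the default branch quoted above: exact nodeType comparison.
const strictMatcher = kind => item => item.nodeType === kind;

// Hypothetical variant: text() also accepts CDATA section nodes.
const textMatcher = item =>
  item.nodeType === TEXT_NODE || item.nodeType === CDATA_SECTION_NODE;

// Plain-object stand-ins for the children in the first test case.
const children = [
  { nodeType: TEXT_NODE, data: 'description 1' },
  { nodeType: CDATA_SECTION_NODE, data: '<p>description 2' },
];

const strictCount = children.filter(strictMatcher(TEXT_NODE)).length;
const lenientCount = children.filter(textMatcher).length;
console.log(strictCount, lenientCount);
```

With these stand-ins the strict matcher counts one node and the lenient one counts two, which lines up with the 1 versus 2 reported in the description.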
Updated by Martin Honnen about 3 years ago
I don't think I can put a CDATA section into an HTML (text/html) document, so I am not sure what your question "Have you tried whether there's a difference between HTML and XML?" refers to. Or do you want me to run the script doing the XPath evaluation from an XHTML/application/xhtml+xml document?
Updated by Martin Honnen about 3 years ago
I have tried to use XHTML (application/xhtml+xml) as the host environment for the JavaScript code running SaxonJS.XPath.evaluate against the XML DOM at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node2.xhtml, but that doesn't make a difference: Saxon-JS 2.3 in the browser doesn't select the text nodes formed from CDATA sections.
In Node.js, it looks as if these are selected, but perhaps the parser/document builder there doesn't distinguish plain text nodes and CDATA section nodes; all selected nodes (6 in the example) have nodeType 3 under Node.js. Not sure whether that is some Saxon-JS fixup or whether the underlying parser/document builder does that.
Updated by Michael Kay about 3 years ago
On the node.js side, Saxon has code that sits between the SAX parser and the DOM builder, and in that code I suspect that we convert CDATA nodes to text nodes. I suspect that if you build a DOM programmatically and create a CDATA node explicitly (or if you construct a DOM using a different parsing mechanism), you'll see the same problem. In the browser, XML parsing is entirely under browser control and Saxon doesn't get a look in.
Updated by Martin Honnen about 3 years ago
It is getting weirder, not even node() seems to select the CDATA section nodes: https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-child-node2.html.
Does
const xmlSample1 = `<root>
<description>description 1<\/description>
<description><![CDATA[<p>description 2]]><\/description>
<description>description 3<\/description>
<description><![CDATA[<p>description 4]]><\/description>
<description>description 5<\/description>
<description><![CDATA[<p>description 6]]><\/description>
<\/root>`;
SaxonJS.getResource({ type: 'xml', text: xmlSample1 })
  .then(doc => {
    const childNodes = SaxonJS.XPath.evaluate(`root/description/node()`, doc, { resultForm: 'array' });
    console.log(childNodes);
    console.log(childNodes.length);
    const childNodeCount = SaxonJS.XPath.evaluate(`count(root/description/node())`, doc);
    console.log(childNodeCount);
    const xpath1ChildNodeCount = doc.evaluate(`count(root/description/node())`, doc, null, XPathResult.NUMBER_TYPE).numberValue;
    console.log(xpath1ChildNodeCount);
  });
and only finds three child nodes where XPath 1 finds 6:
(3) [text, text, text]
3
3
6
Updated by Michael Kay about 3 years ago
Not weird if you look at the code, node() expands to
[K.ELEMENT_NODE, K.TEXT_NODE, K.COMMENT_NODE, K.PROCESSING_INSTRUCTION_NODE].includes(item.nodeType);
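A rough sketch of why that expansion yields three rather than six, again using plain objects in place of DOM nodes. `isNode` copies the kind list quoted above; `isNodeWithCdata` is a hypothetical extension and not necessarily how SaxonJS would address it.

```javascript
// nodeType values as defined by the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;
const PROCESSING_INSTRUCTION_NODE = 7;
const COMMENT_NODE = 8;

// The kind list quoted above: CDATA_SECTION_NODE is absent.
const nodeKinds = [ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, PROCESSING_INSTRUCTION_NODE];
const isNode = item => nodeKinds.includes(item.nodeType);

// Hypothetical extension that also admits CDATA section nodes.
const isNodeWithCdata = item =>
  isNode(item) || item.nodeType === CDATA_SECTION_NODE;

// Stand-ins for the six description children: odd ones are plain text,
// even ones are CDATA sections, as in the test case.
const descriptionChildren = [1, 2, 3, 4, 5, 6].map(n => ({
  nodeType: n % 2 === 1 ? TEXT_NODE : CDATA_SECTION_NODE,
  data: n % 2 === 1 ? `description ${n}` : `<p>description ${n}`,
}));

const foundWithoutCdata = descriptionChildren.filter(isNode).length;
const foundWithCdata = descriptionChildren.filter(isNodeWithCdata).length;
console.log(foundWithoutCdata, foundWithCdata);
```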
Updated by Martin Honnen about 3 years ago
I will leave it at this, I guess; I can't get at the createCDATASection method on the doc exposed on Node.js, so there is no way to test whether the problem arises there too. And for me the source code is minified, so it is hard to read through it or make sense of variable names of length 1.
I think Norm mentioned somewhere on Slack that bug fixes for Saxon-JS are not to be expected before Declarative Amsterdam so let's see when he (or Debbie?) will pick it up.
Updated by Norm Tovey-Walsh about 3 years ago
Indeed. There's a lot on our plates before Declarative Amsterdam. But we'll see what we can do.
Updated by Norm Tovey-Walsh about 3 years ago
Indeed, Mike is right that I can make the selection work by allowing text() to match either TEXT_NODE children or CDATA_SECTION children, which are a distinct type in the browser DOM.
There are almost 40 places where the code checks for TEXT_NODE, and each one (and perhaps more) is going to have to be reviewed and adapted to handle the CDATA_SECTION type, I guess.
Updated by Michael Kay about 3 years ago
And worse, there will be places like the code cited in comment #3 that don't test for TEXT_NODE explicitly, but assume a one-to-one mapping between DOM node kinds and XDM node kinds.
Possible approach to testing: in NodeJSPlatform.js, line 120, where we convert all SAX ontext and oncdata events to text nodes in the DOM, create CDATA nodes instead, and see what breaks.
Updated by Norm Tovey-Walsh about 3 years ago
I've made a first pass through the code examining the places where TEXT_NODE was explicitly referenced (which doesn't address Mike's very valid point about the places where it's just assumed that the 1:1 mapping will work). I'd say about half of them already handled the CDATA_SECTION_NODE case as well. In most places, it was fairly obvious what to change, but this will need to be run through testing and we should write some more unit tests for CDATA sections.
Updated by Norm Tovey-Walsh over 2 years ago
- Priority changed from Normal to High
- Sprint/Milestone changed from Saxon-JS 2.3 to SaxonJS 3.0
Updated by Martin Honnen about 2 years ago
https://martin-honnen.github.io/xslt/2023/saxon-js-2.5-test-cdata-as-text-node1.html suggests SaxonJS 2.5 at least finds the cdata section that 2.3 doesn't find in https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node1.html. And https://martin-honnen.github.io/xslt/2023/saxon-js-2.5-test-cdata-as-child-node2.html suggests SaxonJS 2.5 finds the cdata section nodes that 2.3 doesn't find in https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-child-node2.html.
Updated by Norm Tovey-Walsh 3 months ago
As Martin notes, SaxonJS 2.5 works better than 2.3 did. We've made some improvements in this area. There are aspects of the parse that we don't control, so it's challenging to construct test cases. The simple ones, where an input document with CDATA sections is parsed, seem to work (inasmuch as text() matches CDATA sections).
I'm inclined to mark this as resolved. We can reopen it or open a new issue if we discover an actual use case where things appear to be broken. Even then, we may or may not be able to fix it. We don't, for example, coalesce adjacent text nodes as one would expect; we take what the parser gives us. So the answers may sometimes be different, but not in ways we can fix.
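For anyone who needs adjacent text and CDATA nodes treated as a single text node, the following is a minimal sketch of what coalescing could look like on the caller's side. It is an assumption-laden illustration, not something SaxonJS does: `coalesceTextNodes` is a hypothetical helper, and it works on plain objects with `nodeType` and `data` properties rather than on a live DOM.

```javascript
// nodeType values as defined by the DOM.
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;

// Hypothetical helper: returns a new child list in which each run of
// adjacent text/CDATA nodes is replaced by one text-like object.
// The input array and its objects are not modified.
function coalesceTextNodes(children) {
  const result = [];
  for (const node of children) {
    const textual =
      node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE;
    const last = result[result.length - 1];
    if (textual && last !== undefined && last.nodeType === TEXT_NODE) {
      // Extend the copy pushed for the previous textual node.
      last.data += node.data;
    } else if (textual) {
      // Start a new run with a fresh copy, normalised to a text node.
      result.push({ nodeType: TEXT_NODE, data: node.data });
    } else {
      result.push(node);
    }
  }
  return result;
}

const mixedChildren = [
  { nodeType: TEXT_NODE, data: 'a' },
  { nodeType: CDATA_SECTION_NODE, data: '<b>' },
  { nodeType: 1, nodeName: 'br' },
  { nodeType: TEXT_NODE, data: 'c' },
];
const coalesced = coalesceTextNodes(mixedChildren);
console.log(coalesced.map(n => n.data ?? n.nodeName));
```

Working on copies keeps the parsed document untouched, at the cost of the results no longer being real DOM nodes.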