Project

Profile

Help

Bug #5144

open

SaxonJS.XPath.evaluate method fails to select CDATA section node as text() node

Added by Martin Honnen about 3 years ago. Updated 3 months ago.

Status:
New
Priority:
High
Assignee:
-
Category:
XPath Conformance
Sprint/Milestone:
Start date:
2021-10-24
Due date:
% Done:

0%

Estimated time:
Applies to JS Branch:
2
Fix Committed on JS Branch:
Fixed in JS Release:
SEF Generated with:
Platforms:
Company:
-
Contact person:
-
Additional contact persons:
-

Description

I have run into an odd problem trying to use the Saxon-JS 2.3 browser-side SaxonJS.XPath.evaluate method to select text nodes which are in the original parsed XML marked up as a CDATA section. The native XPath 1.0 implementation of the browser finds them, while Saxon-JS 2 doesn't seem to select them.

Test case is:

<!DOCTYPE html>
<html lang=en>
  <head>
    <meta charset=UTF-8>
    <title>Saxon-JS 2 CDATA section as text node test</title>
    <script src="/Saxon-JS-2.3/SaxonJS2.rt.js"></script>
  </head>
  <body>
    <h1>Saxon-JS 2 CDATA section as text node test</h1>
    <section>
      <h2>XPath tests</h2>
      <script>
const xmlSample1 = `<root>
  <description>description 1<\/description>
  <description><![CDATA[<p>description 2]]><\/description>
<\/root>`;

SaxonJS.getResource({ type: 'xml', text: xmlSample1 })
    .then(doc => {
        const textNodes = SaxonJS.XPath.evaluate(`root/description/text()`, doc, { resultForm: 'array' });

        console.log(textNodes);

        console.log(textNodes.length);
  
        const textNodeCount = SaxonJS.XPath.evaluate(`count(root/description/text())`, doc);
  
        console.log(textNodeCount);
  
        const xpath1TextNodeCount = doc.evaluate(`count(root/description/text())`, doc, null, XPathResult.NUMBER_TYPE).numberValue;
  
        console.log(xpath1TextNodeCount);
    });        
      </script>
    </section>
  </body>
</html>

Output in the console with Chrome is:

[text]
1
1
2

Online at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node1.html.

On Node.js this doesn't seem to occur.

Actions #1

Updated by Martin Honnen about 3 years ago

I have changed the test case to add some more elements with text node and with CDATA section nodes at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node2.html but it really seems CDATA section nodes in the DOM in the browser are not selected by foo/text() when using XPath.evaluate.

Actions #2

Updated by Martin Honnen about 3 years ago

Now I tried the same XPath expressions with description/text() from XSLT called through fn:transform from XPath.evaluate at https://martin-honnen.github.io/js/2021/cdata-section-test-from-xslt1.html, but then the CDATA sections are selected and processed.

Actions #3

Updated by Michael Kay about 3 years ago

Have you tried whether there's a difference between HTML and XML?

The code in ItemType.matcher() reads

matcher() {
        switch (this.kind) {
            case K.DOCUMENT_NODE:
                return item => DomUtils.isNode(item) && (item.nodeType === K.DOCUMENT_NODE || item.nodeType === K.DOCUMENT_FRAGMENT_NODE);
            default:
                return item => DomUtils.isNode(item) && item.nodeType === this.kind;
        }
    }

So I think if item.nodeType is CDATA_SECTION_NODE then the item type text() isn't going to match.

Note also that in Saxon-JS, we don't make any attempt to merge adjacent text/cdata nodes in the DOM - if adjacent nodes are present, they will be matched separately.

Actions #4

Updated by Martin Honnen about 3 years ago

I don't think I can put a CDATA section into an HTML (text/html) document, so I am not sure what your question "Have you tried whether there's a difference between HTML and XML?" refers to. Or do you want me to run the script doing the XPath evaluation from an XHTML/application/xhtml+xml document?

Actions #5

Updated by Martin Honnen about 3 years ago

I have tried to use XHTML (application/xhtml+xml) as the host environment for the Javascript code running SaxonJS.XPath.evaluate against the XML DOM at https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-text-node2.xhtml, but that doesn't make a difference, Saxon-JS 2.3 in the browser doesn't select the text nodes formed from CDATA sections.

In Node.js, it looks as if these are selected, but perhaps the parser/document builder there doesn't distinguish plain text nodes and CDATA section nodes, all (in the example 6) selected nodes have nodeType as 3 under Node.js. Not sure whether that is some Saxon-JS fixup or whether the underlying parser/document builder does that.

Actions #6

Updated by Michael Kay about 3 years ago

On the node.js side, Saxon has code that sits between the SAX parser and the DOM builder, and in that code I suspect that we convert CDATA nodes to text nodes. I suspect that if you build a DOM programmatically and create a CDATA node explicitly (or if you construct a DOM using a different parsing mechanism), you'll see the same problem. In the browser, XML parsing is entirely under browser control and Saxon doesn't get a look in.

Actions #7

Updated by Martin Honnen about 3 years ago

It is getting weirder, not even node() seems to select the CDATA section nodes: https://martin-honnen.github.io/js/2021/saxon-js-test-cdata-as-child-node2.html.

Does

const xmlSample1 = `<root>
  <description>description 1<\/description>
  <description><![CDATA[<p>description 2]]><\/description>
  <description>description 3<\/description>
  <description><![CDATA[<p>description 4]]><\/description>
  <description>description 5<\/description>
  <description><![CDATA[<p>description 6]]><\/description>
<\/root>`;

SaxonJS.getResource({ type: 'xml', text: xmlSample1 })
    .then(doc => {
        const childNodes = SaxonJS.XPath.evaluate(`root/description/node()`, doc, { resultForm: 'array' });

        console.log(childNodes);

        console.log(childNodes.length);
  
        const childNodeCount = SaxonJS.XPath.evaluate(`count(root/description/node())`, doc);
  
        console.log(childNodeCount);
  
        const xpath1ChildNodeCount = doc.evaluate(`count(root/description/node())`, doc, null, XPathResult.NUMBER_TYPE).numberValue;
  
        console.log(xpath1ChildNodeCount);
    });  

and only finds three child nodes where XPath 1 finds 6:

(3) [text, text, text]
3
3
6
Actions #8

Updated by Michael Kay about 3 years ago

Not weird if you look at the code, node() expands to

[K.ELEMENT_NODE, K.TEXT_NODE, K.COMMENT_NODE, K.PROCESSING_INSTRUCTION_NODE].includes(item.nodeType);
Actions #9

Updated by Martin Honnen about 3 years ago

I will leave at this, I guess, can't get at the createCDATASection method on Node.js on the doc exposed so no way to test if the problem arises there too. And for me the source code is minimized, so it is hard to read through that or make sense of variable names of length 1.

I think Norm mentioned somewhere on Slack that bug fixes for Saxon-JS are not to be expected before Declarative Amsterdam so let's see when he (or Debbie?) will pick it up.

Actions #10

Updated by Norm Tovey-Walsh about 3 years ago

Indeed. There's a lot on our plates before Declarative Amsterdam. But we'll see what we can do.

Actions #11

Updated by Norm Tovey-Walsh about 3 years ago

Indeed, Mike is right that I can make the selection work by allowing text() to match either TEXT_NODE children or CDATA_SECTION children which are a distinct type in the browser DOM.

There are almost 40 places where the code checks for TEXT_NODE and each one (and perhaps more) are going to have to be reviewed and adapted to handle the CDATA_SECTION type, I guess.

Actions #12

Updated by Michael Kay about 3 years ago

And worse, there will be places like the code cited in comment #3 that don't test for TEXT_NODE explicitly, but assume a one-to-one mapping between DOM node kinds and XDM node kinds.

Possible approach to testing: in NodeJSPlatform.js, line 120 where we convert all SAX ontext and oncdata events to text nodes in the DOM, create CDATA nodes instead, and see what breaks.

Actions #13

Updated by Norm Tovey-Walsh about 3 years ago

I've made a first pass through the code examining the places where TEXT_NODE was explicitly referenced (which doesn't address Mike's very valid point about the places where it's just assumed that the 1:1 mapping will work). I'd say about half of them already handled the CDATA_SECTION_NODE case as well. In most places, it was fairly obvious what to change, but this will need to be run through testing and we should write some more unit tests for CDATA sections.

Actions #15

Updated by Norm Tovey-Walsh over 2 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone changed from Saxon-JS 2.3 to SaxonJS 3.0
Actions #17

Updated by Norm Tovey-Walsh 3 months ago

As Martin notes, SaxonJS 2.5 works better than 2.3 did. We've made some improvements in this area. There are aspects of the parse that we don't control, so it's challenging to construct test cases. The simple ones, where an input document with CDATA sections is parsed seem to work. (In as much as text() matches CDATA sections).

I'm inclined to mark this as resolved. We can reopen it or open a new issue if we discover an actual use case where things appear to be broken. Even then, we may or may not be able to fix it. We don't, for example, coallese adjacent text nodes as one would expect, we take what the parser gives us. So the answers may sometimes be different but not things we can fix.

Please register to edit this issue

Also available in: Atom PDF Tracking page