XDM representation of XML DOM with adjacent text nodes and CDATA sections: SaxonJS.XPath.evaluate parse-xml vs. xslt3

Added by Martin Honnen about 1 year ago

On Slack there is a thread about the XDM representation of an XML DOM that has adjacent text nodes, e.g. an element whose child nodes are a text node, a CDATA section and another text node.

Mike says there, about XSLT in SaxonJS, that we "(b) merge adjacent text/CDATA nodes". That statement astonished me, as it didn't appear to happen in the tests in my XSLT 3 or XPath 3.1 fiddles using SaxonJS.

So I looked into it further and ran a stylesheet through the xslt3 command line of SaxonJS 2.5. The stylesheet counts the child nodes of an input element whose lexical markup has a text node followed by a CDATA section followed by a text node:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  expand-text="yes">
  
  <xsl:template match="test">
    <xsl:next-match/>
    <xsl:comment>child node count: {count(node())}, { node() ! (. instance of text()) }</xsl:comment>
  </xsl:template>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:next-match/>
    <xsl:comment>Run with {system-property('xsl:product-name')} {system-property('xsl:product-version')} {system-property('Q{http://saxon.sf.net/}platform')}</xsl:comment>
  </xsl:template>
  
</xsl:stylesheet>

Input XML:

<root>
  <test>a <![CDATA[ && b < c && ]]> d</test>
</root>

Indeed, when running it from the command line with xslt3, the output is

<?xml version="1.0" encoding="UTF-8"?><root>
  <test>a  &amp;&amp; b &lt; c &amp;&amp;  d</test><!--child node count: 1, true-->
</root><!--Run with SaxonJS 2.5 Node.js-->

confirming that the adjacent nodes are merged in the XDM representation.

Wondering why I hadn't noticed that in my XSLT 3 or XPath 3.1 fiddles, I looked into their code and found that they use XPath 3.1, e.g. parse-xml, to parse input strings into input documents with SaxonJS. In that case I indeed don't see any merging of adjacent text nodes, e.g.

const SaxonJS = require('saxon-js');

const xml1 = `<root>
  <test>a <![CDATA[ && b < c && ]]> d</test>
</root>`;

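// Note: parse-xml($xml)/* selects the document element (root), so node()
// in the parenthesized step ranges over the child nodes of root.
const result = SaxonJS.XPath.evaluate(
  `parse-xml($xml)/*/(count(node()) || '; ' || (node() ! (. instance of text())) => string-join(''))`,
  null, // no context item; the input is passed in as the $xml parameter
  { params: { xml: xml1 } }
);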

console.log(result);

outputs 3; truefalsetrue.

So although in both cases it's SaxonJS that does, or at least controls, the building of the XDM tree, in the end the XSLT approach works with a different XDM tree than XPath 3.1 does after a parse-xml call.

Is that a known difference or just a flaw/quirk you might want to fix?


Replies (2)

RE: XDM representation of XML DOM with adjacent text nodes and CDATA sections: SaxonJS.XPath.evaluate parse-xml vs. xslt3 - Added by Martin Honnen about 1 year ago

When I use parse-xml "directly" from the command line with the -xp option, the merging of adjacent text nodes does seem to happen, e.g. I get

xslt3 -xp:"parse-xml('<root>a<![CDATA[ && b < c && ]]>d</root>')/*/(count(node()) || '; ' || string-join(node(), ' '))"
"1; a && b < c && d"

So, for some reason I don't understand, parse-xml sometimes merges adjacent text nodes and sometimes doesn't.

RE: XDM representation of XML DOM with adjacent text nodes and CDATA sections: SaxonJS.XPath.evaluate parse-xml vs. xslt3 - Added by Michael Kay about 1 year ago

Note: this prompted me to raise a spec issue about whether parse-xml() is supposed to implement xsl:strip-space/xsl:preserve-space. I think the spec is unclear on this.

It's relevant insofar as SaxonJS (under the right circumstances...) does a pass over a supplied DOM modifying it (in situ) (a) to strip whitespace text nodes where required, and (b) to merge adjacent text/CDATA nodes.

Exactly what the "right circumstances" are needs closer investigation. However, it's a path where the Node.js and browser behaviour are likely to differ, because the way we do XML parsing in the two cases is quite different.
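
To make the two steps of that pass concrete, here is a minimal sketch of (a) stripping whitespace-only text nodes and (b) merging adjacent text/CDATA nodes. It is not SaxonJS's actual code: the plain-object tree model (`nodeType`, `data`, `children`) and the helper names `normalizeChildren` and `isTextLike` are assumptions made up for illustration, and whitespace stripping is reduced to a simple on/off flag rather than real xsl:strip-space rules.

```javascript
// DOM node type constants (values as defined by the DOM specification).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;

// Hypothetical helper: true for nodes that XDM treats as text content.
const isTextLike = (n) =>
  n.nodeType === TEXT_NODE || n.nodeType === CDATA_SECTION_NODE;

// Hypothetical helper: rewrites the child list of `element` (recursively)
// so that runs of adjacent text/CDATA nodes become one text node and,
// if `stripWhitespace` is set, whitespace-only text nodes are dropped.
function normalizeChildren(element, stripWhitespace = false) {
  const merged = [];
  for (const child of element.children) {
    const last = merged[merged.length - 1];
    if (isTextLike(child)) {
      if (last && last.nodeType === TEXT_NODE) {
        last.data += child.data; // (b) extend the current run of text
      } else {
        merged.push({ nodeType: TEXT_NODE, data: child.data });
      }
    } else {
      if (child.children) normalizeChildren(child, stripWhitespace);
      merged.push(child);
    }
  }
  // (a) strip after merging, so a run is judged as a whole
  element.children = stripWhitespace
    ? merged.filter((n) => n.nodeType !== TEXT_NODE || /\S/.test(n.data))
    : merged;
  return element;
}

// The <test> element from the sample input: text, CDATA section, text.
const test = {
  nodeType: ELEMENT_NODE,
  name: 'test',
  children: [
    { nodeType: TEXT_NODE, data: 'a ' },
    { nodeType: CDATA_SECTION_NODE, data: ' && b < c && ' },
    { nodeType: TEXT_NODE, data: ' d' },
  ],
};

const before = test.children.length;
normalizeChildren(test);
console.log(before, test.children.length); // 3 1
```

Under this model, a tree that has gone through the pass reports one child node for the test element, and a tree that has not reports three, which is the kind of difference the thread is about.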

Note also that issue #5144 is relevant here.
