Bug #4176

parse-xml() in Saxon-JS loses html namespace

Added by David Cramer 8 months ago. Updated 4 months ago.

Status: In Progress
Priority: Normal
Sprint/Milestone: -
Start date: 2019-03-22
Due date:
% Done: 0%
Applies to JS Branch: 1.0, Trunk
Fix Committed on JS Branch:
Fixed in JS Release:
SEF Generated with:
Company: -
Contact person: -
Additional contact persons: -

Description

To reproduce:

  1. Unzip the attached zip file, saxon-js-parse-xml-bug.zip, to a directory served by a web server.
  2. Make Saxon-JS 1.2.0 available in the same directory.
  3. Edit test.html to adjust the path to Saxon-JS.
  4. Open the URL of test.html in a browser.
  5. Paste the markup below into the text area.
  6. Click the "Click Me" button.
  7. Use the browser's Inspect Element to inspect the text "Where's my namespace?" Notice that the atom namespace has been preserved, but the <p> element has lost its namespace.
<atom:foo xmlns:atom="http://www.w3.org/2005/Atom">
   <html:p xmlns:html="http://www.w3.org/1999/xhtml">Where's my namespace?</html:p>
</atom:foo>

In my case, I'm taking user input (parsed and transformed) and making a POST request to a web service. The service's XSD expects the html part of the document to be in a namespace and rejects the request if it is not. Even if I transform the parsed input to match elements without a namespace and try to add the html namespace back on, the namespace is still missing when the document is written out.

saxon-js-parse-xml-bug.zip (3.33 KB) David Cramer, 2019-03-23 00:29

History

#1 Updated by Martin Honnen 8 months ago

Are you sure that it is the parse-xml function that loses the XHTML namespace? Or the following <xsl:copy-of select="$document"/> into a browser's text/html HTML document? It might be worth checking what $document/*/*/concat(namespace-uri(), ';', name()) outputs.
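A rough browser-side counterpart to that check is sketched below. It reports the namespace URI and name of each child of the root element, much like the XPath expression above, but it inspects the browser's DOM directly rather than Saxon-JS's view of it. The function name describeChildrenOfRoot is made up for this illustration.

```javascript
// Hypothetical diagnostic helper, not part of Saxon-JS.
// For each element child of the root element, returns "namespace-uri;name",
// mirroring $document/*/*/concat(namespace-uri(), ';', name()).
function describeChildrenOfRoot(doc) {
    return Array.from(
        doc.documentElement.children,
        (el) => `${el.namespaceURI || ""};${el.nodeName}`
    );
}

// In a browser console, assuming `text` holds the markup from the description:
//   describeChildrenOfRoot(new DOMParser().parseFromString(text, "application/xml"));
```

If the browser's own parser reports the XHTML namespace here, the loss happens later, which is consistent with the question raised in this comment.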

#2 Updated by Martin Honnen 8 months ago

According to https://saxonica.plan.io/issues/3066, the special treatment of (X)HTML elements as "no-namespace elements" is by design in Saxon-JS.

So the issue is not caused when parsing the input (whether with parse-xml from a string or doc from a URI) but rather by the implementation of the XDM in Saxon-JS, which special-cases (X)HTML elements by stripping namespace and prefix:

    if (SaxonJS.getPlatform().inBrowser && node instanceof HTMLElement && node.namespaceURI == "http://www.w3.org/1999/xhtml") {
        return Atomic.QName.fromParts("", "", node.localName);
    }

On the other hand, the note there says "this should only apply to HTML DOM traversal, not XML".

Because the browser-side DOM implementation doesn't distinguish between HTML elements in an HTML document and in an XML document (since HTML5, both kinds of document give you an HTMLElement in the XHTML namespace), there doesn't seem to be a way to preserve the namespace and prefix of XHTML elements in XML DOM documents in Saxon-JS's XDM.

#3 Updated by Michael Kay 8 months ago

  • Project changed from Saxon to Saxon-JS
  • Assignee set to Debbie Lockett

#4 Updated by Debbie Lockett 8 months ago

  • Status changed from New to In Progress
  • Priority changed from Low to Normal
  • Applies to JS Branch 1.0, Trunk added

As Martin has suggested, I can confirm that the problem is not actually caused by parse-xml(). The result from parse-xml() does have the correct XHTML namespace (and html prefix).

Furthermore, the Saxon-JS special treatment of XHTML elements means that the prefix and namespace will indeed be lost, certainly (by design) at the point that $document is added to the HTML page with <xsl:result-document>. But in fact it looks like the namespaces are lost even when making the copy with <xsl:copy-of select="$document"/>, which I think is not by design:

Much like the Saxon-JS code that Martin points to in domutils.nameOfNode, the code in domutils.copyItem looks suspicious because we drop all namespaces if SaxonJS.getPlatform().inBrowser && newNode instanceof HTMLElement. I think we should also be checking whether context.resultDocument == window.document; i.e. whether newNode is to be added to the HTML page or not (note for instance this condition is used in context.createElement as used to first create newNode).
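The extra condition proposed above could look roughly like the following. This is a sketch only, not the Saxon-JS source: shouldDropNamespaces and the env parameter are hypothetical, and env merely stands in for the browser pieces (SaxonJS.getPlatform().inBrowser, the instanceof HTMLElement test, and window.document) so that the logic can be run outside a browser.

```javascript
// Hypothetical sketch of the proposed condition; names are illustrative.
// env.inBrowser      stands in for SaxonJS.getPlatform().inBrowser
// env.isHtmlElement  stands in for `newNode instanceof HTMLElement`
// env.pageDocument   stands in for window.document
function shouldDropNamespaces(newNode, context, env) {
    return Boolean(
        env.inBrowser &&
        env.isHtmlElement(newNode) &&
        // Only drop namespaces when the node is destined for the HTML page.
        context.resultDocument === env.pageDocument
    );
}
```

With this condition, an XHTML element copied into a temporary tree keeps its namespace, while the same element written to the HTML page is still treated specially.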

You say that the actual intention is to send the $document in a POST request. I have done some further testing with the supplied repro, to see how that can work. One point to note is that if you are using ixsl:schedule-action/@http-request, you will need to be careful to ensure that the supplied body is a document-node(), else it seems that the XHTML namespace may get lost at this stage. For example, edit the $document variable in the button onclick template as follows:

    <xsl:template match="button[@id = 'clickMe']" mode="ixsl:onclick">
        <xsl:variable name="document" as="document-node()">
            <xsl:try>
                <xsl:sequence select="parse-xml(ixsl:get(ixsl:page()//textarea, 'value'))"/>
                <xsl:catch><xsl:document><not-a-document/></xsl:document></xsl:catch>
            </xsl:try>
        </xsl:variable>

        <xsl:variable name="request"
            select="map{'body': $document,
                        'method': 'POST',
                        'media-type': 'application/xhtml+xml',
                        'href': '...'}"/>

        <ixsl:schedule-action http-request="$request">
            <xsl:call-template name="handleResponse"/>
        </ixsl:schedule-action>
    </xsl:template>

Does this help you get a bit further? I guess it may depend on what further transforming you actually want to do to the result from parse-xml(), before it gets sent in the POST request...

#5 Updated by Michael Kay 8 months ago

At Debbie's instigation, I've been giving this some thought and trying to work back to first principles. We should probably be consistent with the way that the HTML5 specification attempts to resolve the problem, by modifying the semantics of XPath 1.0 and XSLT 1.0 as described here:

https://html.spec.whatwg.org/multipage/infrastructure.html#interactions-with-xpath-and-xslt

There are two parts to this.

Firstly, no-namespace names in path expressions are taken, under some circumstances, to match elements in the XHTML namespace. The specific circumstance is that the context node for the expression is "from an HTML DOM". This phrase is a bit too informal for our purposes; in an XSLT context it raises questions such as: if you do an xsl:copy-of a subtree from the HTML page, does that operate like an HTML DOM for this purpose? One way of interpreting the rule might be: for any axis step where the principal node kind is element and the node test is in the form of an NCName, if the context item for that axis step is an element in the XHTML namespace, or a document node whose child element is an element in the XHTML namespace, then interpret the name test as matching names in the XHTML namespace.
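To make that interpretation concrete, here is a minimal sketch of the rule over plain objects. The node shape ({kind, namespaceURI, localName, children}) and both function names are assumptions made for the illustration; this is not how Saxon-JS represents nodes or name tests.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Namespace that an unprefixed name test is taken to mean, given the
// context item of the axis step (hypothetical helper).
function namespaceForUnprefixedTest(contextNode) {
    if (contextNode.kind === "element" && contextNode.namespaceURI === XHTML_NS) {
        return XHTML_NS;
    }
    if (contextNode.kind === "document") {
        const root = (contextNode.children || []).find((c) => c.kind === "element");
        if (root && root.namespaceURI === XHTML_NS) {
            return XHTML_NS;
        }
    }
    return ""; // otherwise: ordinary XPath behaviour, no namespace
}

// Does `candidate` match the unprefixed name test `localName`
// when the axis step is evaluated with `contextNode` as context item?
function matchesUnprefixedNameTest(contextNode, candidate, localName) {
    return candidate.kind === "element" &&
        candidate.localName === localName &&
        (candidate.namespaceURI || "") === namespaceForUnprefixedTest(contextNode);
}
```

Under this reading, the html:p element from the bug report would not match an unprefixed p when the context item is atom:foo, because the context is not in the XHTML namespace.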

Secondly, tree construction. The HTML5 spec says that if the output method is "html", then: If the transformation program outputs an element in no namespace, the processor must, prior to constructing the corresponding DOM element node, change the namespace of the element to the HTML namespace, ASCII-lowercase the element's local name, and ASCII-lowercase the names of any non-namespaced attributes on the element.
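The conversion quoted from the HTML5 spec can be sketched as a pure function over an element description. The object shape ({namespaceURI, localName, attributes}) and the function names are assumptions for the illustration; a real implementation would work on result-tree nodes before creating DOM elements.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// ASCII-lowercase: only A-Z are affected, other characters are untouched.
function asciiLowercase(s) {
    return s.replace(/[A-Z]/g, (c) => c.toLowerCase());
}

// Hypothetical helper: returns the description of the element as it
// should be created in the DOM. Elements already in a namespace are
// returned unchanged.
function prepareForHtmlTree(element) {
    if (element.namespaceURI) {
        return element;
    }
    return {
        namespaceURI: XHTML_NS,
        localName: asciiLowercase(element.localName),
        attributes: (element.attributes || []).map((attr) =>
            attr.namespaceURI
                ? attr // namespaced attributes keep their names
                : { ...attr, localName: asciiLowercase(attr.localName) }
        )
    };
}
```

Keeping this as a separate step makes it easier to apply the conversion at whichever point is chosen, rather than tying it to the output method.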

Tying this to the output method doesn't work very well, at least not for XSLT 2.0+ where we have temporary trees and secondary result documents. The right time to do this conversion seems to be when we inject nodes into the HTML page. I think it's unambiguous when we are doing that, because it is only done using recognizable calls on xsl:result-document (plus things like ixsl:set-attribute).

Note that apart from these changes, elements in the HTML5 DOM appear as being in the XHTML namespace, for example namespace-uri() returns the XHTML namespace, and searches that explicitly request elements in the XHTML namespace succeed.

I have wondered about a more general solution to the problem of unprefixed element names in path expressions, which has always been one of the biggest usability problems in XPath. One solution is to interpret an unprefixed element name as matching on the local name only (that is, matching any namespace) -- and relying on the syntax Q{}local to match no namespace, where that is needed. Of course this would be a big incompatibility, so it would have to be switchable, but I suspect that it wouldn't break very much code, because the number of cases is very small where you write /a/b/c and actually want to get no match on elements having the right local name but the wrong namespace. Another approach is a generalisation of the HTML5 modification to XPath semantics: a mode of operation in which an unprefixed NCName in an axis step means "match elements having the same namespace as the context node" (or the root element, if starting at a document node).

Whatever we do it's probably sufficiently disruptive that we should only consider it for JS2.

#6 Updated by Debbie Lockett 5 months ago

Various changes have been made on the development branches for Saxon 10.0 and Saxon-JS 2.x. The major change is that in Saxon-JS 2.x, elements in the HTML5 DOM will appear as being in the XHTML namespace (rather than the null namespace as in Saxon-JS 1.x).

One consequence of the changes is that SEFs generated (with target JS2) using 9.9 or earlier will not necessarily work correctly with Saxon-JS 2.x. For example, user interaction events for page clicks, etc. don't work properly, due to the namespace change for HTML page elements.

This is one reason why we have decided that for use in Saxon-JS 2.x, SEFs will be required to be generated using 10.0 or later. A check for this has now been added in Saxon-JS 2.x (an error is thrown if the Saxon version used to generate the SEF is less than 10.0).
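A version check of this kind might look like the sketch below. How Saxon-JS 2.x actually reads the generating version from the SEF is not described in this thread, so the function name and the assumption that the version arrives as a dotted string such as "9.9.1.5" are purely illustrative.

```javascript
// Hypothetical check: reject SEFs generated by a Saxon release earlier than 10.0.
// `saxonVersion` is assumed to be a dotted version string.
function assertSefGeneratedBy10OrLater(saxonVersion) {
    const major = parseInt(String(saxonVersion).split(".")[0], 10);
    if (Number.isNaN(major) || major < 10) {
        throw new Error(
            "SEF was generated with Saxon " + saxonVersion +
            "; Saxon 10.0 or later is required"
        );
    }
}
```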

#7 Updated by Debbie Lockett 4 months ago

Saxon 10.0 development branch changes have been committed in the last couple of months to implement the following:

A new option -ns is available on the net.sf.saxon.Transform command line. It can be used to specify the default namespace for elements and types (in effect, a default for the xpath-default-namespace attribute). In addition, the value -ns:##any means that unprefixed element names appearing in path expressions and match patterns will match elements in any namespace (or none), and the value -ns:##html5 simulates the rules in the HTML5 specification, meaning that unprefixed element names match elements that are either in no namespace, or in the XHTML namespace.
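The three behaviours described for -ns can be compared side by side with a small matching function. This is a sketch of the semantics as described in this comment, not Saxon code; unprefixedNameMatches and the plain-object element shape are assumptions for the illustration.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical model of how an unprefixed element name matches under
// each -ns setting: "##any", "##html5", or a namespace URI
// (use "" for no default namespace).
function unprefixedNameMatches(nsOption, element, localName) {
    if (element.localName !== localName) {
        return false;
    }
    const uri = element.namespaceURI || "";
    switch (nsOption) {
        case "##any":
            return true; // any namespace, or none
        case "##html5":
            return uri === "" || uri === XHTML_NS;
        default:
            return uri === nsOption; // ordinary default-namespace behaviour
    }
}
```

For the markup in the original report, an unprefixed p would match html:p under both ##any and ##html5, but not under the default setting with no namespace.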
