Support #6254

closed

The undeclaration of the default namespace xmlns="" on a child element is not reported to an org.xml.sax.ContentHandler via prefix mappings, which leads to an invalid XdmNode in the result

Added by John Francis 5 months ago. Updated 3 days ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Internals
Sprint/Milestone:
-
Start date:
2023-11-16
% Done:

0%


Description

We have a sample file

<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0">
    <info>
      <title>Additional Phrase Container Example</title>
    </info>
    <sect1>
      <title>Additional Phrase Container Example</title>
      <para>
        <extension1 xmlns="">Some text.</extension1>        
      </para>
      <para>
        <myns:extension2 xmlns:myns="myns">Some text.</myns:extension2>        
      </para>
      <para>
        <extension3>Some text.</extension3>
      </para>
    </sect1>
</article>

Prior to Saxon 10 we got a call to org.xml.sax.ContentHandler.startPrefixMapping(String, String) with both arguments set to "", corresponding to the xmlns="" default namespace undeclaration at /article/sect1[1]/para[1]/extension1[1]/@xmlns. From Saxon 10 onwards we no longer get these events when we use the Saxon API (as opposed to JAXP) and net.sf.saxon.event.ContentHandlerProxy. We do, however, get the correct arguments to startElement(String, String, String, Attributes): "", "extension1", "extension1". But if all that our ContentHandler does is pass the events on to a Saxon class (net.sf.saxon.event.ReceivingContentHandler) that serializes the result into an XdmNode, that XdmNode is invalid. If we then use it as the source for running a net.sf.saxon.s9api.XsltTransformer, we get the following error even though it contains virtually no code:

XTDE0440  Cannot output a namespace node for the default namespace
  (http://docbook.org/ns/docbook) when the element is in no namespace
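
For reference, the SAX event described above can be observed without Saxon at all. The following is a minimal sketch that uses only the JDK's JAXP parser; the class name PrefixMappingDemo and the helper prefixMappings are made up for illustration and are not part of the attached project. It records every startPrefixMapping call made while parsing a cut-down version of the sample file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from the attached Maven project.
class PrefixMappingDemo {

    /** Parses the XML with a namespace-aware JAXP parser and records each startPrefixMapping call. */
    static List<String> prefixMappings(String xml) {
        List<String> seen = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                // Brackets make empty strings visible in the output.
                seen.add("[" + prefix + "] -> [" + uri + "]");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Cut-down version of the sample file from the description.
        String xml = "<article xmlns=\"http://docbook.org/ns/docbook\">"
                + "<para><extension1 xmlns=\"\">Some text.</extension1></para>"
                + "</article>";
        prefixMappings(xml).forEach(System.out::println);
    }
}
```

With the parser bundled in the JDK, the undeclaration on extension1 should show up as a mapping from the empty prefix to the empty URI, which is the event that is missing when the same document is pushed through ContentHandlerProxy.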

What is the correct way now to call a org.xml.sax.ContentHandler using an XdmNode as a source and producing an XdmNode as a result if it contains xmlns=""?


I have included a zip of a standalone Maven project to reproduce the problem. There are 3 tests: com.deltaxml.test.Tests.test1() just does a JAXP version of the test, which of course runs fine; com.deltaxml.test.Tests.test2() uses the Saxon APIs to do the same thing and throws the above error; and com.deltaxml.test.Tests.test3() tweaks the test ContentHandler com.deltaxml.test.SaxHandler so that it detects when it has received a startElement event which is in a default namespace bound to the null prefix and simulates a prefix mapping. I would be grateful if you could let me know whether you see any problem with this 'workaround'.
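
To make the idea behind the 'workaround' concrete, here is a rough sketch of such a filter built on the JDK's org.xml.sax.helpers.XMLFilterImpl. It is not the code from com.deltaxml.test.SaxHandler; the class names UndeclarerDemo and DefaultNamespaceUndeclarer are hypothetical, and the sketch assumes that the upstream source reports correct namespace URIs in startElement but may omit the startPrefixMapping("", "") call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical names throughout; a sketch of the idea, not the attached project's code.
class UndeclarerDemo {

    /** Synthesizes startPrefixMapping("", "") when a no-namespace element appears under a non-empty default namespace. */
    static class DefaultNamespaceUndeclarer extends XMLFilterImpl {
        private final Deque<String> defaultNs = new ArrayDeque<>();    // default namespace per open element
        private final Deque<Boolean> synthesized = new ArrayDeque<>(); // did we add a mapping for this element?
        private String pendingDefault;                                 // default namespace declared for the next element

        @Override
        public void startPrefixMapping(String prefix, String uri) throws SAXException {
            if (prefix.isEmpty()) pendingDefault = uri;
            super.startPrefixMapping(prefix, uri);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
            String inherited = defaultNs.isEmpty() ? "" : defaultNs.peek();
            String current = pendingDefault != null ? pendingDefault : inherited;
            pendingDefault = null;
            // Element is in no namespace, but a default namespace is still in scope.
            boolean needsUndeclaration = uri.isEmpty() && !current.isEmpty();
            if (needsUndeclaration) {
                super.startPrefixMapping("", "");
                current = "";
            }
            defaultNs.push(current);
            synthesized.push(needsUndeclaration);
            super.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            super.endElement(uri, local, qName);
            defaultNs.pop();
            if (synthesized.pop()) super.endPrefixMapping("");
        }
    }

    /** Feeds the filter an event stream that lacks the undeclaration and records what comes out. */
    static List<String> run() {
        List<String> events = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("map [" + prefix + "] -> [" + uri + "]");
            }
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start " + qName);
            }
        };
        DefaultNamespaceUndeclarer filter = new DefaultNamespaceUndeclarer();
        filter.setContentHandler(recorder);
        String docbook = "http://docbook.org/ns/docbook";
        Attributes none = new AttributesImpl();
        try {
            filter.startPrefixMapping("", docbook);
            filter.startElement(docbook, "para", "para", none);
            filter.startElement("", "extension1", "extension1", none); // no mapping event precedes this
            filter.endElement("", "extension1", "extension1");
            filter.endElement(docbook, "para", "para");
            filter.endPrefixMapping("");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The filter keeps one stack entry per open element so that the synthesized mapping is closed again with endPrefixMapping("") at the matching endElement. Whether this interacts badly with other namespace handling in a real pipeline is exactly the open question raised above.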

FWIW I ran the tests in a debugger with SaxonHE and saw that net.sf.saxon.tree.tiny.TinyTree has a net.sf.saxon.om.NamespaceBinding from prefix "" to nsURI "" in the net.sf.saxon.tree.tiny.TinyTree.namespaceBinding array, whereas in Saxon 12 net.sf.saxon.tree.tiny.TinyTree does not have an entry for it in the net.sf.saxon.tree.tiny.TinyTree.namespaceMaps array. The code in Saxon 10+ seems to implement xmlns="" as the removal of a namespace, but ContentHandlerProxy doesn't do anything to be backward compatible. All of which would be fine for us, except that when things are being passed through we end up getting an XdmNode which cannot be used!

Thanks for your time


Files

saxon10nullnamespaceproblem.zip (147 KB): A Maven project to reproduce the problem (John Francis, 2023-11-16 16:13)

Actions #1

Updated by John Francis 5 months ago

This seems similar to https://saxonica.plan.io/issues/6036 but 11.6 does not fix it.

Actions #2

Updated by John Francis 5 months ago

The Maven project will need to be edited to point to the location of your Saxon jars. I was using PE versions to reproduce; sorry, I should have used HE.

Actions #3

Updated by John Francis 5 months ago

FYI, the workaround seems to cause problems with other tests related to https://saxonica.plan.io/issues/4996, but I need to investigate this further.

Actions #4

Updated by Michael Kay 5 months ago

  • Status changed from New to In Progress

You are right that this is closely related to bug #6036.

The fix to #6036 involved inserting a NamespaceDifferencer into the Receiver pipeline whenever the pipeline feeds into a ContentHandlerProxy. The NamespaceDifferencer translates between the (new) Receiver model of namespaces, in which every startElement() event contains a NamespaceMap that identifies all in-scope namespace bindings, and the SAX model, in which changes in the namespace context are explicitly notified.

Unfortunately this fix wouldn't solve the problem for a ContentHandlerProxy created directly by a user application.

You can use the new factory method ContentHandlerProxy.makeInstance() to construct a ContentHandlerProxy fronted by a NamespaceDifferencer. It does this:

    public static Receiver makeInstance(ContentHandler handler, Properties serializationProps) {
        ContentHandlerProxy chp = new ContentHandlerProxy(handler);
        chp.setOutputProperties(serializationProps);
        return new NamespaceDifferencer(chp, serializationProps);
    }

You're also right to observe that this all stems from a radical change in the way namespaces are handled between Saxon9 and Saxon10, bringing it closer to the XDM model and further from the SAX model: this affects both the TinyTree and the Receiver interface.
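
As a purely conceptual illustration of the translation between the two models (this is not Saxon's NamespaceDifferencer code, and the class name NamespaceDiffSketch and method declarationsFor are invented for the example), the following sketch takes the full set of in-scope bindings for a parent and a child element and derives the SAX-style notifications needed to get from one to the other. Plain java.util.Map instances stand in for Saxon's NamespaceMap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only; hypothetical names, plain Maps stand in for Saxon's NamespaceMap.
class NamespaceDiffSketch {

    /** Returns the SAX-style declarations needed to move from the parent's in-scope bindings to the child's. */
    static List<String> declarationsFor(Map<String, String> parent, Map<String, String> child) {
        List<String> events = new ArrayList<>();
        // Bindings that are new or changed on the child become ordinary declarations.
        for (Map.Entry<String, String> e : child.entrySet()) {
            if (!e.getValue().equals(parent.get(e.getKey()))) {
                events.add("startPrefixMapping(\"" + e.getKey() + "\", \"" + e.getValue() + "\")");
            }
        }
        // Bindings that disappeared become undeclarations (empty URI).
        // Note: XML 1.0 only permits this for the default namespace, i.e. xmlns="".
        for (String prefix : parent.keySet()) {
            if (!child.containsKey(prefix)) {
                events.add("startPrefixMapping(\"" + prefix + "\", \"\")");
            }
        }
        return events;
    }

    public static void main(String[] args) {
        Map<String, String> para = new LinkedHashMap<>();
        para.put("", "http://docbook.org/ns/docbook"); // default namespace in scope on <para>
        Map<String, String> extension1 = new LinkedHashMap<>(); // nothing in scope on <extension1 xmlns="">
        declarationsFor(para, extension1).forEach(System.out::println);
    }
}
```

In the whole-map model the undeclaration is simply the absence of an entry, so a bridge to SAX has to compare each element's map with its parent's in order to notice that something was removed.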

Looking at your code, I do wonder why you found it necessary to use the internal Saxon classes ReceivingContentHandler and ContentHandlerProxy for bridging between SAX and Receiver interfaces, rather than relying on public API classes such as SAXSource, SAXResult, and SAXDestination. I guess you probably had a good reason, but I'm afraid that by interfacing to Saxon at this level, you do run the risk that the details can sometimes change between releases.

Actions #5

Updated by John Francis 5 months ago

Thank you for your response.

Please can you point me to docs or an example of how I can use the public APIs to take an XdmNode, run it through a SAX ContentHandler and get an XdmNode result? I have looked at the resources jar examples and at the javadoc and I cannot work it out. I always seem to end up needing ReceivingContentHandler?

Thanks.

Actions #6

Updated by John Francis 5 months ago

Do you mean that, since we are only really using XdmNodes and Saxon APIs now, we should be using a Saxon equivalent of ContentHandler and implementing that instead?

Actions #7

Updated by Michael Kay 5 months ago

I'm not sure why you want SAX in the picture to get from an XdmNode to another XdmNode, but presumably it's because there's a SAX filter in between.

If you've got a ContentHandler ch then the easiest way to feed it with an XdmNode node is using

SAXDestination dest = new SAXDestination(ch);
processor.writeXdmValue(node, dest);

If you then want to feed the events into another node, you can do

DocumentBuilder builder = processor.newDocumentBuilder();
BuildingContentHandler handler = builder.newBuildingContentHandler();

then feed your SAX events into handler, and at the end do handler.getDocumentNode().

If you want to bypass SAX and use Saxon's native Receiver interface instead, then you can subclass net.sf.saxon.event.ProxyReceiver. You can send an XdmNode to a Receiver using xdmNode.getUnderlyingNode().copy(....), and you can pipe the output of your ProxyReceiver into a Builder (e.g. new TinyBuilder(pipe)) which implements Receiver. As you might expect this is a bit more efficient but also a bit more fiddly, involving access to the Saxon Configuration and creation of a PipelineConfiguration object, and again it's a little bit more fragile because interfaces at this level can change between releases.

Of course if you literally just want to copy an XdmNode then you can do this with

XdmNode copy = documentBuilder.build(node.asSource());

Actions #8

Updated by Michael Kay 5 months ago

  • Tracker changed from Bug to Support
  • Status changed from In Progress to Closed

Closing this with no further action. Feel free to reopen if there is still a problem.

Actions #9

Updated by John Francis 5 months ago

Thanks for that Michael.

Here is a summary of where I think we are, and what I see as options for maintaining the SAX ContentHandler functionality. Any further comments would be appreciated.

To recap, the context is that there are a number of existing SAX ContentHandlers which operate on XdmNodes and return XdmNodes as most of the rest of the filter chain uses XSLTs.

We can

  1. Continue using the ContentHandlers and the approach you outline above.
  2. Rework the ContentHandlers to use the Saxon native approach subclassing net.sf.saxon.event.ProxyReceiver, as you outline above.
  3. Write an XSLT that calls the Java.

The first of these has the benefit that we can keep to using Saxon public APIs. (Looking at the documentation for Saxon 12, https://www.saxonica.com/documentation12/index.html#!javadoc, that would seem to mean any class/interface in s9api Interfaces or Other Interfaces.) The downside is that it has the weaknesses/difficulties of SAX's event model?

The second has the benefits of efficiency, as you say, and better tracks the XDM tree specification and its evolution. The downside here is that it is more fragile, with changes between Saxon releases potentially breaking it.

The third of these would be more complex, as we would have to reimplement the state machine from the ContentHandler in/via XSLT, but that is something we have done elsewhere.

Actions #10

Updated by Michael Kay 5 months ago

  • Status changed from Closed to In Progress

In your situation, to the extent that I understand it, I would try to use the mechanisms provided at the s9api level to support SAX input and output, namely the SAXSource for input and the SAXDestination for output. That is, your option 1. I would try to avoid diving any deeper into what I call the Saxon "system programming" layer which includes classes like NodeInfo and Receiver (your option 2); although we try to keep these interfaces as stable as we can, when there is a major design change like the change in namespace handling between 9 and 10 then it does tend to lead to small changes at this level, which are enough to break applications.

(Between 12 and 13 there is going to be a design change in the way we manage schema information, to get away from the "one global schema" model. It's very hard to do this kind of thing without some impact on APIs.)

Actions #11

Updated by John Francis 5 months ago

Thanks Michael, that was helpful. Please close the ticket.

Actions #12

Updated by Michael Kay 3 days ago

  • Status changed from In Progress to Closed
