NET XdmNode built by DocumentBuilder.Build from XmlReader

Added by Martin Honnen almost 3 years ago

I am trying to find a way to feed an AngleSharp DOM document to Saxon on .NET. The only way I have found so far is the AngleSharp.XPath package: its CreateNavigator method returns an XPathNavigator, whose ReadSubtree method returns an XmlReader, which I then pass to Saxon's DocumentBuilder.Build method.

With some fixes to the online AngleSharp.XPath I get Saxon (using 10.5.1) to build an XdmNode. What is interesting is that the element nodes of an HTML5 document end up in the XHTML namespace, but neither the OuterXml nor the serialization of the XdmNode shows the XHTML namespace: the elements are serialized in no namespace.

So I get an XdmNode document whose elements have a NodeName.Uri of http://www.w3.org/1999/xhtml, and I need to declare that namespace to select elements in the document with XPath. In that respect Saxon has a consistent DOM with elements in the XHTML namespace; OuterXml, on the other hand, doesn't show the namespace.

Is that partly a flaw in Saxon?

I see that the created XmlReader doesn't output namespace declaration attributes (i.e. it doesn't report an attribute named xmlns with the value http://www.w3.org/1999/xhtml), so that is certainly a difference from reading lexical markup.

.NET's "native" XmlDocument, when pulling in the XmlReader created by ReadSubtree, builds a DOM with elements in the XHTML namespace and serializes with the XHTML namespace.
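
The difference between a namespace carried in an element's name and a namespace declaration attribute can also be seen with the JDK's DOM API, used here only as a stand-in for the .NET XmlReader case. A minimal sketch, assuming the JDK's default DOM implementation; the class and method names are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Saxon or AngleSharp.
class NamespaceDeclarationCheck {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Root element obtained by parsing lexical markup that declares the namespace.
    static Element parsedRoot() {
        try {
            String markup = "<html xmlns=\"" + XHTML + "\"/>";
            Document doc = newBuilder().parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Root element created through the API; nothing adds an xmlns attribute here.
    static Element builtRoot() {
        Document doc = newBuilder().newDocument();
        return (Element) doc.appendChild(doc.createElementNS(XHTML, "html"));
    }

    // True if the element itself carries a default namespace declaration attribute.
    static boolean hasDefaultNamespaceDeclaration(Element element) {
        return element.hasAttribute("xmlns");
    }

    public static void main(String[] args) {
        for (Element e : new Element[] { parsedRoot(), builtRoot() }) {
            System.out.println(e.getNamespaceURI()
                    + " declared=" + hasDefaultNamespaceDeclaration(e));
        }
    }
}
```

Both roots report the XHTML namespace URI as part of their name, but only the parsed one carries an xmlns attribute.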


Replies (7)

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen almost 3 years ago

It looks as if Saxon's normal serialization fails to output the namespace of the elements as a namespace declaration; however, using xdmNode.WriteTo(XmlWriter.Create(...)), i.e. having Saxon write to a .NET XmlWriter, manages to output the namespace declaration. That is probably also why XmlDocument or XDocument output the namespace declaration, as they rely on the underlying .NET XmlWriter implementations to serialize trees.

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay almost 3 years ago

The Saxon/.NET product basically accepts XML input in three ways:

(a) via its embedded version of the Apache Xerces parser (Xerces/J converted to .NET using IKVMC)

(b) via an instance of the Microsoft System.Xml.XmlReader interface

(c) as a wrapped instance of the Microsoft XML DOM.

It's probably also possible to supply input directly using the low-level Receiver interface, but that's only for the intrepid.

It looks like you're trying to do (b), but with an implementation of XmlReader that differs in some ways from Saxon's expectations of the interfaces, particularly in the area of namespaces. Since XmlReader is only specified rather informally, it's difficult to be definitive about whether that would be a nonconformance. Generally we would have to say that using any XmlReader other than the standard Microsoft ones is untested and therefore unsupported; on the other hand we're happy to look at making changes if they're testable.

Support for HTML-style DOMs is always tricky because there's no universal mapping from HTML to XDM.

Saxon's serializer doesn't do any namespace fixup; it assumes that it's starting with an XDM instance that satisfies all the XDM consistency constraints.

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen almost 3 years ago

Does https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/saxon10/entry/src/main/csharp/api/Saxon.Api/Model.cs#L2600 with

        public void WriteTo(XmlWriter writer)
        {
            JNodeInfo node = ((JNodeInfo)value);
            JDotNetReceiver receiver = new JDotNetReceiver(writer);
            receiver.setPipelineConfiguration(node.getConfiguration().makePipelineConfiguration());
            receiver.open();
            node.copy(receiver, net.sf.saxon.om.CopyOptions.ALL_NAMESPACES, JLoc.NONE);
            receiver.close();
        }

and in particular the call node.copy(receiver, net.sf.saxon.om.CopyOptions.ALL_NAMESPACES, JLoc.NONE); perform that namespace fixup (e.g. ensuring that an element node in a certain namespace emits a namespace declaration for that namespace)?

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay almost 3 years ago

The Saxon code shown in WriteTo() won't do any namespace fixup, but the XmlWriter might - I don't know.

Saxon namespace fixup (in 10.x) is implemented essentially in the ComplexContentOutputter class.

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen almost 3 years ago

So even with Saxon Java, if an in-memory DOM tree is built and wrapped, as in

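        import javax.xml.parsers.DocumentBuilderFactory;
        import net.sf.saxon.s9api.DocumentBuilder;
        import net.sf.saxon.s9api.Processor;
        import net.sf.saxon.s9api.XdmNode;
        import org.w3c.dom.Document;
        import org.w3c.dom.Node;

        // Build a namespace-aware DOM by hand; no xmlns attribute is added anywhere.
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        Document doc = documentBuilderFactory.newDocumentBuilder().newDocument();
        Node root = doc.appendChild(doc.createElementNS("http://www.w3.org/1999/xhtml", "html"));
        Node body = root.appendChild(doc.createElementNS("http://www.w3.org/1999/xhtml", "body"));
        body.setTextContent("This is a test.");

        // Wrap (rather than copy) the DOM as an XdmNode and print its serialization.
        Processor processor = new Processor(true);
        DocumentBuilder docBuilder = processor.newDocumentBuilder();
        XdmNode xdmDoc = docBuilder.wrap(doc);
        System.out.println(xdmDoc);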

it is expected/normal that the namespace is missing, i.e. the output is

<html>
   <body>This is a test.</body>
</html>

as namespace fixup only occurs in an XSLT processing/serialization step like pushing the tree through an identity transformation?

In that case I guess I can for the time being stop worrying about the lack of namespace nodes in the AngleSharp DOM.
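
For comparison, the same DOM can be pushed through the JDK's own identity transformer rather than Saxon. This sketch assumes the JDK's built-in TransformerFactory is the one picked up (with Saxon on the classpath, TransformerFactory.newInstance() may return a different implementation), and the class name is hypothetical. It says nothing about Saxon's DOM wrapper; it only shows a serializer that derives the declaration from the element name:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical class name; uses only JDK APIs, no Saxon classes.
class JdkIdentitySerialization {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Builds the DOM from the post above and serializes it with an identity transform.
    static String serialize() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Node root = doc.appendChild(doc.createElementNS(XHTML, "html"));
            Node body = root.appendChild(doc.createElementNS(XHTML, "body"));
            body.setTextContent("This is a test.");

            // newTransformer() without a stylesheet yields an identity transformer.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize());
    }
}
```

With the JDK's default implementation the html element should come out with an xmlns declaration for the XHTML namespace, even though the DOM never contained an xmlns attribute.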

RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay almost 3 years ago

Generally, if a DOM is built programmatically, it can violate all sorts of constraints (e.g. element names can be invalid), and the consequences if it doesn't satisfy XDM constraints are unpredictable.
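
One possible way to bring a programmatically built DOM closer to what a parser would produce is the DOM Level 3 normalizeDocument() method, which performs namespace normalization when the "namespaces" parameter is true. The following is a sketch against the JDK's default DOM implementation, with a hypothetical class name; whether the result then serializes as hoped after being wrapped by Saxon is not something this thread confirms.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; uses only JDK DOM APIs.
class DomNamespaceNormalization {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Builds a DOM without xmlns attributes, then asks the DOM to normalize it.
    static Element buildNormalizedRoot() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element root = (Element) doc.appendChild(doc.createElementNS(XHTML, "html"));
            root.appendChild(doc.createElementNS(XHTML, "body"))
                .setTextContent("This is a test.");

            // "namespaces" is a standard DOMConfiguration parameter; set explicitly for clarity.
            doc.getDomConfig().setParameter("namespaces", Boolean.TRUE);
            doc.normalizeDocument();
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = buildNormalizedRoot();
        System.out.println("xmlns attribute after normalization: " + root.getAttribute("xmlns"));
    }
}
```

After normalization the root element should carry an xmlns attribute for the XHTML namespace, so the tree no longer relies on a later serializer to supply it.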
