NET XdmNode built by DocumentBuilder.Build from XmlReader
Added by Martin Honnen over 3 years ago
I am trying to find a way to feed an AngleSharp DOM document to Saxon on .NET. The only way I have found so far is to use the AngleSharp.XPath package, which exposes a CreateNavigator method that returns an XPathNavigator; its ReadSubtree method returns an XmlReader, which I then pass to Saxon's DocumentBuilder.Build method.
With some fixes to the online AngleSharp.XPath I get Saxon (using 10.5.1) to build an XdmNode. What is interesting is that the element nodes of an HTML5 document end up in the XHTML namespace, but the OuterXml or serialization of the XdmNode doesn't show the XHTML namespace: the elements are serialized in no namespace.
So I get an XdmNode document whose elements have a NodeName.Uri of http://www.w3.org/1999/xhtml, and I need to declare that namespace to select elements in the document with XPath. In that respect Saxon has a consistent DOM with elements in the XHTML namespace; on the other hand, OuterXml doesn't show the namespace.
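The need to bind the namespace before selecting elements can be reproduced outside Saxon. The following is a minimal sketch that uses only the JDK's DOM and javax.xml.xpath APIs, not Saxon's .NET API; the class name XhtmlXPathDemo and the prefix h are choices made for this example.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class XhtmlXPathDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Builds html/body in the XHTML namespace, with no xmlns attributes at all.
    static Document buildDoc() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Node html = doc.appendChild(doc.createElementNS(XHTML, "html"));
            Node body = html.appendChild(doc.createElementNS(XHTML, "body"));
            body.setTextContent("This is a test.");
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the string value of the first node selected by expr ("" if none).
    // The prefix "h" is an arbitrary choice for this example.
    static String select(String expr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML.equals(uri) ? "h" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        try {
            return xpath.evaluate(expr, buildDoc());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + select("//body") + "]");   // prints []
        System.out.println("[" + select("//h:body") + "]"); // prints [This is a test.]
    }
}
```

The unprefixed path selects nothing because the element names are in the XHTML namespace, whether or not any namespace declaration is present in the tree.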
Is that partly a flaw in Saxon?
I see that the created XmlReader doesn't output namespace declaration attributes (i.e. it doesn't report an attribute named xmlns with the value http://www.w3.org/1999/xhtml), so that is certainly a difference from reading lexical markup. .NET's "native" XmlDocument, when pulling in the XmlReader created by ReadSubtree, builds a DOM with elements in the XHTML namespace and serializes with the XHTML namespace.
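The difference between a tree read from lexical markup and one that carries namespaced elements without declaration attributes can be illustrated with the JDK's DOM. This is only a Java analogue of the situation, not the .NET XmlReader API; the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlnsAttributeDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Root element obtained by parsing lexical markup that carries xmlns="...".
    static Element parsedRoot() {
        String markup = "<html xmlns=\"" + XHTML + "\"><body>This is a test.</body></html>";
        try {
            Document doc = newBuilder().parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Root element created through the API: the namespace is set on the name,
    // but no declaration attribute exists.
    static Element builtRoot() {
        Document doc = newBuilder().newDocument();
        Element html = doc.createElementNS(XHTML, "html");
        doc.appendChild(html);
        return html;
    }

    public static void main(String[] args) {
        System.out.println(parsedRoot().getNamespaceURI() + " " + parsedRoot().hasAttribute("xmlns"));
        System.out.println(builtRoot().getNamespaceURI() + " " + builtRoot().hasAttribute("xmlns"));
    }
}
```

Both roots report the same namespace URI, but only the parsed one has an xmlns attribute. A consumer that relies on declaration attributes or events sees two different inputs.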
Replies (7)
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen over 3 years ago
I guess the way Saxon works out the namespaces is in https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/saxon10/entry/src/main/java/net/sf/saxon/dotnet/DotNetPullProvider.java#L180
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen over 3 years ago
It looks as if Saxon's normal serialization fails to output the namespace of the elements as a namespace declaration; however, using xdmNode.WriteTo(XmlWriter.Create(...)), i.e. having Saxon write to a .NET XmlWriter, manages to output the namespace declaration. That is probably also why XmlDocument or XDocument output the namespace declaration, as they rely on the underlying .NET XmlWriter implementations to serialize trees.
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay over 3 years ago
The Saxon/.NET product basically accepts XML input in three ways:
(a) via its embedded version of the Apache Xerces parser (Xerces/J converted to .NET using IKVMC)
(b) via an instance of the Microsoft System.Xml.XmlReader interface
(c) as a wrapped instance of the Microsoft XML DOM.
It's probably also possible to supply input directly using the low-level Receiver interface, but that's only for the intrepid.
It looks like you're trying to do (b), but with an implementation of XmlReader that differs in some ways from Saxon's expectations of the interfaces, particularly in the area of namespaces. Since XmlReader is only specified rather informally, it's difficult to be definitive about whether that would be a nonconformance. Generally we would have to say that using any XmlReader other than the standard Microsoft ones is untested and therefore unsupported; on the other hand we're happy to look at making changes if they're testable.
Support for HTML-style DOMs is always tricky because there's no universal mapping from HTML to XDM.
Saxon's serializer doesn't do any namespace fixup; it assumes that it's starting with an XDM instance that satisfies all the XDM consistency constraints.
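To make the term concrete, here is a rough sketch of what a namespace fixup pass does, written against the JDK DOM. It is not Saxon's implementation and does not operate on XDM; the class name is invented for the example, and attribute namespaces are deliberately ignored to keep it short.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class NamespaceFixupSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Adds an xmlns / xmlns:prefix attribute wherever an element's own name
    // uses a prefix-to-URI binding that is not declared in scope.
    // Simplification: namespaces used by attributes are not checked.
    static void fixup(Element element, Map<String, String> inherited) {
        Map<String, String> scope = new HashMap<>(inherited);

        // Record declarations that already exist on this element.
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (XMLNS_NS.equals(attr.getNamespaceURI())) {
                String declaredPrefix = "xmlns".equals(attr.getNodeName()) ? "" : attr.getLocalName();
                scope.put(declaredPrefix, attr.getNodeValue());
            }
        }

        String prefix = element.getPrefix() == null ? "" : element.getPrefix();
        String uri = element.getNamespaceURI() == null ? "" : element.getNamespaceURI();
        if (!uri.equals(scope.getOrDefault(prefix, ""))) {
            String attrName = prefix.isEmpty() ? "xmlns" : "xmlns:" + prefix;
            element.setAttributeNS(XMLNS_NS, attrName, uri);
            scope.put(prefix, uri);
        }

        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                fixup((Element) child, scope);
            }
        }
    }

    // Builds html/body in the XHTML namespace without declarations, then fixes it up.
    static Element fixedUpSample() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(XHTML, "html");
            doc.appendChild(html);
            html.appendChild(doc.createElementNS(XHTML, "body")).setTextContent("This is a test.");
            fixup(html, new HashMap<>());
            return html;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After the pass, the html element carries the single declaration and body inherits it. A serializer that skips this step can only write out the declarations that are already present in the tree it is given.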
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen over 3 years ago
public void WriteTo(XmlWriter writer)
{
    JNodeInfo node = ((JNodeInfo)value);
    JDotNetReceiver receiver = new JDotNetReceiver(writer);
    receiver.setPipelineConfiguration(node.getConfiguration().makePipelineConfiguration());
    receiver.open();
    node.copy(receiver, net.sf.saxon.om.CopyOptions.ALL_NAMESPACES, JLoc.NONE);
    receiver.close();
}
Does the node.copy(receiver, net.sf.saxon.om.CopyOptions.ALL_NAMESPACES, JLoc.NONE); call above perform that namespace fixup (e.g. ensuring that an element node in a certain namespace emits a namespace declaration for that namespace)?
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay over 3 years ago
The Saxon code shown in WriteTo() won't do any namespace fixup, but the XmlWriter might - I don't know.
Saxon namespace fixup (in 10.x) is implemented essentially in the ComplexContentOutputter class.
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Martin Honnen over 3 years ago
So even with Saxon Java, if an in-memory DOM tree is built and wrapped, as in
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
Document doc = documentBuilderFactory.newDocumentBuilder().newDocument();
Node root = doc.appendChild(doc.createElementNS("http://www.w3.org/1999/xhtml", "html"));
Node body = root.appendChild(doc.createElementNS("http://www.w3.org/1999/xhtml", "body"));
body.setTextContent("This is a test.");
Processor processor = new Processor(true);
DocumentBuilder docBuilder = processor.newDocumentBuilder();
XdmNode xdmDoc = docBuilder.wrap(doc);
System.out.println(xdmDoc);
it is expected/normal that the namespace is missing, i.e. the output is
<html>
<body>This is a test.</body>
</html>
as namespace fixup only occurs in an XSLT processing/serialization step like pushing the tree through an identity transformation?
In that case I guess I can for the time being stop worrying about the lack of namespace nodes in the AngleSharp DOM.
RE: NET XdmNode built by DocumentBuilder.Build from XmlReader - Added by Michael Kay over 3 years ago
Generally, if a DOM is built programmatically, it can violate all sorts of constraints (e.g. element names can be invalid), and the consequences if it doesn't satisfy XDM constraints are unpredictable.