Bug #6212
Closed
net.sf.saxon.s9api.Serializer tries to guess doctype from content and fails at it.
100%
Description
Given an org.w3c.dom.Document named doc, I run serialisation using
Processor processor = new Processor(false);
Serializer serializer = processor.newSerializer();
Source domSource = new DOMSource(doc);
String s = serializer.serializeToString(domSource);
When the root element of the document is anything other than html, the document is serialized correctly, but it is serialized incorrectly when the root element is html.
For example, a document consisting of only a root node whose tagName is "ztml" is serialized as
<?xml version="1.0" encoding="UTF-8"?><ztml/>
But if the tagName of the root node happens to be "html", the serializer makes a wild guess at the doctype and outputs
The root node should not be used to guess the doctype, because this results in some XML documents being serialized as HTML documents.
(If a temporary fix or workaround exists, it would be great to have for now!)
Thanks.
Updated by Vivien Guillet over 1 year ago
Sorry, the wrong output is missing from the above report!
Output for a document consisting of only a root node whose tagName is "html":
<!DOCTYPE HTML><html></html>
Updated by Michael Kay over 1 year ago
Unfortunately the default serialization method is defined by rules in the XSLT specification.
We're not absolutely bound to follow those rules when serialization is invoked using our own API, but it's simpler if we do; and perhaps more to the point, if we were to change the rules now it would be very disruptive for many existing users.
The XSLT rules are defined (for 3.0) in section 26 of the spec:
The default for the method attribute depends on the contents of the tree being serialized, and is chosen as follows. If the document node of the final result tree has an element child, and any text nodes preceding the first element child of the document node of the result tree contain only whitespace characters, then:
- If the expanded QName of this first element child has local part html (in lower case), and namespace URI http://www.w3.org/1999/xhtml, then the default output method is normally xhtml. However, if the effective version of the outermost element of the principal stylesheet module in the top-level package has the value 1.0, and if the result tree is generated implicitly (rather than by an explicit xsl:result-document instruction), then the default output method in this situation is xml.
- If the expanded QName of this first element child has local part html (in any combination of upper and lower case) and a null namespace URI, then the default output method is html.
- In all other cases, the default output method is xml.
We could perhaps be more explicit in the Javadoc for the Serializer class how the defaults are chosen.
It's easy enough, of course, to set the method property explicitly if you don't like the rules for choosing a default.
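To make the quoted rules concrete, here is a minimal sketch that predicts which default method would be chosen for a DOM document. It is an illustration of the spec text only, not Saxon's actual implementation: the class name DefaultMethodSketch and its methods are hypothetical, it uses only JDK classes, and it ignores the XSLT 1.0 backwards-compatibility exception because that concerns stylesheets rather than direct use of the Serializer.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: mirrors the default-method rules quoted from the XSLT 3.0 spec.
final class DefaultMethodSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Applies the three rules to the name of the first element child.
    public static String defaultMethod(String namespaceUri, String localName) {
        boolean noNamespace = namespaceUri == null || namespaceUri.isEmpty();
        if (XHTML_NS.equals(namespaceUri) && "html".equals(localName)) {
            return "xhtml"; // lower-case html in the XHTML namespace
        }
        if (noNamespace && "html".equalsIgnoreCase(localName)) {
            return "html";  // html in any case, no namespace
        }
        return "xml";       // all other cases
    }

    // A DOM Document node cannot have text node children, so the
    // "only whitespace before the first element" condition always holds here.
    public static String defaultMethod(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return "xml";
        }
        String local = root.getLocalName();
        if (local == null) {
            local = root.getTagName(); // element created without namespace support
        }
        return defaultMethod(root.getNamespaceURI(), local);
    }

    public static void main(String[] args) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);

        Document html = factory.newDocumentBuilder().newDocument();
        html.appendChild(html.createElement("html"));

        Document ztml = factory.newDocumentBuilder().newDocument();
        ztml.appendChild(ztml.createElement("ztml"));

        System.out.println(defaultMethod(html)); // prints "html"
        System.out.println(defaultMethod(ztml)); // prints "xml"
    }
}
```

This matches the behaviour in the report: a no-namespace root named html falls under the second rule, while ztml falls through to xml.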
Updated by Vivien Guillet over 1 year ago
Alright! Setting the method property explicitly with serializer.setOutputProperty(Serializer.Property.METHOD, "xml"); works fine.
It surely would be a wonderful idea to make the automagic part of the serialization clear in the Serializer Javadoc, maybe even by providing examples.
Anyway thanks a lot for your quick answer.
Updated by Michael Kay over 1 year ago
I have made improvements to the Javadoc of the s9api Serializer class.
Updated by Michael Kay over 1 year ago
- Category set to Documentation
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch 12, trunk added
- Fix Committed on Branch 12, trunk added
Updated by O'Neil Delpratt about 1 year ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.4 added
- Fixed in Maintenance Release deleted (12.1)
Bug fix applied in the Saxon 12.4 maintenance release