Bug #6212

closed

net.sf.saxon.s9api.Serializer tries to guess doctype from content and fails at it.

Added by Vivien Guillet about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Low
Assignee:
Category: Documentation
Sprint/Milestone: -
Start date: 2023-10-02
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 12, trunk
Fix Committed on Branch: 12, trunk
Fixed in Maintenance Release:
Platforms: Java

Description

Given an org.w3c.dom.Document named doc, I run serialization using

Processor processor = new Processor(false);
Serializer serializer = processor.newSerializer();
Source domSource = new DOMSource(doc);
String s = serializer.serializeToString(domSource);

When the root element of the document is anything other than html, the document is serialized correctly, but when the root element is html it is serialized incorrectly.

For example, a document consisting of only a root node whose tagName is "ztml" is serialized as

<?xml version="1.0" encoding="UTF-8"?><ztml/>

But if the tagName of the root node happens to be "html", the serializer makes a wild guess at the doctype and outputs

The root node should not be used to guess the doctype, since this results in some XML documents being serialized as HTML documents.

(If a temporary fix or workaround exists, it would be great to have for now!)

Thanks.

Actions #1

Updated by Vivien Guillet about 1 year ago

Sorry, the wrong output is missing from the above report!

Output for a document consisting of only a root node whose tagName is "html":

<!DOCTYPE HTML><html></html>

Actions #2

Updated by Michael Kay about 1 year ago

Unfortunately the default serialization method is defined by rules in the XSLT specification.

We're not absolutely bound to follow those rules when serialization is invoked using our own API, but it's simpler if we do; and perhaps more to the point, if we were to change the rules now it would be very disruptive for many existing users.

The XSLT rules are defined (for 3.0) in section 26 of the spec:

The default for the method attribute depends on the contents of the tree being serialized, and is chosen as follows. If the document node of the final result tree has an element child, and any text nodes preceding the first element child of the document node of the result tree contain only whitespace characters, then:

  • If the expanded QName of this first element child has local part html (in lower case), and namespace URI http://www.w3.org/1999/xhtml, then the default output method is normally xhtml. However, if the effective version of the outermost element of the principal stylesheet module in the top-level package has the value 1.0, and if the result tree is generated implicitly (rather than by an explicit xsl:result-document instruction), then the default output method in this situation is xml.
  • If the expanded QName of this first element child has local part html (in any combination of upper and lower case) and a null namespace URI, then the default output method is html.

In all other cases, the default output method is xml.
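The quoted rules can be read as a small decision function. The following is only an illustrative sketch in plain Java, not Saxon's implementation: the class name DefaultOutputMethod, the method choose, and the xslt10ImplicitResult flag are hypothetical names introduced here. It assumes the precondition from the spec text already holds, i.e. the document node has an element child preceded only by whitespace text.

```java
/**
 * Hypothetical helper (not part of Saxon) mirroring the XSLT 3.0
 * default-output-method rules quoted above.
 */
public final class DefaultOutputMethod {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private DefaultOutputMethod() {}

    /**
     * @param localName            local part of the first element child of the document node
     * @param namespaceUri         its namespace URI, or null/empty when it is in no namespace
     * @param xslt10ImplicitResult true only for the special case of a version 1.0 stylesheet
     *                             whose result tree is generated implicitly
     * @return "xhtml", "html" or "xml"
     */
    public static String choose(String localName, String namespaceUri,
                                boolean xslt10ImplicitResult) {
        boolean noNamespace = namespaceUri == null || namespaceUri.isEmpty();

        // Lower-case "html" in the XHTML namespace: normally xhtml.
        if ("html".equals(localName) && XHTML_NS.equals(namespaceUri)) {
            return xslt10ImplicitResult ? "xml" : "xhtml";
        }
        // "html" in any letter case and no namespace: html.
        if (noNamespace && "html".equalsIgnoreCase(localName)) {
            return "html";
        }
        // Everything else: xml.
        return "xml";
    }

    public static void main(String[] args) {
        System.out.println(choose("ztml", null, false)); // xml
        System.out.println(choose("html", null, false)); // html
        System.out.println(choose("html", XHTML_NS, false)); // xhtml
    }
}
```

Under this reading, the "html" root in the report falls into the second branch, which is why the html method (and its DOCTYPE) was selected.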

We could perhaps be more explicit in the Javadoc for the Serializer class about how the defaults are chosen.

It's easy enough, of course, to set the method property explicitly if you don't like the rules for choosing a default.

Actions #3

Updated by Vivien Guillet about 1 year ago

Alright! Setting the method property explicitly with serializer.setOutputProperty(Serializer.Property.METHOD, "xml"); works fine.
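For readers who want the workaround in one piece, here is a minimal sketch that combines the snippet from the description with the explicit method property. It assumes Saxon is on the classpath; the class name ExplicitXmlMethod and the helper serializeAsXml are hypothetical names used only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;

import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import org.w3c.dom.Document;

public class ExplicitXmlMethod {

    /**
     * Serializes a DOM document with the output method forced to "xml",
     * so the default-method rules are not consulted.
     */
    static String serializeAsXml(Document doc) throws Exception {
        Processor processor = new Processor(false);
        Serializer serializer = processor.newSerializer();
        // Explicit method: the root element name no longer influences the choice.
        serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
        return serializer.serializeToString(new DOMSource(doc));
    }

    public static void main(String[] args) throws Exception {
        // Build a document whose only node is a no-namespace <html> root element.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElementNS(null, "html"));

        System.out.println(serializeAsXml(doc));
    }
}
```

The same setOutputProperty call should also accept "html" or "xhtml" if one of those methods is actually wanted.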

It would surely be a wonderful idea to make the automagic part of the serialization clear in the Serializer Javadoc, maybe even by providing examples.

Anyway thanks a lot for your quick answer.

Actions #4

Updated by Michael Kay about 1 year ago

I have made improvements to the Javadoc of the s9api Serializer class.

Actions #5

Updated by Michael Kay about 1 year ago

  • Category set to Documentation
  • Status changed from New to Resolved
  • Assignee set to Michael Kay
  • Applies to branch 12, trunk added
  • Fix Committed on Branch 12, trunk added

Actions #6

Updated by O'Neil Delpratt about 1 year ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 12.4 added
  • Fixed in Maintenance Release deleted (12.1)

Bug fix applied in the Saxon 12.4 maintenance release
