Support #1850

closed

Indenting issue with embedded XML

Added by Vadim Peretokin almost 11 years ago. Updated almost 11 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date: 2013-07-21
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Hi,

I have a demonstrative XSL transformation (transformation.xsl) which embeds an XML snippet (XML fragment.xml) inside another document as CDATA for DocBook purposes. The issue I'm running into is that all Saxon transformation engines (6.5.5, HE/PE/EE 9.4.06, both in XSLT 1.0 and 2.0) generate output in which the indentation of the embedded XML snippet goes wrong.

The attached saxon output.xml shows the output generated by Saxon engines - observe how and at the bottom of the document are not indented properly.

However, the same snippet and transformation, when run with Xalan, does produce the desired output (xalan output.xml).

I'm not quite certain what the issue is - I have a hunch it might be with the CDATA wrappers, but I'm not certain. If there's a better way to do it that fixes the problem, I'll take it :)

Any ideas on what is going wrong?

Thank you.


Files

issueswithindentinganembeddeddocument.zip (3.24 KB), Vadim Peretokin, 2013-07-22 00:58
testing.zip (1.73 KB), Vadim Peretokin, 2013-07-23 01:31
Actions #1

Updated by Michael Kay almost 11 years ago

I changed your stylesheet to use match="/*" instead of match="element", because there is no element named "element" in your source. I also changed it to use the XML fragment you supplied in the ZIP file in place of the long http URL. The result doesn't match the result you showed precisely, because it contains comments which aren't present in your sample output.

The basic reason for the poor indentation is that (because of the ellipses) your XML fragment contains mixed content. The indentation rules have been made stricter over the years for mixed content: the serializer is not allowed to change the value of a non-whitespace text node; it can only add whitespace text nodes where there would otherwise be none, or change the content of a whitespace-only text node. There's a balance here between fidelity and prettiness of the output, and the rules since XSLT 1.0 have moved towards requiring processors to preserve greater fidelity.
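To make the rule concrete, here is a minimal sketch in Python (standard-library ElementTree, not Saxon's indenter). The function names `has_mixed_content` and `conservative_indent` are hypothetical, chosen only for this illustration. The sketch is deliberately stricter than the rule as stated above: it adds whitespace only where no text node exists, leaves existing whitespace-only text untouched, and does not indent anything inside an element with mixed content.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(elem):
    """True if elem has child elements and non-whitespace text among them."""
    if len(elem) == 0:
        return False
    texts = [elem.text] + [child.tail for child in elem]
    return any(t is not None and t.strip() for t in texts)


def conservative_indent(elem, level=0, step="  "):
    """Indent in place, adding whitespace only where no text node exists."""
    if len(elem) == 0 or has_mixed_content(elem):
        return  # leaf or mixed content: leave every text node untouched
    if elem.text is None:
        elem.text = "\n" + step * (level + 1)
    for i, child in enumerate(elem):
        conservative_indent(child, level + 1, step)
        if child.tail is None:
            is_last = i == len(elem) - 1
            child.tail = "\n" + step * (level if is_last else level + 1)


# <b> has element-only content; <e> has mixed content because of the "... "
doc = ET.fromstring("<a><b><c/><d/></b><e>... <f/><g/></e></a>")
conservative_indent(doc)
print(ET.tostring(doc, encoding="unicode"))
```

In this sketch the children of `<b>` each land on their own indented line, while `<e>` and everything inside it is emitted exactly as it was read, which is roughly the kind of uneven result described in this ticket.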

Actions #2

Updated by Michael Kay almost 11 years ago

  • Status changed from New to In Progress
  • Assignee set to Michael Kay
Actions #3

Updated by Vadim Peretokin almost 11 years ago

Ahh, sorry, I was running the transformation on another xml and was then embedding the fragment within it. I didn't do a good job of extracting the test case.

I wonder what happened to the comments however - I failed to notice that they are missing in the saxon output. The document embedded was the same - but I'll double-check in a few hours when I have access to it again.

Would embedding the ellipses in a comment work then, as it would be a "white-space only text node" (if comments aren't counted) and then be able to remove/add spaces as necessary?

Actions #4

Updated by Michael Kay almost 11 years ago

Generally comments are quite disruptive to nice indentation.

Actions #5

Updated by Vadim Peretokin almost 11 years ago

Hmm. I've deleted all of the ellipses for testing's sake (they are necessary), but the result is still the same - even with the nodes containing whitespace.

I've attached a better sample case, and this is the output that I get: http://pastebin.com/fubTKNeG

(Also, apologies for discussing this on the bugtracker; I registered on and posted a mail to saxon-open on Friday and the post is still not approved.)

Actions #6

Updated by Michael Kay almost 11 years ago

  • Tracker changed from Bug to Support
  • Status changed from In Progress to Rejected

I'm seeing slightly different output from yours with 9.5; you don't say which release your sample was generated with. However, there are "imperfections" and it's reasonably easy to explain why they exist.

Take the structuredBody element near the end, for example. It would be nice to have the start and end tags vertically aligned with each other. The reason they aren't aligned is as follows: Saxon has indented the start tag relative to the parent element (component). But the content of the structuredBody element consists entirely of whitespace (5 newlines followed by 8 spaces, if I counted right). The serializer is not allowed to change this content: there could be a schema that defines a simpleType with a string-length facet of 13, and adding or removing spaces would make the output invalid against the schema. So the end tag has to be in column 8, regardless of where the start tag was.
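The arithmetic can be checked with a small Python sketch (standard-library ElementTree, not Saxon). The start-tag indent of 4 is an assumed value for illustration, and `serialize_with_start_indent` is a hypothetical helper; the content of 5 newlines plus 8 spaces follows the count given above.

```python
import xml.etree.ElementTree as ET

# Whitespace-only content as described: 5 newlines followed by 8 spaces.
elem = ET.Element("structuredBody")
elem.text = "\n" * 5 + " " * 8


def serialize_with_start_indent(elem, indent):
    """Place the start tag at the given indent; copy the content verbatim."""
    return " " * indent + ET.tostring(elem, encoding="unicode")


def leading_spaces(line):
    """Number of space characters before the first non-space character."""
    return len(line) - len(line.lstrip(" "))


out = serialize_with_start_indent(elem, 4)  # 4 is an assumed indent
lines = out.split("\n")
start_col = leading_spaces(lines[0])   # where the start tag begins
end_col = leading_spaces(lines[-1])    # where the end tag begins

print(len(elem.text), start_col, end_col)
```

Whatever indent is chosen for the start tag, the end tag follows the 8 spaces already present in the content, so the two tags line up only if the start tag also happens to sit at column 8.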

As I mentioned before, Saxon's indenter follows the latest specs by being rather conservative, often at the expense of producing output that is not so pretty. There is no bug here. If you think there's a case where the Saxon output doesn't conform to the spec, please identify it and re-open.

Actions #7

Updated by Vadim Peretokin almost 11 years ago

Gotcha, thanks for the detailed explanation. I think I would appreciate an option to relax this strictness, as in this use case it is impractical and a hindrance (and no schema in the documents specifies a string-length either).

Actions #8

Updated by Michael Kay almost 11 years ago

The Saxon serializer is highly customizable if you want to fine-tune the output. You can register a SerializerFactory with the Configuration; this will typically be a subclass of the standard SerializerFactory. If you override the newXMLIndenter() method, you can substitute your own algorithm for XML indentation, which might of course be an adaptation of the standard one.

Actions #9

Updated by Vadim Peretokin almost 11 years ago

Got it. Thanks!
