Support #3903

closed

Line-length feature breaks preformatted texts in HTML output

Added by Matthias Schäfer about 6 years ago. Updated almost 3 years ago.

Status: Won't fix
Priority: Normal
Assignee:
Category: Serialization
Sprint/Milestone: -
Start date: 2018-09-12
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

I have been working on an XSL transformation for XHTML output which includes some preformatted text. A few days ago I realized that Saxon has a built-in feature that automatically breaks lines in that output mode. Obviously this feature is incompatible with preformatted text, as it inserts arbitrary line breaks. After doing some research, several questions about the feature came to mind.

  1. Is Saxonica aware of the fact that enabling a feature by default that alters the transformed data silently is probably a bad idea? One would expect that such a feature must explicitly be turned on. One would also expect that Saxon gives you a conspicuous warning once it messes with your data in that way.

  2. Is it correct that I can get rid of that harmful feature only by buying and paying for the Professional or Enterprise Edition?

  3. What is the proper way to disable the feature on an element-by-element basis? The person who had the idea to enable the feature by default surely also had an idea of what to do if I want to use the automatic line breaks wherever possible, but not in places where they would break my preformatted text, e.g. in an HTML pre element or when using the CSS white-space pre formatting.

Regards


Files

1.png (5.43 KB), Matthias Schäfer, 2018-09-13 10:06
suppress-div-indentation-xhtml-method1.xsl (390 Bytes), Martin Honnen, 2018-09-13 14:46
suppress-div-indentation-xml-method1.xsl (388 Bytes), Martin Honnen, 2018-09-13 14:46
suppress-indentation-xml-method2.xq (277 Bytes), Martin Honnen, 2018-09-13 14:46
suppress-indentation-xhtml-method2.xq (279 Bytes), Martin Honnen, 2018-09-13 14:46
unindented-xhtml-doc1.xml (198 Bytes), sample XHTML input doc, Martin Honnen, 2018-09-13 14:46
unindented-xhtml-frag1.xml (142 Bytes), sample XHTML fragment, Martin Honnen, 2018-09-13 14:46
Data.xml (540 Bytes), Matthias Schäfer, 2018-09-24 10:21
Transform.xsl (698 Bytes), Matthias Schäfer, 2018-09-24 10:21
Actions #1

Updated by Michael Kay about 6 years ago

Thanks for raising this. Yes, I'm aware that the line wrapping occasionally causes inconvenience. However, I think that Saxon is behaving in a way that is fully conformant with the W3C serialization specification. For the HTML and XHTML output methods, this states that indentation whitespace can be added adjacent to any whitespace character in a text node, except in a "formatted element" (pre, script, style, title, and textarea).

See https://www.w3.org/TR/xslt-xquery-serialization-31/#XHTML_INDENT, §7.4.3, first and third bullets.

The thinking in the spec here is that in ordinary HTML and XHTML text nodes, multiple whitespace characters are equivalent to a single whitespace character as far as the browser is concerned, and therefore indentation makes no difference to the rendition.

You say that the indentation is disrupting "preformatted text". How is the "preformatted text" tagged, if not with one of the tags listed above?

If you want to disable the feature for selected elements you should be able to do so using the suppress-indentation output parameter.
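
A minimal sketch of what this could look like in a stylesheet, assuming the affected elements are XHTML pre and div elements. The element names and the xhtml prefix are illustrative and not taken from the reporter's stylesheet. The parameter takes a list of element names as QNames, so the names have to be in the same namespace as the elements in the output. As noted later in this thread, Saxon 9.8 does not honour the parameter for the html and xhtml methods (bug #3841, fixed in 9.9).

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Indent the output, but add no whitespace inside the listed elements.
       The names are illustrative; the prefix must be bound to the namespace
       that the output elements are actually in. -->
  <xsl:output method="xhtml" indent="yes"
              suppress-indentation="xhtml:pre xhtml:div"/>

  <!-- Identity transform: copy the input through unchanged. -->
  <xsl:mode on-no-match="shallow-copy"/>

</xsl:stylesheet>
```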

Another workaround would be to use the XML output method. You could even serialize as XML with indentation, and then re-serialize this as XHTML without indentation.
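
The second pass of such a two-stage approach might look like the sketch below. It assumes the first pass has already written its result with method="xml" and indent="yes" to an intermediate file, which is then used as the input here. Because this stylesheet declares no whitespace stripping, the indentation from the first pass survives as ordinary text nodes, and indent="no" keeps the XHTML serializer from adding any whitespace of its own.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Second pass: re-serialize the already-indented XML as XHTML
       without letting the serializer add or move any whitespace. -->
  <xsl:output method="xhtml" indent="no"/>

  <!-- Identity transform: copy every node, including whitespace text nodes. -->
  <xsl:mode on-no-match="shallow-copy"/>

</xsl:stylesheet>
```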

Actions #2

Updated by Matthias Schäfer about 6 years ago

Thanks for your reply.

In your logic of tagging preformatted text by element name (pre, script etc.) you are missing a crucial part of HTML: CSS. I am using CSS, namely the white-space property, to mark the elements containing preformatted text. So there must be a way to stop Saxon from applying the automatic wrapping for each element individually. But even when I mark the affected elements by name only, I have no success. I placed my content in pre elements and added suppress-indentation="pre" to the xsl:output element of the stylesheet. I still get the result shown in the attached screenshot (1.png). Note the wrapped line; the original text contains no wrapped line.

Regarding your proposal of re-serialization: I am not yet ready to put much effort into changing the XSL processing on my side as long as it seems to me that the fault lies with Saxon's clumsy line-length feature, which is enabled by default and silently, cannot exclude individual elements, and cannot even be disabled in the Home Edition.

Actions #3

Updated by Michael Kay about 6 years ago

Our general policy is that Saxon-HE offers conformance to the W3C specifications, and Saxon-PE/EE offer extensions that go beyond the W3C specifications where we decide that extra functionality will be needed by some users. I agree with you that the indentation rules in the W3C spec fail to take account of the possibility of preformatted output being requested at the CSS level (except to the extent that indentation can be suppressed using the suppress-indentation attribute in XSLT 3.0; perhaps this is why it was added, I don't recall). Also, of course, you have the option to not use indentation at all: on the rare occasions when you actually need to look at the HTML at "source" level, many editors and browsers provide you with the ability to indent it at viewing time.

When using the XHTML output method, the pre element is considered a preformatted element, and suppresses line-wrapping, if

(a) it is in the XHTML namespace, or

(b) it is in no namespace, and <xsl:output html-version="5.0"/> is specified.

In your screen snapshot, it's not clear whether these conditions are satisfied.
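
Condition (a) can be sketched as follows. The source element name listing is hypothetical; the point is only that the default namespace declaration on the stylesheet puts the literal pre element into the XHTML namespace.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <xsl:output method="xhtml" indent="yes"/>

  <!-- "listing" is a hypothetical source element holding preformatted text. -->
  <xsl:template match="listing">
    <!-- The default namespace declared above applies to this literal
         result element, so it is written as an XHTML pre element. -->
    <pre><xsl:value-of select="."/></pre>
  </xsl:template>

</xsl:stylesheet>
```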

Actions #4

Updated by Matthias Schäfer about 6 years ago

Thank you for your reply.

As you mentioned yourself, the specifications, and thus Saxon, fail to take certain use cases into account. So it is quite irrelevant to me whether Saxon goes beyond the specifications if at the same time it fails to meet some real practical needs. As I mentioned before, I also had no success using suppress-indentation to disable automatic wrapping in elements with certain names. What might I be doing wrong here?

I don't have the option to use no indentation at all, as I want to have the chance to compare the output with diff tools.

Your first solution to make Saxon consider pre a preformatted element, putting it in the XHTML namespace, works. Your second solution, adding the html-version attribute to the xsl:output element, does not work. Unfortunately using pre elements does not solve my issue as I only tried pre elements to see if disabling automatic wrapping works in general, but in my case I need to use CSS to mark preformatted texts rather than pre elements. Do you have any other ideas?

I am still struggling to understand why the automatic line wrapping is enabled by default in Saxon if at the same time it is not unequivocally capable of recognizing all preformatted text (which it can't because there is always the chance that the document uses CSS for that purpose). Furthermore, I don't even have the option to completely disable this nowhere near perfect feature in the Home Edition, why? From my point of view, this feature should be enablable only if you have the chance to disable it. And I would go even further, it should only be enabled if explicitly requested by the user, never by default.

Actions #5

Updated by Michael Kay about 6 years ago

I'm surprised that you should be using diff tools to compare serialized and indented output. I always switch indentation off when comparing output, because indentation is inherently implementation-defined and unpredictable. In fact, if you're comparing files you should really be putting it through a canonicalizer, to deal with issues such as variations in attribute order (alternatively, compare the parsed trees using deep-equal()).
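
A tree-level comparison along these lines could be sketched as below. The file names are hypothetical and are resolved relative to the stylesheet; the stylesheet is meant to be run with xsl:initial-template as the entry point. Note that deep-equal() ignores attribute order but treats whitespace-only text nodes as significant, so both files would need to be produced without indentation (or have whitespace stripped) for the result to be meaningful.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical file names; override them as stylesheet parameters. -->
  <xsl:param name="expected" select="'expected.xml'"/>
  <xsl:param name="actual" select="'actual.xml'"/>

  <!-- Parse both documents and compare the trees, not the serialized text. -->
  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="if (deep-equal(doc($expected), doc($actual)))
                          then 'same' else 'different'"/>
  </xsl:template>

</xsl:stylesheet>
```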

I'm not sure why suppress-indentation didn't work for you (you didn't supply any details) but my guess would be that it's a namespace issue.

The W3C specs give considerable discretion to implementors on the details of serialization, while also imposing many constraints. There are always going to be users who don't like the choices we make, and we have to live with that; where we see a need (or, if you like, an opportunity for added value), we provide additional control over serialization using extension attributes.

The product is also designed to make serialization highly customizable at the Java level. If you don't like the choices Saxon makes, you can usually override them by customizing the serialization code at the Java level. By nominating your own SerializerFactory that subclasses the standard SerializerFactory, for example, you could override the newXHTMLIndenter() method to supply your own subclass of HTMLIndenter, and in that subclass you could override the method getLineLength() to return Integer.MAX_VALUE which would effectively suppress line wrapping for all elements (untested suggestion).

Actions #6

Updated by Martin Honnen about 6 years ago

Mike, I have tried to use suppress-indentation with XSLT and XQuery, with both method="xml" and method="xhtml". With Saxon it seems to work with method="xml" but not with method="xhtml". With XSLT it works in Altova with both methods, and with XQuery it works with both methods in Altova as well as BaseX.

I used Saxon 9.8.0.14 HE from the command line to test. The code simply copies the input files (one an unindented XHTML document, the other an unindented XHTML fragment) through to the output, but with indentation as a serialization option.

For Saxon I get the output

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Test</title>
   </head>
   <body>
      <section>
         <h1>Test</h1>
         <div><p>...</p><p>...</p></div>
      </section>
   </body>
</html>

for method xml and declare option output:suppress-indentation 'xhtml:div' while for method xhtml I get

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Test</title></head>
   <body>
      <section>
         <h1>Test</h1>
         <div>
            <p>...</p>
            <p>...</p>
         </div>
      </section>
   </body>
</html>
Actions #7

Updated by Matthias Schäfer about 6 years ago

Michael Kay wrote:

I'm surprised that you should be using diff tools to compare serialized and indented output. I always switch indentation off when comparing output, because indentation is inherently implementation-defined and unpredictable.

Without indentation you also don't get any line breaks, except for explicit line breaks, so you end up with almost all of the output on the same line. How do you compare that with a diff tool?

Michael Kay wrote:

In fact, if you're comparing files you should really be putting it through a canonicalizer, to deal with issues such as variations in attribute order (alternatively, compare the parsed trees using deep-equal()).

I don't want to compare the output with a previous output. I am using the output as an input to another tool and I want to be able to compare its output with the output of Saxon. Of course I could always add another tool to the tool chain or try to modify the other tool, but compared to simply being able to disable the unwanted, unrequested automatic line wrapping, this would be very disproportionate.

Michael Kay wrote:

I'm not sure why suppress-indentation didn't work for you (you didn't supply any details) but my guess would be that it's a namespace issue.

I added suppress-indentation="my-element-name" to the xsl:output element and named my elements my-element-name.

Michael Kay wrote:

The W3C specs give considerable discretion to implementors on the details of serialization, while also imposing many constraints. There are always going to be users who don't like the choices we make, and we have to live with that; where we see a need (or, if you like, an opportunity for added value), we provide additional control over serialization using extension attributes.

I would like to know the background of this choice. Suppose you are using a tool whose task is to clean up your hard disk by removing unwanted data. Among all the options you can set, there is an option to delete all empty directories. This option is on by default, it cannot be turned off in the edition you use, and it is also silent when deleting an empty directory. Though this option may be useful for many users, you can imagine there are some users who will justifiably complain about it. I would just like to know the reasoning behind providing such a feature in the way Saxon does.

Michael Kay wrote:

The product is also designed to make serialization highly customizable at the Java level. If you don't like the choices Saxon makes, you can usually override them by customizing the serialization code at the Java level. By nominating your own SerializerFactory that subclasses the standard SerializerFactory, for example, you could override the newXHTMLIndenter() method to supply your own subclass of HTMLIndenter, and in that subclass you could override the method getLineLength() to return Integer.MAX_VALUE which would effectively suppress line wrapping for all elements (untested suggestion).

Aside from the fact that I am using the .NET variant, this solution would also require much more effort on my side compared to just disabling that unwanted, unrequested feature. I am using Saxon from the command line, and one could justifiably argue that it is unreasonable to change the whole tool chain now just to be able to disable that feature.

Actions #8

Updated by Michael Kay about 6 years ago

  • Category set to Serialization
  • Status changed from New to Won't fix
  • Assignee set to Michael Kay

I'm afraid I'm going to close this with a "won't fix". I'm sorry that the product doesn't do what you would like it to, but we can't satisfy everyone all of the time. There are workarounds in HE that involve effort on your part, and in PE that involve money on your part, but the product is working as designed and I don't accept that the design is wrong.

The W3C specs get very carefully reviewed by working groups before publication, and there's a long public consultation during which all comments are carefully considered. If no-one during that process raised the objection that you are now raising, then it suggests that you have a minority requirement.

Actions #9

Updated by Matthias Schäfer about 6 years ago

What about the issue that suppress-indentation does not seem to work with xhtml output, as Mr Honnen already pointed out? I tried it with the attached files. As you can see, when you transform Data.xml with Transform.xsl you end up with indented div elements, whether or not they are in the XHTML namespace. If you change the output method to xml, the indentation is suppressed as expected. For the moment I would be happy if at least suppress-indentation worked.

Actions #10

Updated by Michael Kay about 6 years ago

The issue of suppress-indentation not working for HTML and XHTML is covered by bug #3841. We've fixed it for 9.9 (coming out real soon) but aren't planning to retrofit the fix to 9.8.

Actions #11

Updated by John Ulric almost 3 years ago

For reference: there's a related issue, #5018, about suppress-indentation not working with word-wrapping of HTML and XHTML output, fixed in 10.6, which I found helpful.
