Line length in Saxon-HE

Added by Jonathan Sachs almost 4 years ago

I use Saxon-HE in a production process which recently failed. The process generates a markdown file which is then used to generate HTML. The HTML contains many tables, and some of them "broke": content that should have been in tables was composed as ordinary text instead.

When I investigated this, I found that Saxon-HE is inserting line breaks in output that do not appear in input. When a line break occurs in the last column of a table row, the following line contains no column separators. The markdown parser thinks the table is ended and drops out of tabular mode.

Apparently Saxon's "line-length" extension is at fault. In commercial versions, output lines are broken if they exceed the value of the line-length parameter, which defaults to 80 characters. In Saxon-HE the line-length parameter is not implemented, but the default line length is. Thus Saxon-HE inserts line breaks in output -- something that I don't think the XSLT standard would allow -- and there's no way to stop it.

This makes Saxon-HE useless for generating markdown files, and indeed for any application where line ends are significant and correct output may include lines more than 80 characters long.

I think that a half-implemented extension in an edition that is not supposed to support the extension should be considered a bug. I have two questions:

  1. What's the consensus of other users? Is this a bug?

  2. Is there any workaround?


Replies (4)

RE: Line length in Saxon-HE - Added by Michael Kay almost 4 years ago

What output method are you using?

I think that Saxon only wraps lines (a) when using the HTML output method, and (b) when indent="yes" is specified.

That's certainly conformant with the serialization spec: §7.4.3 says where whitespace can and cannot be added, and (except in certain elements) it can always be added adjacent to an existing whitespace character.

https://www.w3.org/TR/xslt-xquery-serialization-31/#HTML_INDENT

It looks as if you are actually generating markdown rather than HTML, so it might be that the output is being consumed by something which attaches semantics to whitespace that would have no significance in HTML as viewed in a browser. (The rules in the W3C serialization spec are basically designed to allow anything that doesn't affect the way the output is rendered in a browser.)

If that's the case, I think the workarounds might be (a) not to use the HTML output method, (b) not to use indent="yes", (c) to use suppress-indentation to suppress indentation for particular elements.
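For illustration, the three workarounds could be expressed as output declarations roughly like the ones below. This is only a sketch: the output names "markdown" and "html-tables" and the element list in suppress-indentation are made up for the example, and it assumes an XSLT 3.0 stylesheet, where suppress-indentation is a standard attribute of xsl:output. Whether (c) actually prevents the wrapping seen here would need to be tested.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Workarounds (a) and (b): avoid the HTML method and do not request
       indentation. The name "markdown" is hypothetical. -->
  <xsl:output name="markdown" method="text" encoding="UTF-8"/>

  <!-- Workaround (c): keep HTML output with indentation, but ask the
       serializer not to indent inside the listed elements. The name
       "html-tables" and the element list are assumptions for this sketch. -->
  <xsl:output name="html-tables" method="html" indent="yes"
              suppress-indentation="td th pre"/>

</xsl:stylesheet>
```

A named output definition like these is selected with the format attribute of xsl:result-document; an unnamed xsl:output applies to the principal result instead.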

This is a bit similar to the problems some people have when generating PHP: none of the serialization methods is really designed for the job. The other thing you could attempt is to customize your own serialization method: the Saxon serialization pipeline is designed very much to make this feasible. For example, you could register a serialization factory that replaces the standard HTMLIndenter with your own. In fact, you could substitute a subclass of HTMLIndenter that simply overrides the getLineLength() method to return a value other than 80.

RE: Line length in Saxon-HE - Added by Jonathan Sachs almost 4 years ago

I appreciate the suggestions, but I'm already using method="text" and indent="no", so changing those settings doesn't help. I haven't tried (c) suppressing indentation for particular elements, but since (b) didn't work, I suspect (c) wouldn't work either.

Writing my own serialization method exceeds the amount of work I want to invest in this, in view of the fact that playing with XSL is a necessary detour from my main job (technical writing), rather than a part of it. I'll wait a few days to see if any other ideas come up. If not I'll recommend that we pop US$75 for a copy of Saxon-PE. I assume that if I set line-length=0, Saxon-PE will suppress line breaks. If not I can set it to some huge number.

RE: Line length in Saxon-HE - Added by Michael Kay almost 4 years ago

I don't believe that the Saxon serializer will ever add any whitespace when method="text" is used, let alone doing word-wrapping to limit the length of lines. There must be something else going on. Please show what you are doing and how it fails.

Setting saxon:line-length should have no effect unless you are using the HTML or XHTML output method.
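As a rough sketch of what that would look like in a stylesheet, assuming Saxon-PE or Saxon-EE (the extension attribute is not implemented in Saxon-HE, as noted at the top of this thread): the value 10000 is just an arbitrarily large number chosen for the example, not a recommended setting.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- saxon:line-length is a Saxon extension serialization attribute.
       It is paired here with the HTML method and indent="yes", the
       combination under which line wrapping is expected to occur.
       The value 10000 is illustrative only. -->
  <xsl:output method="html" indent="yes" saxon:line-length="10000"/>

</xsl:stylesheet>
```

With method="text" this attribute would not be expected to change anything, which is why the wrapping described above probably has some other cause.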

RE: Line length in Saxon-HE - Added by Jonathan Sachs almost 4 years ago

The files are proprietary, and it will take more time to sanitize them than I can afford right now. I'll respond as soon as I can.
