Bug #4702: CR (x13) under node.js - SaxonJS - Saxonica Developer Community

Added by Michael Kay about 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: Michael Kay
Category: Serialization
Start date: 2020-09-01
% Done: 100%
Fixed in JS Release: Saxon-JS 2.1

Description

In a version of the repository cloned under GitHub on Windows, the sample books.xml file has CRLF line endings. A transformation of this file is failing during serialization with the error that codepoint 13 cannot be serialized in the current encoding.

There seem to be two things wrong here. Firstly, the XML parser should normalize line endings: from a quick look through the SAX2 parser code, it doesn't appear to be doing so. Secondly, if codepoint 13 does find its way through to the serializer, it should be serialized as &#xD;.

History

Updated by John Lumley about 4 years ago

I certainly didn't make any modifications for normalisation of line endings. Working on a Windows machine you might have expected me to see this serialisation issue, but I haven't run the samples/books.xsl stylesheet, so may not have encountered the specific encoding.

Updated by Michael Kay about 4 years ago

The code is normalizing line endings in NodeJSPlatform.parseXmlFromString (line 229). This appears to be working OK, so I can't see where the failure comes from.

Furthermore, the XML serializer seems to be correctly serializing CR (x0D) as &#xD;.
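
For reference, the line-ending normalization required of an XML processor (XML 1.0, section 2.11) replaces each CRLF pair, and each CR not followed by LF, with a single LF. A minimal sketch of that rule follows; `normalizeLineEndings` is a hypothetical helper written for illustration, not the actual code in NodeJSPlatform.parseXmlFromString.

```javascript
// Hypothetical helper for illustration only; not the Saxon-JS implementation.
// Applies XML 1.0 section 2.11: CRLF -> LF, and any lone CR -> LF.
function normalizeLineEndings(xmlText) {
    // "\r\n?" matches a CR optionally followed by LF, so both cases
    // are handled in a single pass.
    return xmlText.replace(/\r\n?/g, "\n");
}

const sample = "<a>\r\n<b/>\r</a>";
console.log(JSON.stringify(normalizeLineEndings(sample)));
// prints "<a>\n<b/>\n</a>"
```

After this step no literal CR from the source file reaches the tree; a CR can then only arrive via a character reference or from a non-XML source.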

Updated by Michael Kay about 4 years ago

Note that the error message (Character decimal 13 is not available in the chosen encoding) is produced by Serialize.js encode() method when charRefsAllowed=false. This happens when outputting names, comments, CDATA sections, etc and when disable-output-escaping is set; that is, when converting CR to &#xD; is not an option. I don't know why this is happening, but the error message appears wrong: in such situations codepoint 13 should probably be output as itself (the statement that the character is not available in the chosen encoding is factually incorrect).
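
The behaviour suggested here can be sketched as an ordering of checks. This is only an illustration under stated assumptions: `encodeChar`, `canEncode` and `isAscii` are hypothetical names, and the real Serialize.js encode() method may be structured quite differently. The point is that the "not available in the chosen encoding" error is raised only when the character really cannot be represented and no character reference is permitted.

```javascript
// Hypothetical sketch; names and structure are assumptions, not the Serialize.js API.
// canEncode(cp) reports whether a codepoint exists in the chosen output encoding.
function encodeChar(ch, canEncode, charRefsAllowed) {
    const cp = ch.codePointAt(0);
    // In text and attribute content a CR must survive re-parsing,
    // so it is written as a character reference.
    if (charRefsAllowed && cp === 13) {
        return "&#xD;";
    }
    // Anything the encoding can represent is output as itself. This covers
    // CR in comments, names, CDATA sections and disable-output-escaping.
    if (canEncode(cp)) {
        return ch;
    }
    // Unencodable characters fall back to a character reference if allowed.
    if (charRefsAllowed) {
        return "&#x" + cp.toString(16).toUpperCase() + ";";
    }
    throw new Error("SERE0008: character decimal " + cp +
        " is not available in the chosen encoding");
}

// Example encoding predicate: US-ASCII.
const isAscii = (cp) => cp < 128;

console.log(JSON.stringify(encodeChar("\r", isAscii, true)));  // prints "&#xD;"
console.log(JSON.stringify(encodeChar("\r", isAscii, false))); // prints "\r"
```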

Looking at the books.xsl stylesheet, it does

<xsl:comment><xsl:copy-of select="unparsed-text('books.txt')"/></xsl:comment>

and I suspect this is where the error is coming from. If GitHub has changed books.txt to contain CRLF line endings, this will not be subject to line ending normalization because there is no XML parsing, so we're outputting a comment containing a CR character, and I think this should be output as an unescaped CR.

Looking at the W3C spec (Serialization 3.1) the relevant rule in the lede of §5 is

A consequence of this rule[§] is that certain characters MUST be output as character references, to ensure that they survive the round trip through serialization and parsing. Specifically, CR, NEL and LINE SEPARATOR characters in text nodes MUST be output respectively as "&#xD;", "&#x85;", and "&#x2028;", or their equivalents; while CR, NL, TAB, NEL and LINE SEPARATOR characters in attribute nodes MUST be output respectively as "&#xD;", "&#xA;", "&#x9;", "&#x85;", and "&#x2028;", or their equivalents. In addition, the non-whitespace control characters #x1 through #x1F and #x7F through #x9F in text nodes and attribute nodes MUST be output as character references.

§ "This rule" means the round-tripping rule, i.e. the rule that serialization followed by parsing must leave the document unchanged.
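
The whitespace part of the quoted rule can be sketched as two escaping functions, one for text nodes and one for attribute values. `escapeText` and `escapeAttribute` are hypothetical names used for illustration; the sketch also escapes the usual markup characters, and it leaves out the separate requirement about non-whitespace control characters.

```javascript
// Hypothetical helpers illustrating the quoted rule; not taken from Saxon-JS.
const MARKUP = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };

// Characters that must become character references in text nodes.
const TEXT_REFS = { "\r": "&#xD;", "\u0085": "&#x85;", "\u2028": "&#x2028;" };

// Attribute values additionally protect NL and TAB, which attribute-value
// normalization would otherwise turn into spaces on re-parsing.
const ATTR_REFS = Object.assign({ "\n": "&#xA;", "\t": "&#x9;" }, TEXT_REFS);

function escapeText(s) {
    return s.replace(/[&<>\r\u0085\u2028]/g, (c) => MARKUP[c] || TEXT_REFS[c]);
}

function escapeAttribute(s) {
    return s.replace(/[&<"\r\n\t\u0085\u2028]/g, (c) => MARKUP[c] || ATTR_REFS[c]);
}

console.log(JSON.stringify(escapeText("a\r\nb")));        // prints "a&#xD;\nb"
console.log(JSON.stringify(escapeAttribute("a\r\n\tb"))); // prints "a&#xD;&#xA;&#x9;b"
```

Neither function has anything to say about comment content, which is the gap discussed next.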

So it doesn't seem to say explicitly how a CR character in a comment should be handled. Arguably §5.1.3 applies:

When outputting any other character that is defined in the selected encoding, the character MUST be output using the correct representation of that character in the selected encoding.

But this conflicts with the round-tripping rule.

So I think the spec leaves us a choice between two actions, both of which violate the spec: either output the CR "as is", or raise an error. But the current error message is misleading.

Updated by Michael Kay about 4 years ago

Looking at the Java code, it seems (without actual testing) that a CR appearing in a comment will be output "as is", with no error.

Updated by Michael Kay about 4 years ago

Added XSLT3 test case output-0723. Confirmed that it succeeds on Saxon-J (the CR is serialized as a raw unescaped CR character), and fails with SERE0008 under Saxon-JS.
