CR (x0D) under node.js
In a copy of the repository cloned from GitHub on Windows, the sample books.xml file has CRLF line endings. A transformation of this file fails during serialization with the error that codepoint 13 cannot be serialized in the current encoding.
There seem to be two things wrong here. Firstly, the XML parser should normalize line endings: from a quick look through the SAX2 parser code, it doesn't appear to be doing so. Secondly, if codepoint 13 does find its way through to the serializer, it should be serialized as &#xD;.
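For reference, the line-ending normalization that XML 1.0 (section 2.11) expects of a parser can be sketched as below. This is only an illustration: `normalizeLineEndings` is a hypothetical helper, not a function in the SAX2 parser or Saxon-JS, and it assumes the whole entity is available as a string before parsing.

```javascript
// Hypothetical helper (not part of Saxon-JS): apply XML 1.0 section 2.11
// end-of-line handling to raw entity text before it is parsed.
// Every CR LF pair, and every CR not followed by LF, becomes a single LF.
function normalizeLineEndings(text) {
  return text.replace(/\r\n?/g, "\n");
}

const raw = "<book>\r\n  <title>XSLT</title>\r\n</book>\r";
const normalized = normalizeLineEndings(raw);
console.log(normalized.includes("\r")); // false
```

Because this runs on the raw text, a CR written in the source as the character reference &#xD; is not affected and still reaches the data model.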
Updated by John Lumley over 3 years ago
I certainly didn't make any modifications for normalisation of line endings. Working on a Windows machine you might have expected me to see this serialisation issue, but I haven't run the samples/books.xsl stylesheet, so may not have encountered the specific encoding.
Updated by Michael Kay over 3 years ago
Note that the error message (Character decimal 13 is not available in the chosen encoding) is produced by the Serialize.js encode() method when charRefsAllowed=false. This happens when outputting names, comments, CDATA sections, etc., and when disable-output-escaping is set; that is, when converting CR to &#xD; is not an option. I don't know why this is happening, but the error message appears wrong: in such situations codepoint 13 should probably be output as itself (the statement that the character is not available in the chosen encoding is factually incorrect).
Looking at the books.xsl stylesheet, it does
and I suspect this is where the error is coming from. If GitHub has changed books.txt to contain CRLF line endings, this will not be subject to line ending normalization because there is no XML parsing, so we're outputting a comment containing a CR character, and I think this should be output as an unescaped CR.
Looking at the W3C spec (Serialization 3.1) the relevant rule in the lede of §5 is
A consequence of this rule[§] is that certain characters MUST be output as character references, to ensure that they survive the round trip through serialization and parsing. Specifically, CR, NEL and LINE SEPARATOR characters in text nodes MUST be output respectively as "&#xD;", "&#x85;", and "&#x2028;", or their equivalents; while CR, NL, TAB, NEL and LINE SEPARATOR characters in attribute nodes MUST be output respectively as "&#xD;", "&#xA;", "&#x9;", "&#x85;", and "&#x2028;", or their equivalents. In addition, the non-whitespace control characters #x1 through #x1F and #x7F through #x9F in text nodes and attribute nodes MUST be output as character references.
§ "This rule" means the round-tripping rule, i.e. the rule that serialization followed by parsing must leave the document unchanged.
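A minimal sketch of what the quoted rule asks for in text and attribute nodes follows. The function names `charRef`, `escapeText` and `escapeAttribute` are hypothetical and are not taken from Serialize.js; the sketch assumes double-quoted attribute values and ignores encoding and character maps.

```javascript
// Hypothetical helpers illustrating the quoted rule; not the Saxon-JS code.

// Format one character as a hexadecimal character reference, e.g. "\r" -> "&#xD;".
function charRef(ch) {
  return "&#x" + ch.codePointAt(0).toString(16).toUpperCase() + ";";
}

// Text nodes: CR, NEL (x85), LINE SEPARATOR (x2028) and the non-whitespace
// control characters become character references; TAB and LF stay as they are.
function escapeText(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/[\u0001-\u0008\u000B-\u001F\u007F-\u009F\u2028]/g, charRef);
}

// Attribute nodes: TAB and LF must also become character references, because
// attribute value normalization would otherwise turn them into spaces.
function escapeAttribute(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;")
    .replace(/[\u0001-\u001F\u007F-\u009F\u2028]/g, charRef);
}

console.log(escapeText("line1\r\nline2"));      // line1&#xD; then a real newline, then line2
console.log(escapeAttribute("a\tb\r\nc"));      // a&#x9;b&#xD;&#xA;c
```

Nothing comparable is available for comments, since a character reference inside a comment is not recognized as a reference, which is why the rule cannot simply be extended to them.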
So it doesn't seem to say explicitly how a CR character in a comment should be handled. Arguably §5.1.3 applies:
When outputting any other character that is defined in the selected encoding, the character MUST be output using the correct representation of that character in the selected encoding.
But this conflicts with the round-tripping rule.
So I think the spec leaves us a choice between two actions, both of which violate the spec: either output the CR "as is", or raise an error. But the current error message is misleading.
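The two options can be sketched as follows. This is not the actual Serialize.js encode() method: `encodeChar`, the `canEncode` callback and the `onUnescapableCR` option are all hypothetical names introduced for illustration. The point is only that "not available in the chosen encoding" is reserved for characters the encoding really cannot represent, while an unescapable CR gets either pass-through or a message that names the real problem.

```javascript
// Hypothetical sketch; names and options are assumptions, not the Serialize.js API.
function encodeChar(ch, { charRefsAllowed, canEncode, onUnescapableCR = "as-is" }) {
  const cp = ch.codePointAt(0);

  if (!canEncode(cp)) {
    // The character is missing from the output encoding.
    if (charRefsAllowed) {
      return "&#x" + cp.toString(16).toUpperCase() + ";";
    }
    throw new Error(`Character decimal ${cp} is not available in the chosen encoding`);
  }

  if (cp === 13) {
    // CR is encodable, but would be normalized away when the output is re-parsed.
    if (charRefsAllowed) {
      return "&#xD;";
    }
    if (onUnescapableCR === "error") {
      throw new Error(
        "CR (codepoint 13) cannot be written as a character reference here, " +
        "so it would not survive a round trip"
      );
    }
    return ch; // option 1: output the CR as is
  }

  return ch;
}

const isAscii = (cp) => cp < 128; // stand-in for an encoding check

console.log(JSON.stringify(encodeChar("\r", { charRefsAllowed: false, canEncode: isAscii }))); // "\r"
console.log(encodeChar("\r", { charRefsAllowed: true, canEncode: isAscii }));                  // &#xD;
```

Passing `onUnescapableCR: "error"` selects the second option and raises an error whose text describes the round-tripping problem rather than the encoding.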