Bug #2622
Updated by Michael Kay almost 9 years ago
(Originally posted on XSL-List: The Open Forum on XSL, a mailing list managed by Mulberry Technologies, Inc.)

Dear all,

For various reasons, I need to escape specific characters in the output and also need to produce Unicode normalized to NFC.

Here is my input:
<pre><code class="xml">
<inputText>”; ;</inputText>
<!-- which is: (U+201D U+003B U+0020 U+003B) -->
</code></pre>

Here are the output properties of my stylesheet:
<pre><code class="xml">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"
            omit-xml-declaration="no"
            use-character-maps="unsupported_characters"
            normalization-form="NFC" />
</code></pre>

The character-map definition:
<pre><code class="xml">
<xsl:character-map name="unsupported_characters">
  <xsl:output-character character="“" string="&quot;"/>
  <xsl:output-character character="”" string="&quot;"/>
</xsl:character-map>
</code></pre>

With this template:
<pre><code class="xml">
<xsl:template match="/">
  <shortDescription><xsl:value-of select="inputText"/></shortDescription>
</xsl:template>
</code></pre>

Now the output:
<pre><code class="xml">
<shortDescription>"; ;</shortDescription>
<!-- which is (U+0022 U+037E U+0020 U+003B) -->
</code></pre>

Why is the semicolon (U+003B) immediately after the escaped quote translated into a Greek question mark (U+037E), while the next semicolon is kept? Or rather, the real question is: why is my semicolon converted into a Greek question mark at all?
To go further:

1- If I do not use the character map, the result is:
<pre><code class="xml">
<shortDescription>”; ;</shortDescription>
<!-- which is (U+201D U+003B U+0020 U+003B) -->
</code></pre>

2- If I do not normalize the Unicode (that is, without the normalization-form="NFC" attribute), the result is:
<pre><code class="xml">
<shortDescription>"; ;</shortDescription>
<!-- which is (U+0022 U+003B U+0020 U+003B) -->
</code></pre>

3- The same behavior occurs with other character combinations:

* double comma quotation mark + K
<pre><code class="xml">
<inputText>”K</inputText>
<!-- which is: (U+201D U+004B) -->
<!-- output -->
<shortDescription>"K</shortDescription>
<shortDescription>"K</shortDescription>
<!-- which is (U+0022 U+212A) -->
</code></pre>

* double comma quotation mark + Chinese glyph
<pre><code class="xml">
<inputText>”力</inputText>
<!-- which is: (U+201D U+529B) -->
<!-- output -->
<shortDescription>"力</shortDescription>
<shortDescription>"力</shortDescription>
<!-- which is (U+0022 U+F98A) -->
</code></pre>

4- In addition, Wolfgan L. answered in the same thread:

??Even the solitary identity transformation of the semicolon 0x3B <xsl:output-character character=";" string=";"/> results in a translation to U+037E of all semicolons. Seems to be a bug. SaxonHE 9.6.0.1??

Thanks for the help,
Lancelot
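
A note on the three cases above: in each one, the unexpected output character (U+037E, U+212A, U+F98A) is a character that NFC normalization maps back to the original input character (U+003B, U+004B, U+529B). So the observed substitution goes in the opposite direction from what NFC should do. The following is only a minimal sketch for checking this outside Saxon, using Python's standard `unicodedata` module. The names `OBSERVED`, `describe` and `is_reverse_nfc` are made up for illustration, and the script makes no claim about how Saxon implements normalization internally.

```python
import unicodedata

# Pairs taken from the report: (character in the input, character seen in the output).
OBSERVED = [
    ("\u003B", "\u037E"),  # semicolon -> Greek question mark
    ("\u004B", "\u212A"),  # Latin capital K -> Kelvin sign
    ("\u529B", "\uF98A"),  # CJK ideograph -> CJK compatibility ideograph
]


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def is_reverse_nfc(src, out):
    """True if `out` differs from `src`, `src` is already in NFC,
    and NFC maps `out` back to `src` (hypothetical helper)."""
    return (
        out != src
        and unicodedata.normalize("NFC", src) == src
        and unicodedata.normalize("NFC", out) == src
    )


if __name__ == "__main__":
    for src, out in OBSERVED:
        print(describe(src), "->", describe(out),
              "| reverse of NFC:", is_reverse_nfc(src, out))
```

If the check holds for all three pairs, a correct NFC serialization could never emit U+037E, U+212A or U+F98A, since none of them is itself in NFC.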