Bug #2622

Updated by Michael Kay about 8 years ago

(Originally posted to XSL-List: The Open Forum on XSL, a mailing list managed by Mulberry Technologies, Inc.) 

 Dear all, 

 For some reason, I need to escape specific characters in the output and also produce Unicode normalized to NFC. 

 Here is my input : 

 <pre><code class="xml"> 
 <inputText>”; ;</inputText> <!-- which is : (U+201D U+003B U+0020 U+003B) --> 
 </code></pre> 

 Here are the output properties of my stylesheet: 

 <pre><code class="xml"> 
 <xsl:output method="xml" version="1.0" encoding="UTF-8" 
     indent="yes" omit-xml-declaration="no" 
     use-character-maps="unsupported_characters" 
     normalization-form="NFC"/> 
 </code></pre> 

 The character-map definition :  

 <pre><code class="xml"> 
 <xsl:character-map name="unsupported_characters"> 
     <xsl:output-character character="&#8220;" string="&quot;"/> 
     <xsl:output-character character="&#8221;" string="&quot;"/> 
 </xsl:character-map> 
 </code></pre> 

 With this template :  

 <pre><code class="xml"> 
 <xsl:template match="/"> 
     <shortDescription><xsl:value-of select="inputText"/></shortDescription> 
 </xsl:template> 
 </code></pre> 
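
 Put together, the fragments above correspond to a stylesheet along the following lines. This is a minimal sketch for reproduction: the @xsl:stylesheet@ wrapper, the @version="2.0"@ attribute and the sample input shown in the comment are assumptions, since the original post does not show them. 

 <pre><code class="xml"> 
 <?xml version="1.0" encoding="UTF-8"?> 
 <!-- Assumed input document: <inputText>&#x201D;; ;</inputText> --> 
 <xsl:stylesheet version="2.0" 
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 

     <!-- Serialization: character map plus NFC normalization, as in the report --> 
     <xsl:output method="xml" version="1.0" encoding="UTF-8" 
         indent="yes" omit-xml-declaration="no" 
         use-character-maps="unsupported_characters" 
         normalization-form="NFC"/> 

     <!-- Map left and right double quotation marks to a plain U+0022 --> 
     <xsl:character-map name="unsupported_characters"> 
         <xsl:output-character character="&#8220;" string="&quot;"/> 
         <xsl:output-character character="&#8221;" string="&quot;"/> 
     </xsl:character-map> 

     <xsl:template match="/"> 
         <shortDescription><xsl:value-of select="inputText"/></shortDescription> 
     </xsl:template> 

 </xsl:stylesheet> 
 </code></pre> 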

 Now the output : 

 <pre><code class="xml"> 
 <shortDescription>"; ;</shortDescription> <!-- which is (U+0022    U+037E    U+0020    U+003B) --> 
 </code></pre> 

 Why is the semicolon (U+003B) that immediately follows the escaped quote translated into a Greek question mark (U+037E), while the next semicolon is kept? 
 More fundamentally, why is my semicolon converted into a Greek question mark at all? 
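
 One way to narrow this down, sketched here as a suggestion rather than something taken from the original thread, is to do the replacement and the normalization inside the stylesheet and print the resulting code points. This takes the serializer's @normalization-form@ handling out of the picture. The @codepoints@ element name is hypothetical; @translate()@, @normalize-unicode()@ and @string-to-codepoints()@ are standard XPath 2.0 functions. By the Unicode definition of NFC, U+0022 U+003B U+0020 U+003B is already normalized, so the list 34 59 32 59 would be the expected result; what a particular Saxon release actually prints is not something this report establishes. 

 <pre><code class="xml"> 
 <?xml version="1.0" encoding="UTF-8"?> 
 <xsl:stylesheet version="2.0" 
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 

     <!-- No character map and no normalization-form here: 
          the serializer should write the text unchanged --> 
     <xsl:output method="xml" encoding="UTF-8" indent="yes"/> 

     <xsl:template match="/"> 
         <!-- Replace both curly quotes with U+0022, normalize to NFC, 
              then list the decimal code points separated by spaces --> 
         <codepoints> 
             <xsl:value-of separator=" " 
                 select="string-to-codepoints( 
                             normalize-unicode( 
                                 translate(inputText, '&#8220;&#8221;', '&quot;&quot;'), 
                                 'NFC'))"/> 
         </codepoints> 
     </xsl:template> 

 </xsl:stylesheet> 
 </code></pre> 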

 To go further : 

 1- If I do not use the character map, the result is: 

 <pre><code class="xml"> 
 <shortDescription>”; ;</shortDescription> <!-- which is (U+201D U+003B U+0020 U+003B) --> 
 </code></pre> 

 2- If I do not normalize the output (that is, without the normalization-form="NFC" attribute), the result is: 

 <pre><code class="xml"> 
 <shortDescription>"; ;</shortDescription> <!-- which is (U+0022 U+003B U+0020 U+003B) --> 
 </code></pre> 

 3- The same behavior occurs with other character combinations: 
 * double comma quotation mark + K 
 <pre><code class="xml"> 
 <inputText>”K</inputText> <!-- which is : (U+201D U+004B) --> 
 <!-- output --> 
 <shortDescription>"K</shortDescription> <!-- which is (U+0022    U+212A) --> 
 </code></pre> 

 * double comma quotation mark + Chinese glyph 
 <pre><code class="xml"> 
 <inputText>”力</inputText> <!-- which is : (U+201D U+529B) --> 
 <!-- output --> 
 <shortDescription>"力</shortDescription> <!-- which is (U+0022    U+F98A) --> 
 </code></pre> 


 4- In addition, Wolfgang L. answered in the same thread: 

 ??Even the solitary identity transformation of the semicolon 0x3B 
      <xsl:output-character character=";" string=";"/> 
 results in a translation to U+037E of all semicolons. Seems to be a bug. 
 SaxonHE 9.6.0.1?? 


 Thanks for the help 

 Lancelot
