Bug #2622
closed
Combining use-character-maps and normalization-form="NFC" attributes produces unwanted output
Description
(Originally posted on XSL-List: The Open Forum on XSL, a mailing list managed by Mulberry Technologies, Inc.)
Dear all,
For various reasons, I need to escape specific characters in the output and also produce Unicode normalized to NFC.
Here is my input:
<inputText>”; ;</inputText> <!-- which is : (U+201D U+003B U+0020 U+003B) -->
Here are the output properties of my stylesheet:
<xsl:output method="xml" version="1.0" encoding="UTF-8"
indent="yes" omit-xml-declaration="no"
use-character-maps="unsupported_characters"
normalization-form="NFC"
/>
The character-map definition :
<xsl:character-map name="unsupported_characters">
<xsl:output-character character="“" string="&quot;"/>
<xsl:output-character character="”" string="&quot;"/>
</xsl:character-map>
With this template :
<xsl:template match="/ ">
<shortDescription><xsl:value-of select=" inputText "/></shortDescription>
</xsl:template>
Now the output :
<shortDescription>"; ;</shortDescription> <!-- which is (U+0022 U+037E U+0020 U+003B) -->
Why is the semicolon (U+003B) immediately after the escaped quote translated into a Greek question mark (U+037E), while the next semicolon is kept?
But the real question is: why is my semicolon turned into a Greek question mark at all?
To go further :
1- If I do not use character-map the result is :
<shortDescription>”; ;</shortDescription> <!-- which is (U+201D U+003B U+0020 U+003B) -->
2- If I do not normalize the Unicode (that is, without the normalization-form="NFC" attribute) the result is :
<shortDescription>"; ;</shortDescription> <!-- which is (U+0022 U+003B U+0020 U+003B) -->
3- The same behavior occurs with other character combinations :
- double comma quotation mark + K
<inputText>”K</inputText> <!-- which is : (U+201D U+004B) -->
<!-- output -->
<shortDescription>"K</shortDescription> <!-- which is (U+0022 U+212A) -->
- double comma quotation mark + Chinese glyph
<inputText>”力</inputText> <!-- which is : (U+201D U+529B) -->
<!-- output -->
<shortDescription>"力</shortDescription> <!-- which is (U+0022 U+F98A) -->
4- In addition, Wolfgang L. answered in the same thread :
"Even the solitary identity transformation of the semicolon 0x3B
<xsl:output-character character=";" string=";"/>
results in a translation to U+037E of all semicolons. Seems to be a bug.
Saxon-HE 9.6.0.1"
Thanks for the help
Lancelot
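For anyone reproducing this, a small helper for checking which code points actually end up in the output may be useful. The sketch below is not part of the original report: the class name CodePointDump is hypothetical, and it uses only the JDK's java.text.Normalizer, not Saxon. It dumps a string in the U+XXXX notation used above and shows that NFC maps U+037E to U+003B, never the other way round, which is why the observed output cannot be a legitimate result of normalization.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Saxon: prints code points as U+XXXX.
final class CodePointDump {

    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The serialization one would expect: mapped quote, then untouched text.
        String expected = "\"; ;";
        // The serialization reported above: U+037E replaces the first semicolon.
        String observed = "\"\u037E ;";

        System.out.println("expected: " + dump(expected));
        System.out.println("observed: " + dump(observed));

        // U+037E has a singleton canonical decomposition to U+003B, so NFC
        // turns the Greek question mark into a semicolon, not the reverse.
        String nfc = Normalizer.normalize("\u037E", Normalizer.Form.NFC);
        System.out.println("NFC(U+037E): " + dump(nfc));
    }
}
```

The same check applies to the other pairs in item 3: U+212A normalizes to U+004B and U+F98A to U+529B under NFC.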
Updated by Michael Kay almost 9 years ago
- Description updated (diff)
- Category set to Serialization
- Status changed from New to In Progress
- Assignee set to Michael Kay
Will investigate. I have reproduced your results on Saxon-EE 9.7.0.2.
Note: the spec states that character mapping should be applied first, and that normalization should then be applied only to those characters that were not affected by character mapping.
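To make the ordering concrete, here is a minimal sketch of that rule using only the JDK. It is an illustration, not Saxon's serializer: the class MapThenNormalize and its serialize method are hypothetical names, and the character map is modeled as a plain Map from code point to replacement string. Replacement strings are copied verbatim, and only the runs of unmapped characters go through NFC.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical illustration of "map first, normalize only unmapped text".
final class MapThenNormalize {

    /**
     * Applies the character map, then NFC-normalizes only the runs of
     * characters the map did not touch. Replacement strings bypass
     * normalization entirely.
     */
    static String serialize(String text, Map<Integer, String> charMap) {
        StringBuilder out = new StringBuilder();
        StringBuilder unmapped = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = charMap.get(cp);
            if (replacement == null) {
                unmapped.appendCodePoint(cp); // collect until the next mapped character
            } else {
                // Normalize and flush the pending unmapped run, then copy the
                // replacement string as-is.
                out.append(Normalizer.normalize(unmapped, Normalizer.Form.NFC));
                unmapped.setLength(0);
                out.append(replacement);
            }
        });
        out.append(Normalizer.normalize(unmapped, Normalizer.Form.NFC));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> quotes = Map.of(0x201C, "\"", 0x201D, "\"");
        // Input from the report: U+201D U+003B U+0020 U+003B
        String result = serialize("\u201D; ;", quotes);
        result.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

One consequence of normalizing each unmapped run on its own is that a combining mark immediately following a mapped character is not composed with the replacement text, which matches the rule that mapped output is left alone.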
Updated by Michael Kay almost 9 years ago
I suspect we have no test cases for using character maps in conjunction with Unicode normalization and that the combination is simply not working.
After applying character maps we insert NULL characters into the character string to mark sections of the string in which output escaping should be disabled (and Unicode normalization should be disabled). But the Unicode normalizer does not appear to recognize these markers; it treats them as ordinary characters to be normalized, which produces the apparently illogical results.
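As a rough illustration of what a marker-aware normalizer needs to do, consider the sketch below. It is not the Saxon code: the class MarkerAwareNormalizer is a hypothetical name, and it assumes that each protected section is bracketed by a pair of NUL characters, which is a guess at the marker convention rather than something stated here. Text between a pair of markers is copied unchanged, the markers themselves are passed through for later stages, and everything else is normalized to NFC with the JDK's java.text.Normalizer.

```java
import java.text.Normalizer;

// Hypothetical sketch; assumes NUL characters come in pairs that bracket
// each section produced by character mapping.
final class MarkerAwareNormalizer {

    private static final char MARKER = '\0';

    /**
     * NFC-normalizes text outside marker pairs and copies text inside them
     * unchanged. Markers are preserved so later stages can still see them.
     */
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean protectedRegion = false;
        int start = 0;
        for (int i = 0; i <= s.length(); i++) {
            boolean atEnd = (i == s.length());
            if (atEnd || s.charAt(i) == MARKER) {
                String run = s.substring(start, i);
                // The markers never reach the normalizer, so they cannot be
                // mistaken for ordinary characters.
                out.append(protectedRegion
                        ? run
                        : Normalizer.normalize(run, Normalizer.Form.NFC));
                if (!atEnd) {
                    out.append(MARKER);
                    protectedRegion = !protectedRegion;
                }
                start = i + 1;
            }
        }
        return out.toString();
    }
}
```

With this approach, a string such as NUL, quote, NUL followed by "; ;" comes back unchanged, since the quote is protected and the remaining text is already in NFC.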
Updated by Michael Kay almost 9 years ago
- Status changed from In Progress to Resolved
- Applies to branch 9.4, 9.5, 9.6, 9.7 added
- Fix Committed on Branch 9.6, 9.7 added
I have committed a patch (to UnicodeNormalizer.java) on the 9.6 and 9.7 branches.
Updated by Michael Kay almost 9 years ago
Test case character-map-025 added to XSLT3 test suite.
Updated by O'Neil Delpratt almost 9 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 9.7.0.3 added
Bug fix applied in the 9.7.0.3 maintenance release. Leave open until the fix is applied on the 9.6 branch.
Updated by O'Neil Delpratt over 8 years ago
- Status changed from Resolved to Closed
- Fixed in Maintenance Release 9.6.0.9 added
- Fixed in Maintenance Release deleted (9.7.0.3)
Bug fix applied in the Saxon 9.6.0.9 maintenance release.
Updated by O'Neil Delpratt over 8 years ago
- Fixed in Maintenance Release 9.7.0.3 added