Bug #2622

closed

Combining use-character-maps and normalization-form="NFC" attributes produces unwanted output

Added by Lancelot Meurillon about 8 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Serialization
Sprint/Milestone:
Start date: 2016-02-16
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.4, 9.5, 9.6, 9.7
Fix Committed on Branch: 9.6, 9.7
Fixed in Maintenance Release:
Platforms:

Description

(Originally posted to XSL-List: The Open Forum on XSL, a mailing list managed by Mulberry Technologies, Inc.)

Dear all,

For various reasons, I need to escape specific characters in the output and also produce Unicode normalized to NFC.

Here is my input:

<inputText>”; ;</inputText> <!-- which is : (U+201D U+003B U+0020 U+003B) -->

Here are the output properties of my stylesheet:

<xsl:output method="xml" version="1.0" encoding="UTF-8"
        indent="yes" omit-xml-declaration="no" 
        use-character-maps="unsupported_characters"
        normalization-form="NFC"
    />

The character-map definition:

<xsl:character-map name="unsupported_characters">
        <xsl:output-character character="&#8220;" string="&quot;"/>
        <xsl:output-character character="&#8221;" string="&quot;"/>
    </xsl:character-map>

With this template:

<xsl:template match="/">
    <shortDescription><xsl:value-of select="inputText"/></shortDescription>
</xsl:template>

Now the output:

<shortDescription>"; ;</shortDescription> <!-- which is (U+0022  U+037E  U+0020  U+003B) -->

Why is the semicolon (U+003B) immediately after the escaped quote translated into a Greek question mark (U+037E), while the next semicolon is kept?

But the real question is: why is my semicolon converted into a Greek question mark at all?

Further observations:

1- If I do not use the character map, the result is:

<shortDescription>”; ;</shortDescription> <!-- which is (U+201D U+003B U+0020 U+003B) -->

2- If I do not normalize the Unicode (no normalization-form="NFC" attribute), the result is:

<shortDescription>"; ;</shortDescription> <!-- which is (U+0022 U+003B U+0020 U+003B) -->

3- The same behavior occurs with other character combinations:

  • double comma quotation mark + K
<inputText>”K</inputText> <!-- which is: (U+201D U+004B) -->
<!-- output -->
<shortDescription>"K</shortDescription> <!-- which is (U+0022 U+212A) -->
  • double comma quotation mark + Chinese glyph
<inputText>”力</inputText> <!-- which is: (U+201D U+529B) -->
<!-- output -->
<shortDescription>"力</shortDescription> <!-- which is (U+0022 U+F98A) -->

4- In addition, Wolfgang L. answered in the same thread:

"Even the solitary identity transformation of the semicolon 0x3B

 <xsl:output-character character=";" string=";"/>

results in a translation to U+037E of all semicolons. Seems to be a bug.

SaxonHE 9.6.0.1"

Thanks for the help

Lancelot
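
As a sanity check, the three substitutions reported above can be examined with Python's standard `unicodedata` module. This is only an illustration, not part of the original report and not Saxon code. It shows that U+037E, U+212A and U+F98A each have a singleton canonical decomposition to the expected character, so a correct NFC pass maps them to U+003B, U+004B and U+529B and never in the other direction. The names `REPORTED` and `nfc` are made up for this sketch.

```python
import unicodedata

# expected character -> character that appeared in the reported output
REPORTED = {
    "\u003B": "\u037E",  # semicolon -> Greek question mark
    "\u004B": "\u212A",  # Latin capital K -> Kelvin sign
    "\u529B": "\uF98A",  # CJK ideograph -> CJK compatibility ideograph
}


def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


for expected, emitted in REPORTED.items():
    # Canonical decomposition of the emitted character, as a hex code point string.
    decomposition = unicodedata.decomposition(emitted)
    print(
        f"U+{ord(emitted):04X} decomposes to U+{decomposition}; "
        f"NFC(U+{ord(expected):04X}) = U+{ord(nfc(expected)):04X}"
    )
```

Because NFC leaves U+003B, U+004B and U+529B unchanged, the reported output is not explained by normalization alone.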

Actions #1

Updated by Michael Kay about 8 years ago

  • Description updated (diff)
  • Category set to Serialization
  • Status changed from New to In Progress
  • Assignee set to Michael Kay

Will investigate. I have reproduced your results on Saxon-EE 9.7.0.2.

Note: the spec states that character mapping is applied first, and that normalization is then applied only to those characters that were not affected by character mapping.
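
The ordering described in this note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Saxon's implementation: the function name `serialize_text` and the dictionary-based character map are hypothetical, and markup escaping is ignored. Replacement strings from the map are emitted verbatim, and only runs of unmapped characters are normalized.

```python
import unicodedata


def serialize_text(text, char_map, form="NFC"):
    """Apply a character map, then normalize only the unmapped runs."""
    out = []
    pending = []  # current run of characters not touched by the character map

    def flush():
        # Normalize the accumulated unmapped run and append it to the output.
        if pending:
            out.append(unicodedata.normalize(form, "".join(pending)))
            pending.clear()

    for ch in text:
        if ch in char_map:
            flush()
            out.append(char_map[ch])  # mapped output bypasses normalization
        else:
            pending.append(ch)
    flush()
    return "".join(out)


# Character map from the report: both curly double quotes become a straight quote.
char_map = {"\u201C": '"', "\u201D": '"'}
print(serialize_text("\u201D; ;", char_map))
```

With the input from the report (U+201D U+003B U+0020 U+003B), this sketch yields U+0022 U+003B U+0020 U+003B, which is the output the reporter expected.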

Actions #2

Updated by Michael Kay about 8 years ago

I suspect we have no test cases for using character maps in conjunction with Unicode normalization and that the combination is simply not working.

After applying character maps, we insert NULL characters into the character string to mark sections of the string in which output escaping (and Unicode normalization) should be disabled. But the Unicode normalizer does not appear to recognize these markers and treats them as ordinary characters to be normalized, which produces apparently illogical results.
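
A marker-aware normalizer along the lines described here might look like the following Python sketch. It is an assumption-laden illustration, not the patch to UnicodeNormalizer.java: it assumes the markers come in balanced pairs around each piece of character-map output, it drops the markers from its result (a real pipeline may need to pass them on to later stages), and the names `MARKER` and `normalize_outside_markers` are invented for the example.

```python
import unicodedata

MARKER = "\x00"  # assumed delimiter placed around character-map output


def normalize_outside_markers(marked, form="NFC"):
    """Normalize only the text that lies outside MARKER-delimited regions."""
    # With balanced markers, splitting yields alternating segments:
    # even indexes are ordinary text, odd indexes are character-map output.
    segments = marked.split(MARKER)
    out = []
    for index, segment in enumerate(segments):
        if index % 2:
            out.append(segment)  # protected region: copy verbatim
        else:
            out.append(unicodedata.normalize(form, segment))
    return "".join(out)


# The reported case after character mapping: a protected quote, then "; ;".
marked = MARKER + '"' + MARKER + "; ;"
print(normalize_outside_markers(marked))
```

The key design point is that the markers are consumed as delimiters and never handed to the normalization routine as if they were text.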

Actions #3

Updated by Michael Kay about 8 years ago

  • Status changed from In Progress to Resolved
  • Applies to branch 9.4, 9.5, 9.6, 9.7 added
  • Fix Committed on Branch 9.6, 9.7 added

I have committed a patch (to UnicodeNormalizer.java) on the 9.6 and 9.7 branches.

Actions #4

Updated by Michael Kay about 8 years ago

Test case character-map-025 added to XSLT3 test suite.

Actions #5

Updated by O'Neil Delpratt about 8 years ago

  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.7.0.3 added

Bug fix applied in the 9.7.0.3 maintenance release. Leave open until the fix is applied in the 9.6 branch.

Actions #6

Updated by O'Neil Delpratt almost 8 years ago

  • Status changed from Resolved to Closed
  • Fixed in Maintenance Release 9.6.0.9 added
  • Fixed in Maintenance Release deleted (9.7.0.3)

Bug fix applied in the Saxon 9.6.0.9 maintenance release.

Actions #7

Updated by O'Neil Delpratt almost 8 years ago

  • Fixed in Maintenance Release 9.7.0.3 added

Actions #8

Updated by O'Neil Delpratt almost 8 years ago

  • Sprint/Milestone set to 9.7.0.3
