Bug #2622: Combining use-character-maps and normalization-form="NFC" attributes produce unwanted output - Saxon - Saxonica Developer Community

Bug #2622

closed

Combining use-character-maps and normalization-form="NFC" attributes produce unwanted output

Added by Lancelot Meurillon almost 9 years ago. Updated over 8 years ago.

Status: Closed
Priority: Normal
Assignee: Michael Kay
Category: Serialization
Sprint/Milestone: 9.7.0.3
Start date: 2016-02-16
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.4, 9.5, 9.6, 9.7
Fix Committed on Branch: 9.6, 9.7
Fixed in Maintenance Release: 9.7.0.3, 9.6.0.9
Platforms:

Description

(Initially posted to XSL-List: The Open Forum on XSL, a mailing list managed by Mulberry Technologies, Inc.)

Dear all,

For various reasons, I need to escape specific characters in the output and also need to produce Unicode normalized to NFC.

Here is my input:

<inputText>”; ;</inputText> <!-- which is : (U+201D U+003B U+0020 U+003B) -->

Here are the output properties of my stylesheet:

<xsl:output method="xml" version="1.0" encoding="UTF-8"
        indent="yes" omit-xml-declaration="no" 
        use-character-maps="unsupported_characters"
        normalization-form="NFC"
    />

The character-map definition:

<xsl:character-map name="unsupported_characters">
    <xsl:output-character character="&#8220;" string="&quot;"/>
    <xsl:output-character character="&#8221;" string="&quot;"/>
</xsl:character-map>

With this template:

<xsl:template match="/">
    <shortDescription><xsl:value-of select="inputText"/></shortDescription>
</xsl:template>

Now the output:

<shortDescription>"; ;</shortDescription> <!-- which is (U+0022  U+037E  U+0020  U+003B) -->

Why is the semicolon (U+003B) immediately after the escaped quote translated into a Greek question mark (U+037E), while the next semicolon is left unchanged?

But the real question is: why is my semicolon converted into a Greek question mark at all?

Further observations:

1- If I do not use the character map, the result is:

<shortDescription>”; ;</shortDescription> <!-- which is (U+201D U+003B U+0020 U+003B) -->

2- If I do not normalize the Unicode (that is, without the normalization-form="NFC" attribute), the result is:

<shortDescription>"; ;</shortDescription> <!-- which is (U+0022 U+003B U+0020 U+003B) -->

3- The same behavior occurs with other character combinations:

Right double quotation mark + K

<inputText>”K</inputText> <!-- which is : (U+201D U+004B) -->
<!-- output -->
<shortDescription>"K</shortDescription> <!-- which is (U+0022  U+212A) -->

Right double quotation mark + Chinese character

<inputText>”力</inputText> <!-- which is : (U+201D U+529B) -->
<!-- output -->
<shortDescription>"力</shortDescription> <!-- which is (U+0022  U+F98A) -->

4- In addition, Wolfgan L. answered in the same thread:

"Even the solitary identity transformation of the semicolon 0x3B

 <xsl:output-character character=";" string=";"/>

results in a translation to U+037E of all semicolons. Seems to be a bug.

SaxonHE 9.6.0.1"

Thanks for the help

Lancelot
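
For reference, NFC on its own should never produce any of the three unexpected characters reported above: each has a singleton canonical decomposition to the ordinary character, so a conforming normalizer maps them the other way round. A minimal check, assuming Python's standard `unicodedata` module (Python is used here purely for illustration and is not part of the original report):

```python
import unicodedata

# Unexpected character seen in the output -> ordinary character from the input.
PAIRS = {
    "\u037e": "\u003b",  # GREEK QUESTION MARK -> SEMICOLON
    "\u212a": "\u004b",  # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\uf98a": "\u529b",  # CJK COMPATIBILITY IDEOGRAPH-F98A -> U+529B
}


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


for odd, plain in PAIRS.items():
    # NFC folds the unusual character into the ordinary one...
    assert nfc(odd) == plain
    # ...and leaves the ordinary one untouched.
    assert nfc(plain) == plain
    print("U+%04X normalizes to U+%04X" % (ord(odd), ord(nfc(odd))))
```

This suggests the serializer is not simply "over-normalizing"; the observed output is the reverse of what NFC would do to these characters.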

Updated by Michael Kay almost 9 years ago

Description updated (diff)
Category set to Serialization
Status changed from New to In Progress
Assignee set to Michael Kay

Will investigate. I have reproduced your results on Saxon-EE 9.7.0.2.

Note: the spec states that character mapping is applied first, and that normalization is then applied only to those characters that were not affected by character mapping.
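
The ordering described above can be sketched as follows. This is only an illustration of one reading of that rule, written in Python rather than taken from Saxon; the names `CHARACTER_MAP` and `serialize_text` are hypothetical, and the map mirrors the one in the report.

```python
import unicodedata

# Hypothetical stand-in for the character map in the report:
# both curly double quotes are replaced by a straight quote.
CHARACTER_MAP = {"\u201c": '"', "\u201d": '"'}


def serialize_text(text, char_map):
    """Apply the character map, then NFC-normalize only the unmapped runs."""
    out = []
    run = []  # consecutive characters that the map did not touch

    def flush_run():
        # Normalize the pending unmapped run as a unit, then reset it.
        if run:
            out.append(unicodedata.normalize("NFC", "".join(run)))
            del run[:]

    for ch in text:
        if ch in char_map:
            flush_run()
            out.append(char_map[ch])  # replacement is emitted verbatim
        else:
            run.append(ch)
    flush_run()
    return "".join(out)


result = serialize_text("\u201d; ;", CHARACTER_MAP)
print([hex(ord(c)) for c in result])
```

Under this reading, the input from the report yields U+0022 U+003B U+0020 U+003B, which is the output the reporter expected.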

Updated by Michael Kay almost 9 years ago

I suspect we have no test cases for using character maps in conjunction with Unicode normalization and that the combination is simply not working.

After applying character maps, we insert NULL characters into the character string to mark sections in which output escaping (and Unicode normalization) should be disabled. However, the Unicode normalizer does not appear to recognize these markers and treats them as ordinary characters to be normalized, which produces apparently illogical results.
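
A marker-aware normalizer of the kind described above might look roughly like the sketch below. It is not Saxon's code: it assumes that each mapped replacement is bracketed by a pair of NUL characters and that the text itself contains no NUL (which XML does not allow). The function names `mark_mapped` and `normalize_respecting_markers` are hypothetical.

```python
import unicodedata

NUL = "\x00"  # assumed marker bracketing text produced by a character map


def mark_mapped(text, char_map):
    """Apply the map, wrapping each replacement string in NUL markers."""
    return "".join(
        NUL + char_map[ch] + NUL if ch in char_map else ch for ch in text
    )


def normalize_respecting_markers(marked):
    """NFC-normalize only the text outside NUL...NUL sections."""
    parts = marked.split(NUL)
    # With balanced markers, even-indexed parts lie outside marked sections
    # and odd-indexed parts lie inside them.
    for i in range(0, len(parts), 2):
        parts[i] = unicodedata.normalize("NFC", parts[i])
    return NUL.join(parts)


marked = mark_mapped("\u201d; ;", {"\u201d": '"'})
normalized = normalize_respecting_markers(marked)
# A later serialization stage is assumed to strip the markers.
print([hex(ord(c)) for c in normalized.replace(NUL, "")])
```

The point of the sketch is only that the normalizer has to treat the markers as section boundaries rather than as data; how Saxon actually fixed this is not shown here.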
