Greek perispomeni and normalize-unicode
Raised by Ryan Baumann on the SourceForge saxon-help list.
Various forms of characters with perispomeni seem to be handled
incorrectly with normalize-unicode (running as XSLT 2.0 in Saxon HE
normalize-unicode('ῇ̓','NFC') (U+03B7 U+0342 U+0313 U+0345) is ῇ̓
(U+1FC6 U+0313 U+0345)
correct NFC: ῇ̓ (U+1FC7 U+0313)
normalize-unicode('ῇ̓','NFD') (U+1FC7 U+0313) is ῇ̓ (U+03B7 U+0342
normalize-unicode('ῇ̓','NFD') (U+1FC6 U+0313 U+0345) is ῇ̓ (U+03B7
U+0342 U+0313 U+0345)
Other instances of incorrect NFC normalization (normalize-unicode on
these characters is idempotent):
ῇ̔ (U+1FC6 U+0314 U+0345) should be ῇ̔ (U+1FC7 U+0314)
ῷ̔ (U+1FF6 U+0314 U+0345) should be ῷ̔ (U+1FF7 U+0314)
Ὧ (U+1F69 U+0342) should be Ὧ (U+1F6F)
Ἆ (U+1F08 U+0342) should be Ἆ (U+1F0E)
Checked against both Java's java.text.Normalizer and Perl's
Unicode::Normalize as my references for "correct" NFC normalization.
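The same cross-check can be repeated with a third, independent normalizer. The following is a minimal sketch using Python's standard unicodedata module, which is not one of the references used in the report. The helper names (codepoints, check_cases) and the CASES table are made up for illustration; the code point sequences are taken from the cases listed above.

```python
import unicodedata

def codepoints(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# (input, NFC form the report gives as correct), copied from the cases above.
CASES = [
    ("\u03B7\u0342\u0313\u0345", "\u1FC7\u0313"),
    ("\u1FC6\u0313\u0345", "\u1FC7\u0313"),
    ("\u1FC6\u0314\u0345", "\u1FC7\u0314"),
    ("\u1FF6\u0314\u0345", "\u1FF7\u0314"),
    ("\u1F69\u0342", "\u1F6F"),
    ("\u1F08\u0342", "\u1F0E"),
]

def check_cases(cases=CASES):
    """Return (input, actual NFC, expected NFC) rows as code point strings."""
    rows = []
    for source, expected in cases:
        actual = unicodedata.normalize("NFC", source)
        rows.append((codepoints(source), codepoints(actual), codepoints(expected)))
    return rows

if __name__ == "__main__":
    for src, actual, expected in check_cases():
        status = "ok" if actual == expected else "MISMATCH"
        print(f"{src} -> {actual} (report expects {expected}) {status}")
```

Any row flagged MISMATCH would mean this normalizer disagrees with the form the report gives as correct.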
The problem seems to be fairly general for any character that has a
precomposed perispomeni form. There are probably more cases than the
ones listed here; you can see the results of running java.text.Normalizer
against a large corpus of Ancient Greek that has already been passed
through normalize-unicode in this commit:
#1 Updated by Michael Kay over 8 years ago
My initial suspicion was that this might be a question of which Unicode version is in use, but using tables generated from Unicode 4.0.0 as against Unicode 6.2.0 does not appear to give any difference in the results, and indeed inspection of the UnicodeData.txt file suggests no obvious difference between versions for the relevant characters.
What does seem relevant is that (taking an example) the entry for 1F0E in UnicodeData.txt shows a decomposition to 1F08 0342, where 1F08 itself can be further decomposed to 0391 0313. It looks as if the Saxon data tables generated from UnicodeData.txt don't take this double decomposition into account.
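To illustrate the double decomposition, here is a minimal sketch that follows the canonical decomposition mappings recursively, using Python's unicodedata module as a stand-in for reading UnicodeData.txt directly. The function name full_canonical_decomposition is made up for this example, and the sketch says nothing about how Saxon's tables are actually generated.

```python
import unicodedata

def full_canonical_decomposition(ch):
    """Recursively expand the canonical decomposition mapping of one character.

    Only the mappings are followed; no canonical reordering is applied.
    """
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; they are
    # not part of canonical decomposition, so the character is kept as-is.
    if not mapping or mapping.startswith("<"):
        return [ord(ch)]
    result = []
    for field in mapping.split():
        result.extend(full_canonical_decomposition(chr(int(field, 16))))
    return result

if __name__ == "__main__":
    # Single-step mapping for U+1F0E, then the fully expanded form.
    print(unicodedata.decomposition("\u1F0E"))
    print([f"{cp:04X}" for cp in full_canonical_decomposition("\u1F0E")])
```

A table that stores only the single-step mapping (1F08 0342) has to be applied repeatedly to reach 0391 0313 0342; a table that stores the fully expanded form does not.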
#2 Updated by Michael Kay over 8 years ago
Note: The original Java code from the Unicode Consortium on which the Java implementation is based has been withdrawn:
It's therefore possible that the code contains bugs which have not been fixed.
#4 Updated by Michael Kay over 8 years ago
Noted that in Saxon 9.1, decompose(codepoints-to-string((7944,834))) yields (913, 787, 834) whereas in 9.5, it yields (913, 834, 787). The failure to sort the modifiers into canonical order then causes a failure to compose the pairs during the composition phase.
Debugging shows that in 9.1, data.canonicalClass(834) is 230, while in 9.5, the same expression returns 220. This seems to account for the difference in the result of the sort. 834 = 0x342, for which Unicode 3.0 UnicodeData.txt and subsequent versions all have
0342;COMBINING GREEK PERISPOMENI;Mn;230;NSM;;;;;N;;;;;
where the field 230 is the canonical class. So the data tables in 9.5 appear to be wrong.
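The effect of the wrong class value on the sort can be reproduced in isolation. The sketch below is not Saxon's code: canonical_order is a simplified canonical reordering step written for illustration, and faulty_class is a hypothetical stand-in that returns 220 for U+0342, mimicking the value observed in 9.5. The correct classes come from Python's unicodedata module.

```python
import unicodedata

UNICODE_DATA_LINE = "0342;COMBINING GREEK PERISPOMENI;Mn;230;NSM;;;;;N;;;;;"

def canonical_class_from_line(line):
    """Return (code point, canonical combining class) from a UnicodeData.txt line."""
    fields = line.split(";")
    # Field 0 is the code point in hex; field 3 is the canonical combining class.
    return int(fields[0], 16), int(fields[3])

def canonical_order(codepoints, combining_class):
    """Stable-sort each run of non-starters by combining class."""
    out, run = [], []
    for cp in codepoints:
        if combining_class(cp) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=combining_class))
            run = []
            out.append(cp)
        else:
            run.append(cp)
    out.extend(sorted(run, key=combining_class))
    return out

def correct_class(cp):
    return unicodedata.combining(chr(cp))

def faulty_class(cp):
    # Hypothetical: the wrong value reported for 9.5, for U+0342 only.
    return 220 if cp == 0x342 else unicodedata.combining(chr(cp))

if __name__ == "__main__":
    decomposed = [913, 787, 834]  # full decomposition of (7944, 834)
    print(canonical_class_from_line(UNICODE_DATA_LINE))
    print(canonical_order(decomposed, correct_class))
    print(canonical_order(decomposed, faulty_class))
```

With class 230 for both marks the stable sort leaves the order (913, 787, 834) unchanged, while a class of 220 for 834 moves it ahead of 787, which matches the order seen in 9.5.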