Bug #1842
Greek perispomeni and normalize-unicode
Status: closed, 100% done
Description
Raised by Ryan Baumann on the SourceForge saxon-help list.
Various forms of characters with perispomeni seem to be handled
incorrectly with normalize-unicode (running as XSLT 2.0 in Saxon HE
9.5.1.1).
normalize-unicode('ῇ̓','NFC') (U+03B7 U+0342 U+0313 U+0345) is ῇ̓
(U+1FC6 U+0313 U+0345)
correct NFC: ῇ̓ (U+1FC7 U+0313)
normalize-unicode('ῇ̓','NFD') (U+1FC7 U+0313) is ῇ̓ (U+03B7 U+0342
U+0345 U+0313)
normalize-unicode('ῇ̓','NFD') (U+1FC6 U+0313 U+0345) is ῇ̓ (U+03B7
U+0342 U+0313 U+0345)
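The two NFD inputs above are canonically equivalent, so a conforming normalizer should decompose both to the same code point sequence. A minimal sketch of that cross-check with java.text.Normalizer follows; the class name NfdEquivalenceCheck and its helper methods are illustrative, not part of Saxon or of the original report.

```java
import java.text.Normalizer;

public class NfdEquivalenceCheck {
    // Reference NFD via the JDK normalizer.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render a string as space-separated code points, e.g. "U+03B7 U+0342".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u1FC7\u0313";        // first NFD input from the report
        String b = "\u1FC6\u0313\u0345";  // second NFD input from the report
        System.out.println(codePoints(a) + " -> " + codePoints(nfd(a)));
        System.out.println(codePoints(b) + " -> " + codePoints(nfd(b)));
        System.out.println("same NFD: " + nfd(a).equals(nfd(b)));
    }
}
```

Comparing these lines with the normalize-unicode results quoted above shows which of the two Saxon outputs departs from the reference ordering of the combining marks.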
Other instances of incorrect NFC normalization (normalize-unicode
returns each of these inputs unchanged instead of composing it):
ῇ̔ (U+1FC6 U+0314 U+0345) should be ῇ̔ (U+1FC7 U+0314)
ῷ̔ (U+1FF6 U+0314 U+0345) should be ῷ̔ (U+1FF7 U+0314)
Ὧ (U+1F69 U+0342) should be Ὧ (U+1F6F)
Ἆ (U+1F08 U+0342) should be Ἆ (U+1F0E)
I checked these against both Java's java.text.Normalizer and Perl's
Unicode::Normalize as my references for "correct" NFC normalization.
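For anyone wanting to reproduce the reference side of this comparison, here is a rough sketch of the java.text.Normalizer check over the NFC inputs listed above. The class name PerispomeniCheck and its helpers are made up for illustration; only Normalizer.normalize and Normalizer.Form.NFC come from the JDK.

```java
import java.text.Normalizer;

public class PerispomeniCheck {
    // Reference NFC via the JDK normalizer.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Render a string as space-separated code points, e.g. "U+1FC7 U+0313".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Inputs taken from the report, written as escapes so the
        // source file itself cannot be silently normalized.
        String[] inputs = {
            "\u03B7\u0342\u0313\u0345",
            "\u1FC6\u0314\u0345",
            "\u1FF6\u0314\u0345",
            "\u1F69\u0342",
            "\u1F08\u0342",
        };
        for (String in : inputs) {
            System.out.println(codePoints(in) + " -> " + codePoints(nfc(in)));
        }
    }
}
```

The right-hand side of each printed line can then be compared with what normalize-unicode($s, 'NFC') returns for the same input in Saxon.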
The problem seems to be fairly general for any character which has a
pre-combined perispomeni form. There are probably other cases beyond
those listed here. You can see the results of running
java.text.Normalizer against a large corpus of Ancient Greek that has
already been passed through normalize-unicode in this commit:
https://github.com/ryanfb/idp.data/commit/bcb7dd6223fb50c48f62027761e8deced2574ed7
-Ryan