Bug #1842

closed

Greek perispomeni and normalize-unicode

Added by Michael Kay almost 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: XPath conformance
Sprint/Milestone: -
Start date: 2013-07-14
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Raised by Ryan Baumann on the SourceForge saxon-help list.

Various forms of characters with perispomeni seem to be handled incorrectly with normalize-unicode (running as XSLT 2.0 in Saxon HE 9.5.1.1).

normalize-unicode('ῇ̓','NFC') (U+03B7 U+0342 U+0313 U+0345) is ῇ̓ (U+1FC6 U+0313 U+0345)

correct NFC: ῇ̓ (U+1FC7 U+0313)

normalize-unicode('ῇ̓','NFD') (U+1FC7 U+0313) is ῇ̓ (U+03B7 U+0342 U+0345 U+0313)

normalize-unicode('ῇ̓','NFD') (U+1FC6 U+0313 U+0345) is ῇ̓ (U+03B7 U+0342 U+0313 U+0345)
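For comparison, the first two cases above can be reproduced with a third, independent normalizer. The sketch below uses Python's standard `unicodedata` module, which is an assumption of this example and not one of the references named in the report; the helper `codepoints` is hypothetical. It also prints the canonical combining classes of the three marks, which determine the order that NFD must produce.

```python
import unicodedata

def codepoints(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Eta + perispomeni + psili + ypogegrammeni, fully decomposed.
decomposed = "\u03B7\u0342\u0313\u0345"
nfc = unicodedata.normalize("NFC", decomposed)
print(codepoints(nfc))  # U+1FC7 U+0313

# NFD of the precomposed form followed by psili.
nfd = unicodedata.normalize("NFD", "\u1FC7\u0313")
print(codepoints(nfd))  # U+03B7 U+0342 U+0313 U+0345

# Canonical combining classes of U+0342, U+0313, U+0345.
classes = [unicodedata.combining(c) for c in "\u0342\u0313\u0345"]
print(classes)  # [230, 230, 240]
```

Because U+0345 has a higher combining class (240) than U+0313 (230), canonical ordering places U+0313 before U+0345 in NFD output.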

Other instances of incorrect NFC normalization (normalize-unicode returns each of these sequences unchanged):

ῇ̔ (U+1FC6 U+0314 U+0345) should be ῇ̔ (U+1FC7 U+0314)

ῷ̔ (U+1FF6 U+0314 U+0345) should be ῷ̔ (U+1FF7 U+0314)

Ὧ (U+1F69 U+0342) should be Ὧ (U+1F6F)

Ἆ (U+1F08 U+0342) should be Ἆ (U+1F0E)

I checked against both Java's java.text.Normalizer and Perl's Unicode::Normalize as my references for "correct" NFC normalization.
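The four additional input/expected pairs listed above can be cross-checked the same way. This is a minimal sketch that assumes Python's `unicodedata` as the reference normalizer (the report itself used Java and Perl); the names `cases` and `results` are illustrative only.

```python
import unicodedata

# (input sequence, expected NFC result) pairs taken from the report.
cases = [
    ("\u1FC6\u0314\u0345", "\u1FC7\u0314"),
    ("\u1FF6\u0314\u0345", "\u1FF7\u0314"),
    ("\u1F69\u0342", "\u1F6F"),
    ("\u1F08\u0342", "\u1F0E"),
]

# True where the reference normalizer agrees with the expected value.
results = [unicodedata.normalize("NFC", src) == expected
           for src, expected in cases]

for (src, expected), ok in zip(cases, results):
    shown = " ".join(f"U+{ord(c):04X}" for c in src)
    print(f"{shown}: {'matches expected NFC' if ok else 'differs'}")
```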

The problem seems to be fairly general for any character that has a precomposed perispomeni form. There are probably more cases than the ones listed here; you can see the results of running java.text.Normalizer against a large corpus of Ancient Greek that has already been passed through normalize-unicode in this commit:

https://github.com/ryanfb/idp.data/commit/bcb7dd6223fb50c48f62027761e8deced2574ed7

-Ryan
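One way to look for further affected characters without a corpus is to enumerate every precomposed perispomeni character and check that it survives an NFD then NFC round trip. The sketch below is only an illustration under stated assumptions: it scans the Greek Extended block (U+1F00 to U+1FFF), uses Python's `unicodedata` as the reference, and takes the normalizer under test as a hypothetical `normalize(form, text)` callable. Wiring that callable to Saxon's normalize-unicode is not shown.

```python
import unicodedata

def perispomeni_precomposed():
    """Yield Greek Extended characters whose name mentions PERISPOMENI
    and which have a canonical (non-compatibility) decomposition."""
    for cp in range(0x1F00, 0x2000):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        decomp = unicodedata.decomposition(ch)
        if "PERISPOMENI" in name and decomp and not decomp.startswith("<"):
            yield ch

def roundtrip_failures(normalize):
    """Return characters that `normalize` fails to recompose from NFD.

    `normalize` is a hypothetical hook with the signature
    normalize(form, text) -> str, standing in for the normalizer under test.
    """
    failures = []
    for ch in perispomeni_precomposed():
        decomposed = unicodedata.normalize("NFD", ch)
        if normalize("NFC", decomposed) != ch:
            failures.append(ch)
    return failures

# With the reference normalizer itself, no failures are expected.
failures = roundtrip_failures(unicodedata.normalize)
print(len(failures))
```

Passing a wrapper around the implementation under test in place of `unicodedata.normalize` would list the precomposed characters it fails to produce.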
