Bug #757
closed
contains() with an accent-blind collation
Description
SourceForge user: mhkay
When contains() and other similar functions are used
with an accent-blind collation, accents are not ignored
as they should be. For example,
contains("télé", "tele",
"http://saxon.sf.net/collation?lang=fr-FR;strength=primary")
returns false.
The reason for the problem is an undocumented behaviour
of the JDK RuleBasedCollator class: with this kind of
collation, the stream of collation elements returned by
the CollationElementIterator includes zero values where
the accents occur, and the application (i.e. Saxon) is
apparently expected to ignore these zero values. The
attached file is a new version of
net.sf.saxon.sort.RuleBasedSubstringMatcher modified to
behave this way.
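
The following is a minimal sketch of the idea, not the attached Saxon class. It assumes the JDK hands back a RuleBasedCollator for the fr-FR locale, and the class and method names (AccentBlindContains, significantElements, contains) are hypothetical. It drops the zero collation elements before looking for the search string's elements inside the source string's elements.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not net.sf.saxon.sort.RuleBasedSubstringMatcher.
class AccentBlindContains {

    // Assumption: Collator.getInstance returns a RuleBasedCollator for fr-FR.
    static RuleBasedCollator primaryCollator() {
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY); // accent-blind and case-blind
        return c;
    }

    // Returns the collation elements of s, skipping the zero values that
    // the iterator emits where accents (and other ignorables) occur.
    static List<Integer> significantElements(RuleBasedCollator c, String s) {
        List<Integer> out = new ArrayList<>();
        CollationElementIterator it = c.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            if (e != 0) {
                out.add(e);
            }
        }
        return out;
    }

    // True if the significant elements of needle occur as a contiguous
    // run within the significant elements of haystack.
    static boolean contains(String haystack, String needle) {
        RuleBasedCollator c = primaryCollator();
        List<Integer> h = significantElements(c, haystack);
        List<Integer> n = significantElements(c, needle);
        return Collections.indexOfSubList(h, n) >= 0;
    }

    public static void main(String[] args) {
        // "télé" written with escapes to avoid source-encoding issues
        System.out.println(contains("t\u00e9l\u00e9", "tele"));
    }
}
```

Comparing filtered element lists is enough for a yes/no answer such as contains(); functions that return part of the string (substring-before, substring-after) would also need to map element positions back to character offsets, which this sketch does not attempt.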
The functions affected are contains, starts-with,
ends-with, substring-before, and substring-after.
Files
Updated by Anonymous over 18 years ago
SourceForge user: mhkay
Note that this change has some unexpected consequences. For
example in a collation with strength=primary, "-" is an
ignorable character for collation purposes, and is therefore
represented in the sequence of collation elements by a zero
value. The effect is that substring-before("in-scope", "-")
returns "", because the "-" matches an empty string. This
behaviour, though strange, is correct according to the spec.
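
One way to see this effect is to check whether a string yields only zero collation elements at primary strength. The sketch below assumes the JDK returns a RuleBasedCollator for fr-FR; the class and method names (IgnorableAtPrimary, isIgnorable) are hypothetical and are not part of Saxon.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical illustration only; not part of Saxon.
class IgnorableAtPrimary {

    // True if every collation element of s is zero at primary strength,
    // i.e. the string behaves like the empty string once zeros are ignored.
    static boolean isIgnorable(String s) {
        // Assumption: the JDK returns a RuleBasedCollator for this locale.
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY);
        CollationElementIterator it = c.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            if (e != 0) {
                return false; // found a significant element
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIgnorable("-"));
        System.out.println(isIgnorable("in-scope"));
    }
}
```

A search string for which isIgnorable is true matches at the very start of any source string, which is why substring-before returns "" in the case described above.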