Bug #757

closed

contains() with an accent-blind collation

Added by Anonymous about 18 years ago. Updated about 12 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: XPath conformance
Sprint/Milestone: -
% Done: 0%
Legacy ID: sf-1444006

Description

SourceForge user: mhkay

When contains() and other similar functions are used with an accent-blind collation, accents are not ignored as they should be. For example,

contains("télé", "tele", "http://saxon.sf.net/collation?lang=fr-FR;strength=primary")

returns false.

The cause is an undocumented behaviour of the JDK RuleBasedCollator class: with this kind of collation, the stream of collation elements returned by the CollationElementIterator includes zero values where the accents occur, and the application (i.e. Saxon) is apparently expected to ignore these zero values. The attached file is a new version of net.sf.saxon.sort.RuleBasedSubstringMatcher, modified to behave this way.
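The idea of skipping the zero-valued collation elements can be sketched as follows. This is only an illustration, not the attached RuleBasedSubstringMatcher: the class AccentBlindContains and its methods are hypothetical names, and the sketch assumes that Collator.getInstance(Locale.FRANCE) returns a RuleBasedCollator, as it does in the standard JDK.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not Saxon's RuleBasedSubstringMatcher.
final class AccentBlindContains {

    // Assumption: the JDK supplies a RuleBasedCollator for this locale.
    static RuleBasedCollator frenchPrimary() {
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY); // accent-blind (and case-blind)
        return c;
    }

    // Collation elements of s, with the zero (ignorable) elements dropped.
    static List<Integer> significantElements(RuleBasedCollator collator, String s) {
        List<Integer> elements = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            if (e != 0) {
                elements.add(e);
            }
        }
        return elements;
    }

    // True if the significant elements of sub occur as a contiguous run in those of s.
    static boolean contains(RuleBasedCollator collator, String s, String sub) {
        return Collections.indexOfSubList(
                significantElements(collator, s),
                significantElements(collator, sub)) >= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator c = frenchPrimary();
        // First argument is "tele" with an acute accent on each e.
        System.out.println(contains(c, "t\u00e9l\u00e9", "tele"));
    }
}
```

Comparing the filtered element lists, rather than the raw iterator output, is what lets an accented string match its unaccented form in this sketch.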

The functions affected are contains, starts-with, ends-with, substring-before, and substring-after.


Files

RuleBasedSubstringMatcher.java (9.56 KB), Anonymous, 2006-03-06 10:26
#1

Updated by Anonymous about 18 years ago

SourceForge user: mhkay

Note that this change has some unexpected consequences. For example, in a collation with strength=primary, "-" is an ignorable character for collation purposes, and is therefore represented in the sequence of collation units by a zero value. The effect is that substring-before("in-scope", "-") returns "", because the "-" matches an empty string. This behaviour, though strange, is correct according to the spec.
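A rough sketch of how this consequence arises, under the same assumptions as the description (zero-valued elements are skipped). The class AccentBlindSubstringBefore is a hypothetical name, not Saxon code, and the cast assumes the JDK returns a RuleBasedCollator for the locale. If the search string consists only of ignorable characters, its filtered element list is empty, so it matches at position 0 and the result is the empty string.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not Saxon's implementation.
final class AccentBlindSubstringBefore {

    // Assumption: the JDK supplies a RuleBasedCollator for this locale.
    static RuleBasedCollator frenchPrimary() {
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    static String substringBefore(RuleBasedCollator collator, String s, String sub) {
        // Non-zero elements of s, each paired with the character offset read before it.
        List<Integer> hay = new ArrayList<>();
        List<Integer> hayOffsets = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        while (true) {
            int offset = it.getOffset();
            int e = it.next();
            if (e == CollationElementIterator.NULLORDER) {
                break;
            }
            if (e != 0) {
                hay.add(e);
                hayOffsets.add(offset);
            }
        }

        // Non-zero elements of the search string.
        List<Integer> needle = new ArrayList<>();
        it = collator.getCollationElementIterator(sub);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            if (e != 0) {
                needle.add(e);
            }
        }

        // Only ignorable elements: the search string matches the empty string at position 0.
        if (needle.isEmpty()) {
            return "";
        }
        int i = Collections.indexOfSubList(hay, needle);
        // XPath substring-before returns the zero-length string when there is no match.
        return i < 0 ? "" : s.substring(0, hayOffsets.get(i));
    }

    public static void main(String[] args) {
        RuleBasedCollator c = frenchPrimary();
        System.out.println("[" + substringBefore(c, "in-scope", "-") + "]");
    }
}
```

The offset bookkeeping here is deliberately simple and does not try to handle matches that begin in the middle of a character that expands to several collation elements.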
