Bug #3147

Collations

Added by Michael Kay over 2 years ago. Updated 3 months ago.

Status: In Progress
Priority: Low
Sprint/Milestone: -
Start date: 2017-02-27
Due date:
% Done: 0%
Applies to JS Branch: 1.0, Trunk
Fix Committed on JS Branch: Trunk
Fixed in JS Release:
SEF Generated with:
Company: -
Contact person: -
Additional contact persons: -

Description

The default collation is ignored for value comparisons and general comparisons. The documentation doesn't make this clear.

The current logic for codepointCollation.equals() is very inefficient because it first tests whether the string contains non-BMP (astral) characters. That test is necessary only for lt and gt comparisons; for eq and ne the standard JavaScript string comparison works fine (I think).
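To illustrate the distinction, here is a minimal sketch (not the Saxon-JS source; the object shape and method bodies are assumptions). Equality can use plain string identity, while ordering has to walk code points, because UTF-16 code units sort astral characters below U+E000..U+FFFF.

```javascript
// Hypothetical sketch of a codepoint collation; not the actual Saxon-JS code.
const codepointCollation = {
  // Equality does not depend on ordering, so string identity is enough.
  equals(a, b) {
    return a === b;
  },
  // Ordering must compare code points rather than UTF-16 code units.
  compare(a, b) {
    // String iterators yield one code point (possibly a surrogate pair) at a time.
    const ia = a[Symbol.iterator]();
    const ib = b[Symbol.iterator]();
    for (;;) {
      const ra = ia.next();
      const rb = ib.next();
      if (ra.done || rb.done) {
        // Both exhausted: equal. Otherwise the shorter string sorts first.
        return ra.done === rb.done ? 0 : (ra.done ? -1 : 1);
      }
      const diff = ra.value.codePointAt(0) - rb.value.codePointAt(0);
      if (diff !== 0) return diff < 0 ? -1 : 1;
    }
  }
};

// U+FF5E is a BMP character; U+1F600 is astral (code units D83D DE00).
// The built-in "<" compares code units and so gets this pair the wrong way round.
const bmp = "\uFF5E";
const astral = "\u{1F600}";
const builtinSaysLess = bmp < astral;                        // false
const codepointOrder = codepointCollation.compare(bmp, astral); // -1
```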

History

#1 Updated by Debbie Lockett about 2 years ago

  • Status changed from New to In Progress
  • Found in version set to 1.0.0

The Saxon-JS code for handling collations has had a lot of work since the 1.0.0 release. This includes addressing the problems raised in this bug:

  1. The "comparer" objects (which have compare and equals methods) used for value comparisons and general comparisons now check for a specified collation, using the codepoint collation as default.

  2. The codepointCollation.equals() method has been updated as suggested.

Also, "collation" objects now have a standard format, and can be supplied to the SaxonJS.transform() call using the "collations" option. This takes a map (JS object) from collation URIs to collations. A collation is an object with certain methods (where the arguments are JS strings): equals (mandatory), compare, collationKey, contains, startsWith, endsWith, and indexOf.
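As a rough illustration of the shape described above, the following sketch builds a collation object with the listed methods and places it in a map keyed by URI. The exact signatures, the return convention of indexOf, and the example URI are assumptions made for this sketch, not taken from the Saxon-JS source.

```javascript
// Fold only ASCII A-Z to lower case; other characters are left untouched.
const asciiFold = (s) => s.replace(/[A-Z]/g, (c) => c.toLowerCase());

// Hypothetical user-defined collation with the methods named in this note.
// Argument order (string, then search string) is an assumption.
const asciiCaseBlind = {
  equals: (a, b) => asciiFold(a) === asciiFold(b), // the only mandatory method
  compare: (a, b) => {
    // Simplification: compares UTF-16 code units after folding.
    const x = asciiFold(a);
    const y = asciiFold(b);
    return x < y ? -1 : x > y ? 1 : 0;
  },
  collationKey: (s) => asciiFold(s),
  contains: (a, b) => asciiFold(a).includes(asciiFold(b)),
  startsWith: (a, b) => asciiFold(a).startsWith(asciiFold(b)),
  endsWith: (a, b) => asciiFold(a).endsWith(asciiFold(b)),
  indexOf: (a, b) => asciiFold(a).indexOf(asciiFold(b))
};

// Map (plain JS object) from collation URI to collation object.
// The URI below is a made-up example.
const collations = {
  "http://example.com/collation/ascii-case-blind": asciiCaseBlind
};

// Assumed usage, with the other transform options omitted:
//   SaxonJS.transform({ /* ... */ collations: collations });
```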

The following collations are now implemented in Saxon-JS: the Unicode codepoint collation (http://www.w3.org/2005/xpath-functions/collation/codepoint), and the HTML ASCII case-insensitive collation (http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive).

The use of Unicode Collation Algorithm collations (those beginning http://www.w3.org/2013/collation/UCA) is not yet implemented. Currently, if one of these collations is specified, Saxon-JS simply falls back to the default codepoint collation.

The documentation needs to be updated with the above information for the next release.

#2 Updated by Michael Kay about 2 years ago

For UCA collations (and when lang="XX" is specified in xsl:sort), we should attempt to use Intl.Collator to the extent it is available in the browser.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Collator

Query parameters in the UCA collation URI should be handled as follows:

  • fallback - if fallback=no, reject as "unsupported collation"

  • lang - use as first argument to Intl.Collator

  • strength - use to set the sensitivity option

  • caseFirst - use to set the kf option

  • numeric - use to set the kn option

  • alternate - interpret alternate=blanked as the Intl.Collator option ignorePunctuation: true.

The resulting Collator object supports compare() (and therefore equals()), but not contains, startsWith, etc., and there's no mechanism for getting collation keys, so it can't be used in distinct-values() and grouping, unless perhaps we implement group-by by doing a sort followed by group-adjacent.
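The sort-then-group-adjacent idea could look roughly like the sketch below. The helper names collatorCollation and groupBy are hypothetical, and the code assumes a stable Array.prototype.sort (as in current engines). Groups are returned in order of first appearance, which is the order XSLT group-by uses.

```javascript
// Wrap Intl.Collator as a collation offering only compare and equals.
function collatorCollation(lang, options) {
  const collator = new Intl.Collator(lang, options);
  return {
    compare: (a, b) => Math.sign(collator.compare(a, b)),
    equals: (a, b) => collator.compare(a, b) === 0
  };
}

// Group items without collation keys: sort by key, then merge adjacent equals.
function groupBy(items, keyOf, collation) {
  // Remember original positions so first-appearance order can be restored.
  const indexed = items.map((item, index) => ({ item, index }));
  indexed.sort((x, y) => collation.compare(keyOf(x.item), keyOf(y.item)));
  const groups = [];
  for (const entry of indexed) {
    const last = groups[groups.length - 1];
    if (last && collation.equals(keyOf(last[0].item), keyOf(entry.item))) {
      last.push(entry);
    } else {
      groups.push([entry]);
    }
  }
  // With a stable sort, each group's first entry has the group's lowest index.
  groups.sort((g, h) => g[0].index - h[0].index);
  return groups.map((g) => g.map((e) => e.item));
}

// Example: case-insensitive grouping with sensitivity "base".
const baseCollation = collatorCollation("en", { sensitivity: "base" });
const grouped = groupBy(["b", "A", "a", "B"], (s) => s, baseCollation);
// grouped is [["b", "B"], ["A", "a"]]
```

The cost is a sort per grouping operation, but it needs nothing from the collation beyond compare and equals.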

#3 Updated by Michael Kay about 2 years ago

Also of course we should recognize xsl:sort lang="de" and interpret it as a request for the corresponding UCA collation. See tweet from XMLArbyter today.

#4 Updated by Debbie Lockett about 2 years ago

  • Applies to JS Branch 1.0, Trunk added

#5 Updated by Debbie Lockett 3 months ago

  • Description updated (diff)
  • Fix Committed on JS Branch Trunk added

Committed changes on the 2.0 trunk branch to implement support for UCA collations as suggested above (including when lang="XX" is specified in xsl:sort; see the change in Compare.sortKeyProps).

When the only query parameter used in a UCA collation URI is "strength=secondary" (or "strength=2"), then a full caseblind collation is used (Compare.caseblind), to support contains, startsWith, etc. as well as equals and compare.

Otherwise, the collation object uses Intl.Collator, and so only supports equals and compare (so has restricted use as described above).

Details for handling strength parameter:

  • primary|1 => sensitivity: "base"
  • secondary|2 => sensitivity: "accent"
  • tertiary|3 => sensitivity: "variant"
  • quaternary|4|identical|5 => sensitivity: "variant", ignorePunctuation: "false"

The following query parameters are ignored: version, maxVariable, backwards, normalization, caseLevel, reorder; and alternate is only supported if alternate=blanked.
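The parameter handling described in this thread can be sketched as follows. This is an illustration rather than the committed code: the function name ucaToCollatorArgs is hypothetical, the query is assumed to be semicolon-separated keyword=value pairs, numeric is assumed to take yes/no, and ignorePunctuation is set as a boolean (the note above writes "false" in quotes).

```javascript
const UCA_PREFIX = "http://www.w3.org/2013/collation/UCA";

// Strength values mapped to Intl.Collator options, following the list above.
const STRENGTH_OPTIONS = {
  primary: { sensitivity: "base" },
  secondary: { sensitivity: "accent" },
  tertiary: { sensitivity: "variant" },
  quaternary: { sensitivity: "variant", ignorePunctuation: false },
  identical: { sensitivity: "variant", ignorePunctuation: false }
};
// Numeric aliases 1..5 for the named strengths.
["primary", "secondary", "tertiary", "quaternary", "identical"].forEach((name, i) => {
  STRENGTH_OPTIONS[String(i + 1)] = STRENGTH_OPTIONS[name];
});

// Hypothetical helper: turn a UCA collation URI into Intl.Collator arguments.
function ucaToCollatorArgs(uri) {
  if (!uri.startsWith(UCA_PREFIX)) {
    throw new Error("not a UCA collation URI: " + uri);
  }
  const query = uri.slice(UCA_PREFIX.length).replace(/^\?/, "");
  const params = {};
  for (const pair of query.split(";")) {
    const eq = pair.indexOf("=");
    if (eq > 0) params[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  if (params.fallback === "no") {
    throw new Error("unsupported collation: " + uri);
  }
  const options = {};
  if (params.strength in STRENGTH_OPTIONS) {
    Object.assign(options, STRENGTH_OPTIONS[params.strength]);
  }
  if (params.caseFirst === "upper" || params.caseFirst === "lower") {
    options.caseFirst = params.caseFirst;
  }
  if (params.numeric !== undefined) {
    options.numeric = params.numeric === "yes";
  }
  if (params.alternate === "blanked") {
    options.ignorePunctuation = true;
  }
  // version, maxVariable, backwards, normalization, caseLevel and reorder
  // are deliberately not read, so they are ignored.
  return { lang: params.lang, options: options };
}

const args = ucaToCollatorArgs(UCA_PREFIX + "?lang=de;strength=secondary;numeric=yes");
const collator = new Intl.Collator(args.lang, args.options);
```

Here args.lang is "de" and args.options is { sensitivity: "accent", numeric: true }, so the resulting collator treats "a" and "A" as equal.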
