Support #4577

Understanding collation order and the respect for spaces

Added by Ken Holman 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: Low
Assignee:
Category: Localization
Sprint/Milestone: -
Start date: 2020-06-06
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:

Description

Hi, folks!

In my UBL work I'm switching from Unicode code-point sort order to a language sort order, so that capitalized abbreviations sort alongside leading-capital words. Unicode order puts upper case ahead of lower case, so something has to change.

However, when I change to a language sort, the spaces are ignored in the collation. I'm using Saxon 9 HE. Is there a collation that respects spaces, sorting them ahead of letters? I've also tried strength=primary and strength=secondary but, again, the spaces are ignored.

Below is a transcript of the attached test file. I'm trying to get "ISPS Requirements" to sort after "Invoice Line".

Thank you for your guidance!

. . . . . Ken

~/t $ cat sort.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsd"
  version="2.0">

<xsl:output method="text"/>
  
<xsl:template match="/">
  <xsl:variable name="strings" as="xsd:string*"
    select="('Request For Tender Line','Requested Tender Total',
             'Order Reference', 'Ordered Shipment',
             'Invoice Line', 'ISPS Requirements', 'Item',
             'Tender Result', 'Tendered Project',
             'Transportation Segment', 'Transport Schedule')"/>
  <xsl:text>Unicode order&#xa;&#xa;</xsl:text>
  <xsl:for-each select="$strings">
    <xsl:sort/>
    <xsl:value-of select="."/><xsl:text>&#xa;</xsl:text>
  </xsl:for-each>  
  <xsl:text>&#xa;Collation order </xsl:text>
  <xsl:text>http://saxon.sf.net/collation?lang=en&#xa;&#xa;</xsl:text>
  <xsl:for-each select="$strings">
    <xsl:sort
           default-collation="http://saxon.sf.net/collation?lang=en"/>
    <xsl:value-of select="."/><xsl:text>&#xa;</xsl:text>
  </xsl:for-each>  
  
  
</xsl:template>
  
</xsl:stylesheet>
~/t $ xslt2 sort.xsl sort.xsl 
Unicode order

ISPS Requirements
Invoice Line
Item
Order Reference
Ordered Shipment
Request For Tender Line
Requested Tender Total
Tender Result
Tendered Project
Transport Schedule
Transportation Segment

Collation order http://saxon.sf.net/collation?lang=en

Invoice Line
ISPS Requirements
Item
Ordered Shipment
Order Reference
Requested Tender Total
Request For Tender Line
Tendered Project
Tender Result
Transportation Segment
Transport Schedule
~/t $ 

Attachment: sort.xsl (1.1 KB), Ken Holman, 2020-06-06 02:16

History

#1 Updated by Ken Holman 2 months ago

P.S. The reason I'm focusing on default-collation= is that my actual code uses ">" and "<" comparisons rather than an actual xsl:sort.

#2 Updated by Michael Kay 2 months ago

First point is that in Saxon-HE, we're using collations supplied by the underlying platform, presumably Java in this case. That's different from PE/EE, where we will use ICU4J if available. So for this example you're going to get the standard Java Collator for lang=en, which, it appears, ignores spaces.

So there are two steps to solving your problem: first, work out how to customise the Java Collator. Second, work out how to use the customised Collator in Saxon.

A StackOverflow post (https://stackoverflow.com/questions/16567287/java-collation-ignores-space) addresses the Java side of this, with some useful links. It shows how to create a Java RuleBasedCollator that does what you want. (Caveat: I haven't tried this, and I know that playing in this area can be frustrating.)

There are several ways you can integrate this RuleBasedCollator into Saxon. One is to use a collation URI that has "rules=..." in its query part, but to be honest, that's pretty clunky. A cleaner way is probably with Configuration.setCollationURIResolver(...), where you can supply a CollationURIResolver that recognises your own collation URIs. It's required to return a StringCollator, which is a Saxon interface, and one of the implementations is RuleBasedSubstringMatcher, which wraps a Java RuleBasedCollator. (If you're given a collation URI you don't recognize, you can delegate back to Saxon's StandardCollationURIResolver.)

#3 Updated by Michael Kay 2 months ago

If you're using UCA collations, by the way, I think the syntax would be

collation="http://www.w3.org/2013/collation/UCA?alternate=non-ignorable"

(Intuitive, isn't it? I18N experts live in a world of their own...)

But I don't think this will be recognised in Saxon-HE.
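
For reference, a minimal sketch of how that URI might be plugged into an xsl:sort like the one in the test file, assuming an XSLT 3.0 processor. The lang=en and fallback=no parameters are additions for illustration and were not part of the suggestion above. As I read the XSLT 3.0 specification, fallback=no asks the processor to report an error rather than quietly substitute another collation, which should make it easy to see whether a given Saxon edition recognises the request.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: whether the UCA parameters are honoured depends on the
     processor and edition in use. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsd"
  version="3.0">

<xsl:output method="text"/>

<xsl:template match="/">
  <xsl:variable name="strings" as="xsd:string*"
    select="('Order Reference', 'Ordered Shipment',
             'Invoice Line', 'ISPS Requirements', 'Item')"/>
  <xsl:for-each select="$strings">
    <!-- alternate=non-ignorable: spaces take part in the comparison.
         lang=en and fallback=no are assumptions added for illustration. -->
    <xsl:sort
      collation="http://www.w3.org/2013/collation/UCA?lang=en;alternate=non-ignorable;fallback=no"/>
    <xsl:value-of select="."/><xsl:text>&#xa;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

If the processor rejects the URI, removing fallback=no should let it fall back to whatever collation it can provide, at the cost of not knowing which parameters were honoured.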

#4 Updated by Ken Holman 2 months ago

Thank you for taking from your time, Michael, to spell this out for me. Not being a Java programmer, my options are limited. And the work I'm doing is made publicly available for committee use. The committee will have to live with Unicode code-point order.

It is very generous of you to help explain this. I hope you and yours stay safe and well.

. . . . . Ken

#5 Updated by Michael Kay 2 months ago

Looking at this

Requested Tender Total
Request For Tender Line

Another way of thinking about it is as word-by-word collation. You can achieve that in 3.0 using fn:sort($things, $collation, function($thing){tokenize($thing)}) which is basically sorting on the first word, then sorting on the second word if the first word is the same, and so on.

Give that a try.
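
For a stylesheet that has to stay on XSLT 2.0, the same word-by-word idea can probably be approximated with one xsl:sort per word position. This is a minimal sketch, not the fn:sort formulation above. It assumes words are separated by whitespace and that no string has more than four words; the number of sort keys is an arbitrary choice to fit the test data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsd"
  version="2.0">

<xsl:output method="text"/>

<xsl:template match="/">
  <xsl:variable name="strings" as="xsd:string*"
    select="('Request For Tender Line', 'Requested Tender Total',
             'Order Reference', 'Ordered Shipment',
             'Invoice Line', 'ISPS Requirements', 'Item')"/>
  <xsl:for-each select="$strings">
    <!-- One sort key per word position: the second key only matters when
         the first words compare equal, and so on. A missing word yields an
         empty sort key, which sorts ahead of any non-empty value. -->
    <xsl:sort select="tokenize(., '\s+')[1]"
              collation="http://saxon.sf.net/collation?lang=en"/>
    <xsl:sort select="tokenize(., '\s+')[2]"
              collation="http://saxon.sf.net/collation?lang=en"/>
    <xsl:sort select="tokenize(., '\s+')[3]"
              collation="http://saxon.sf.net/collation?lang=en"/>
    <xsl:sort select="tokenize(., '\s+')[4]"
              collation="http://saxon.sf.net/collation?lang=en"/>
    <xsl:value-of select="."/><xsl:text>&#xa;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

Because each key is a single word, the collation never sees a space, so whether it ignores spaces no longer matters. The intent is that "Order Reference" lands ahead of "Ordered Shipment" while "Invoice Line" still precedes "ISPS Requirements".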

#6 Updated by Michael Kay about 1 month ago

  • Category set to Localization
  • Status changed from New to Closed
  • Assignee set to Michael Kay
