Support #4577
Closed
Understanding collation order and the respect for spaces
Description
Hi, folks!
In my UBL work I'm switching from Unicode code-point sort order to a language-based sort order so that capitalized abbreviations sort alongside leading-capital words. Unicode code-point order puts upper case ahead of lower case, so something has to change.
However, when I change to a language sort, the spaces are ignored in the collation. I'm using Saxon 9 HE. Is there a collation that respects spaces and sorts them ahead of letters? I've also tried strength=primary and strength=secondary but, again, the spaces are ignored.
Below is a transcript of the attached test file. I'm trying to get "ISPS Requirements" to sort after "Invoice Line".
Thank you for your guidance!
. . . . . Ken
~/t $ cat sort.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xsd"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:variable name="strings" as="xsd:string*"
select="('Request For Tender Line','Requested Tender Total',
'Order Reference', 'Ordered Shipment',
'Invoice Line', 'ISPS Requirements', 'Item',
'Tender Result', 'Tendered Project',
'Transportation Segment', 'Transport Schedule')"/>
<xsl:text>Unicode order

</xsl:text>
<xsl:for-each select="$strings">
<xsl:sort/>
<xsl:value-of select="."/><xsl:text>
</xsl:text>
</xsl:for-each>
<xsl:text>
Collation order </xsl:text>
<xsl:text>http://saxon.sf.net/collation?lang=en

</xsl:text>
<xsl:for-each select="$strings">
<xsl:sort
default-collation="http://saxon.sf.net/collation?lang=en"/>
<xsl:value-of select="."/><xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
~/t $ xslt2 sort.xsl sort.xsl
Unicode order
ISPS Requirements
Invoice Line
Item
Order Reference
Ordered Shipment
Request For Tender Line
Requested Tender Total
Tender Result
Tendered Project
Transport Schedule
Transportation Segment
Collation order http://saxon.sf.net/collation?lang=en
Invoice Line
ISPS Requirements
Item
Ordered Shipment
Order Reference
Requested Tender Total
Request For Tender Line
Tendered Project
Tender Result
Transportation Segment
Transport Schedule
~/t $
Updated by Ken Holman over 4 years ago
p.s. the reason I'm focusing on default-collation= is that in my actual code I'm using ">" and "<" comparisons and not doing an actual xsl:sort.
Updated by Michael Kay over 4 years ago
First point is that in Saxon-HE, we're using collations supplied by the underlying platform, presumably Java in this case. That's different from PE/EE, where we will use ICU4J if available. So for this example you're going to get the standard Java Collator for lang=en, which, it appears, ignores spaces.
So there are two steps to solving your problem: first, work out how to customise the Java Collator. Second, work out how to use the customised Collator in Saxon.
A StackOverflow post that addresses the Java side of this, with some useful links, can be found here: https://stackoverflow.com/questions/16567287/java-collation-ignores-space This shows how to create a Java RuleBasedCollator that does what you want. (Caveat: I haven't tried this, and I know that playing in this area can be frustrating.)
There are several ways you can integrate this RuleBasedCollator into Saxon. One is to use a collation URI that has "rules=..." in its query part, but to be honest, that's pretty clunky. A cleaner way is probably with Configuration.setCollationURIResolver(...), where you can supply a CollationURIResolver that recognises your own collation URIs. It's required to return a StringCollator, which is a Saxon interface, and one of the implementations is RuleBasedSubstringMatcher, which wraps a Java RuleBasedCollator.
(If you're given a collation URI you don't recognize, you can delegate back to Saxon's StandardCollationURIResolver.)
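For the first of the two steps (customising the Java Collator), here is a minimal sketch using only the JDK. It assumes that the platform's English collator is a RuleBasedCollator and that its default rules introduce underscore as the first primary-weighted character with the entry <'_'. Both are assumptions about the JDK's built-in rules, not something confirmed in this thread. The class name SpaceAwareCollator and its methods are hypothetical names chosen for illustration.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

final class SpaceAwareCollator {

    /** The platform's English collator; assumed here to be a RuleBasedCollator. */
    static RuleBasedCollator platformEnglish() {
        return (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
    }

    /**
     * Builds a collator in which a space carries a primary weight lower than any letter.
     * Assumption: the default rules contain the entry <'_' for underscore.
     */
    static RuleBasedCollator spaceAware() {
        String rules = platformEnglish().getRules();
        // Re-declare space as a primary-level character immediately before underscore.
        String patched = rules.replace("<'_'", "<' '<'_'");
        if (patched.equals(rules)) {
            // The assumption about the default rules did not hold on this JDK.
            throw new IllegalStateException("Default rules lack the expected underscore entry");
        }
        try {
            return new RuleBasedCollator(patched);
        } catch (ParseException e) {
            throw new IllegalStateException("Patched collation rules did not parse", e);
        }
    }

    public static void main(String[] args) {
        String a = "Request For Tender Line";
        String b = "Requested Tender Total";
        // A negative sign means a sorts before b under that collator.
        System.out.println("platform:    " + Integer.signum(platformEnglish().compare(a, b)));
        System.out.println("space-aware: " + Integer.signum(spaceAware().compare(a, b)));
    }
}
```

The resulting collator would still have to be handed to Saxon, for example through a CollationURIResolver as described above. That second step needs the Saxon API on the classpath and is not shown here.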
Updated by Michael Kay over 4 years ago
If you're using UCA collations, by the way, I think the syntax would be
collation="http://www.w3.org/2013/collation/UCA?alternate=non-ignorable"
(Intuitive, isn't it? I18N experts live in a world of their own...)
But I don't think this will be recognised in Saxon-HE.
Updated by Ken Holman over 4 years ago
Thank you for taking from your time, Michael, to spell this out for me. Not being a Java programmer, my options are limited. And the work I'm doing is made publicly available for committee use. The committee will have to live with Unicode code-point order.
It is very generous of you to help explain this. I hope you and yours stay safe and well.
. . . . . Ken
Updated by Michael Kay over 4 years ago
Looking at this
Requested Tender Total
Request For Tender Line
Another way of thinking of it is as word-by-word collation. You can achieve that in 3.0 using fn:sort($things, $collation, function($thing){tokenize($thing)}), which is basically sorting on the first word, then sorting on the second word if the first word is the same, and so on.
Give that a try.
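To illustrate what word-by-word collation does with the strings from the test file, here is a small sketch of the same idea in plain Java rather than XPath. It splits each phrase on whitespace and compares the words pairwise with the platform's English Collator, so the spaces never reach the collator at all. The class name WordByWordSort and its methods are hypothetical, and this is only an approximation of the fn:sort call above, not Saxon's implementation.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class WordByWordSort {

    /** Compares phrases word by word; a phrase that is a prefix of another sorts first. */
    static Comparator<String> wordByWord(Collator collator) {
        return (a, b) -> {
            String[] wordsA = a.trim().split("\\s+");
            String[] wordsB = b.trim().split("\\s+");
            int shared = Math.min(wordsA.length, wordsB.length);
            for (int i = 0; i < shared; i++) {
                int result = collator.compare(wordsA[i], wordsB[i]);
                if (result != 0) {
                    return result; // first differing word decides the order
                }
            }
            return Integer.compare(wordsA.length, wordsB.length);
        };
    }

    /** Returns a sorted copy of the phrases using the English platform collator. */
    static List<String> sortWordByWord(List<String> phrases) {
        List<String> sorted = new ArrayList<>(phrases);
        sorted.sort(wordByWord(Collator.getInstance(Locale.ENGLISH)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList(
                "Request For Tender Line", "Requested Tender Total",
                "Order Reference", "Ordered Shipment",
                "Invoice Line", "ISPS Requirements", "Item",
                "Tender Result", "Tendered Project",
                "Transportation Segment", "Transport Schedule");
        sortWordByWord(strings).forEach(System.out::println);
    }
}
```

With this comparator "Invoice Line" comes before "ISPS Requirements" because the collator treats case as a minor difference, and "Request For Tender Line" comes before "Requested Tender Total" because "Request" is compared against "Requested" as a whole word.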
Updated by Michael Kay over 4 years ago
- Category set to Localization
- Status changed from New to Closed
- Assignee set to Michael Kay