Support #6384

closed

Inconsistent results for substring-after() in XQuery

Added by Mircea Enachescu 8 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date: 2024-04-08
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch: 12
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms: Java

Description

The result of the expression "substring-after($who, '#')" differs between Saxon-HE and Saxon-EE when it is used in an XQuery transformation. For the input string "#GA", Saxon-HE returns "GA" while Saxon-EE leaves the input string unaltered. Files for reproducing the behavior are attached. Please note that the difference does not occur in XSLT, where both editions of Saxon return the expected result.


Files

substring-after.xquery (643 Bytes) Mircea Enachescu, 2024-04-08 14:16
substring-after.xml (132 Bytes) Mircea Enachescu, 2024-04-08 14:16
substring-after.xsl (670 Bytes) Mircea Enachescu, 2024-04-08 14:16
Actions #1

Updated by Michael Kay 8 months ago

The XQuery code declares a default collation, which the XSLT code doesn't. It's almost certainly the collation that accounts for the EE/HE difference, because Saxon-EE uses ICU for collation support whereas Saxon-HE uses the native JDK libraries.

The semantics of substring functions in the presence of a default collation are fairly peculiar because they cause certain characters to be treated as ignorable. Unless you really want a collation-sensitive substring, I would avoid this area.
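
One way to sidestep the collation-sensitive behaviour, sketched here as an illustration rather than taken from the attached files, is to pass the Unicode codepoint collation explicitly as the third argument. The variable $who and its value "#GA" are assumed from the report; the collation URI is the standard one defined by XPath Functions and Operators.

```xquery
xquery version "3.1";

(: Hypothetical stand-in for the value described in the report. :)
declare variable $who as xs:string := "#GA";

(: The explicit third argument overrides any default collation declared
   in the query prolog, so '#' is compared codepoint by codepoint and is
   not treated as ignorable. :)
substring-after($who, "#",
    "http://www.w3.org/2005/xpath-functions/collation/codepoint")
```

With the codepoint collation the result is "GA", and it should not depend on which collation library the Saxon edition uses. Removing the default collation declaration from the query prolog would be another option if no collation-sensitive comparison is needed elsewhere in the query.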

Actions #2

Updated by Michael Kay 8 months ago

It's worth reading

https://www.w3.org/TR/xpath-functions-31/#substring.functions

for an explanation of what is happening here.

The substring-after function says:

The function returns the substring of the value of $arg1 that follows in the value of $arg1 the first occurrence of a sequence of collation units that provides a minimal match to the collation units of $arg2 according to the collation that is used.

If the second argument of substring-after() is a character (such as '#') that is ignored for collation purposes, then the sequence of collation units for the second argument is empty, which means that it matches at the beginning of the string, which means that the entire string is returned.
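
The effect can be compared with what the specification defines for a zero-length second argument, which is the situation an ignorable character effectively reduces to. The following is a small illustrative sketch, not the attached query; the codepoint collation is passed explicitly so that the result does not depend on any default collation.

```xquery
xquery version "3.1";

(: A zero-length $arg2 matches at the start of $arg1, so the whole of
   $arg1 is returned. An $arg2 whose characters are all ignorable under
   the collation in use yields an empty sequence of collation units and
   ends up in the same position. :)
substring-after("#GA", "",
    "http://www.w3.org/2005/xpath-functions/collation/codepoint")
```

This returns "#GA", the same unaltered string that Saxon-EE produced in the report.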

Actions #3

Updated by Mircea Enachescu 8 months ago

Thanks for the explanation!

Actions #4

Updated by Michael Kay 8 months ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed
