Bug #3429

closed

Regular expression in fn:replace does not match (but should)

Added by Stefan Pöschel over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: XPath conformance
Sprint/Milestone: -
Start date: 2017-09-06
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.8
Fix Committed on Branch: 9.8
Fixed in Maintenance Release:
Platforms:

Description

Hello,

I have an XSLT stylesheet with a regex that extracts the unit from a numeric value, e.g. "10%" results in "%".

With e.g. Saxon HE 9.6.0.7 the result is as expected:

<?xml version="1.0" encoding="UTF-8"?>
<unit>%</unit>

With Saxon HE 9.8.0.4 however the result is different:

<?xml version="1.0" encoding="UTF-8"?>
<unit>10%</unit>

OS is Linux, but another user experiences the problem under Windows as well.

I attached a minimal example XSLT, which may also be used as its own input; command line is:

java -jar saxon9he.jar -s:regex_min.xsl -xsl:regex_min.xsl

Files

regex_min.xsl (300 Bytes), Stefan Pöschel, 2017-09-06 16:43

Actions #1

Updated by Michael Kay over 6 years ago

Confirmed that there appears to be a regression here between 9.7 and 9.8.

Actions #2

Updated by Michael Kay over 6 years ago

Added to QT3 test suite as test fn-replace-56.

Actions #3

Updated by Michael Kay over 6 years ago

  • Category set to XPath conformance
  • Status changed from New to Resolved
  • Assignee set to Michael Kay

The Saxon regex engine, given a sequence containing a repeatable term (\d*) followed by another term (\.?), attempts to establish whether the boundary is unambiguous: that is, whether, given a particular character in the input, it is possible to determine unambiguously whether it belongs to the first term or the second. Because this eliminates the need for backtracking, it can deliver substantial performance improvements; the test for unambiguity was therefore improved in 9.8. But it has wrongly decided that this case is unambiguous: although a digit cannot match the second term (\.), it can match the third (\d+), and the second term is allowed to be empty.
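
To illustrate why backtracking matters here, the following is a rough sketch in Python rather than Saxon. The pattern is an assumption pieced together from the three terms named above (the actual regex is in the attached regex_min.xsl and may differ), and the second term is assumed to be an escaped dot. The first function lets the engine backtrack into \d*; the second emulates a non-backtracking first term using a lookahead plus backreference. Like fn:replace, Python's re.sub returns the input unchanged when nothing matches, which would be consistent with the "10%" result reported for 9.8.0.4.

```python
import re

# Hypothetical pattern assembled from the terms \d*, \.? and \d+ discussed
# above; the trailing (.*) capturing the unit is an assumption.
BACKTRACKING = r"^\d*\.?\d+(.*)$"

# Same pattern, but the first term is made atomic: the lookahead captures the
# longest run of digits and the backreference consumes exactly that run, so
# the engine can never give a digit back to \d+.
NO_BACKTRACKING = r"^(?=(\d*))\1\.?\d+(.*)$"


def extract_unit(value: str) -> str:
    """Strip the leading number, allowing backtracking into \\d*."""
    return re.sub(BACKTRACKING, r"\1", value)


def extract_unit_no_backtrack(value: str) -> str:
    """Same replacement, but \\d* keeps every digit it consumed."""
    return re.sub(NO_BACKTRACKING, r"\2", value)


print(extract_unit("10%"))               # % (\d* gives "0" back to \d+)
print(extract_unit_no_backtrack("10%"))  # 10% (no match, input unchanged)
```

With an input such as "2.5kg" both variants agree, because the dot is present and \d+ has its own digit to match; the difference only shows up when the optional middle term is empty.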

Ideally we should check whether a character that matches the Nth term can also match any subsequent term, allowing for the fact that some of the subsequent terms can match an empty string. For the present, however, I will fix it so that the match is considered ambiguous if the second term allows a repeat count of zero.
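
The two checks described above can be sketched roughly as follows. This is not Saxon's code: the Term class, the function names, and the small sample alphabet are all hypothetical, and each term is reduced to a per-character predicate plus a minimum repeat count. The sketch also includes the immediate-neighbour check that, by the description above, appears to have been the faulty one.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence


@dataclass(frozen=True)
class Term:
    """Hypothetical model of one regex term."""
    matches: Callable[[str], bool]  # does the term accept this character?
    min_occurs: int                 # 0 means the term may match nothing


def reachable_followers(terms: Sequence[Term], n: int) -> Iterator[Term]:
    """Terms that could consume the character right after term n,
    skipping over any term that is allowed to be empty."""
    for term in terms[n + 1:]:
        yield term
        if term.min_occurs > 0:
            break


def unambiguous_naive(terms: Sequence[Term], n: int, alphabet: str) -> bool:
    """Only compares term n with its immediate neighbour."""
    nxt = terms[n + 1]
    return not any(terms[n].matches(c) and nxt.matches(c) for c in alphabet)


def unambiguous_interim(terms: Sequence[Term], n: int, alphabet: str) -> bool:
    """Interim fix: give up if the next term allows a repeat count of zero."""
    if terms[n + 1].min_occurs == 0:
        return False
    return unambiguous_naive(terms, n, alphabet)


def unambiguous_full(terms: Sequence[Term], n: int, alphabet: str) -> bool:
    """The 'ideal' check: look at every term reachable through empty terms."""
    return not any(
        terms[n].matches(c) and follower.matches(c)
        for follower in reachable_followers(terms, n)
        for c in alphabet
    )


# The three terms from this issue: \d*, \.?, \d+
TERMS = [
    Term(str.isdigit, 0),
    Term(lambda c: c == ".", 0),
    Term(str.isdigit, 1),
]
ALPHABET = "0123456789.%"  # small sample alphabet, for illustration only

print(unambiguous_naive(TERMS, 0, ALPHABET))    # True  (the wrong answer)
print(unambiguous_interim(TERMS, 0, ALPHABET))  # False
print(unambiguous_full(TERMS, 0, ALPHABET))     # False
```

The interim rule is more conservative than the full check: it also reports ambiguity in cases where the terms after the optional one share no characters with term n, which costs some backtracking but never produces a wrong match.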

Actions #4

Updated by Stefan Pöschel over 6 years ago

Thank you!

Actions #5

Updated by O'Neil Delpratt over 6 years ago

  • % Done changed from 0 to 100
  • Fix Committed on Branch 9.8 added
Actions #6

Updated by O'Neil Delpratt over 6 years ago

  • Status changed from Resolved to Closed
  • Fixed in Maintenance Release 9.8.0.5 added

Bug fix applied in the Saxon 9.8.0.5 maintenance release.
