Bug #3429: Regular expression in fn:replace does not match (but should) - Saxon - Saxonica Developer Community

Bug #3429

closed

Regular expression in fn:replace does not match (but should)

Added by Stefan Pöschel about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee: Michael Kay
Category: XPath conformance
Sprint/Milestone:
Start date: 2017-09-06
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.8
Fix Committed on Branch: 9.8
Fixed in Maintenance Release: 9.8.0.5
Platforms:

Description

Hello,

I have an XSLT stylesheet with a regex that extracts the unit from a numeric value, e.g. "10%" results in "%".

With e.g. Saxon HE 9.6.0.7 the result is as expected:

<?xml version="1.0" encoding="UTF-8"?>
<unit>%</unit>

With Saxon HE 9.8.0.4 however the result is different:

<?xml version="1.0" encoding="UTF-8"?>
<unit>10%</unit>

OS is Linux, but another user experiences the problem under Windows as well.

I attached a minimal example XSLT, which may also be used as its own input; command line is:

java -jar saxon9he.jar -s:regex_min.xsl -xsl:regex_min.xsl

Files

regex_min.xsl (300 Bytes)

Stefan Pöschel, 2017-09-06 16:43

Updated by Michael Kay about 7 years ago

Confirmed that there appears to be a regression here between 9.7 and 9.8.

Updated by Michael Kay about 7 years ago

Added to QT3 test suite as test fn-replace-56.

Updated by Michael Kay about 7 years ago

Category set to XPath conformance
Status changed from New to Resolved
Assignee set to Michael Kay

The Saxon regex engine, given a sequence containing a repeatable term (\d*) followed by another term (\.?), attempts to establish whether the boundary is unambiguous: that is, whether, given a particular character in the input, it is possible to determine unambiguously whether it belongs to the first term or the second. Because this eliminates the need for backtracking, it can deliver substantial performance improvements, and the test for unambiguity was therefore improved in 9.8. But it has wrongly decided that this case is unambiguous: although a digit cannot match the second term (\.), it can match the third (\d+), and the second term is allowed to be empty.
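To make the symptom concrete, here is a minimal Python sketch. It is not Saxon's code, and the pattern is an assumption: the attached stylesheet's regex is not quoted in this thread, so the sketch uses a hypothetical pattern of the shape described above, with the second term read as an escaped dot. The helper `match_without_backtracking` is likewise hypothetical. It shows that a backtracking engine extracts "%" from "10%", while a matcher that commits to the greedy `\d*` fails to match, in which case a replace returns the input unchanged.

```python
import re

# Hypothetical pattern, assumed for illustration only: a repeatable digit
# term, an optional dot, a mandatory digit term, then the captured unit.
PATTERN = r"^\d*\.?\d+(.*)$"

ASCII_DIGITS = "0123456789"


def extract_unit(value: str) -> str:
    # Like fn:replace, re.sub returns the input unchanged if nothing matches.
    return re.sub(PATTERN, r"\1", value)


def match_without_backtracking(value: str) -> bool:
    # Commit-as-you-go matcher for the same three leading terms.
    pos = 0
    # Term 1 (digits, zero or more): take every digit and never give one back.
    while pos < len(value) and value[pos] in ASCII_DIGITS:
        pos += 1
    # Term 2 (optional dot): take one dot if present.
    if pos < len(value) and value[pos] == ".":
        pos += 1
    # Term 3 (digits, one or more): needs at least one digit at this point.
    start = pos
    while pos < len(value) and value[pos] in ASCII_DIGITS:
        pos += 1
    return pos > start


print(extract_unit("10%"))                 # % (the engine backtracks)
print(match_without_backtracking("10%"))   # False (term 1 consumed both digits)
print(match_without_backtracking("1.5%"))  # True (the dot separates the digit runs)
```

The "10%" case fails without backtracking precisely because the optional middle term matches nothing, leaving the first and third terms competing for the same digits.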

Ideally we should check whether a character that matches the Nth term can also match any subsequent term, allowing for the fact that some of the subsequent terms can match an empty string. For the present, however, I will fix it so that the match is considered ambiguous if the second term allows a repeat count of zero.
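The difference between the flawed test, the interim fix, and the ideal check can be sketched as follows. This is an illustrative Python model, not Saxon's implementation: the `Term` class, the three function names, and the representation of a term as a character set plus a minimum repeat count are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    chars: frozenset  # characters the term can match
    min_count: int    # minimum repeat count; 0 means the term may be empty


DIGITS = frozenset("0123456789")
LETTERS = frozenset("abcdefghijklmnopqrstuvwxyz")


def unambiguous_adjacent_only(terms, n):
    # Shape of the flawed test: compare term n with its immediate successor only.
    return terms[n].chars.isdisjoint(terms[n + 1].chars)


def unambiguous_interim_fix(terms, n):
    # Interim fix: treat the boundary as ambiguous whenever the following
    # term allows a repeat count of zero.
    following = terms[n + 1]
    return following.min_count > 0 and terms[n].chars.isdisjoint(following.chars)


def unambiguous_ideal(terms, n):
    # Ideal check: look past every following term that can match empty.
    for later in terms[n + 1:]:
        if not terms[n].chars.isdisjoint(later.chars):
            return False
        if later.min_count > 0:
            return True  # a mandatory, disjoint term ends the lookahead
    return True


# digits*  dot?  digits+   (the shape discussed in this bug)
NUMBER_TERMS = [Term(DIGITS, 0), Term(frozenset("."), 0), Term(DIGITS, 1)]
# digits*  dot?  letters+  (safe, but the interim fix is conservative here)
LABEL_TERMS = [Term(DIGITS, 0), Term(frozenset("."), 0), Term(LETTERS, 1)]

print(unambiguous_adjacent_only(NUMBER_TERMS, 0))  # True  (the wrong answer)
print(unambiguous_interim_fix(NUMBER_TERMS, 0))    # False
print(unambiguous_ideal(NUMBER_TERMS, 0))          # False
print(unambiguous_interim_fix(LABEL_TERMS, 0))     # False (conservative)
print(unambiguous_ideal(LABEL_TERMS, 0))           # True
```

Under these assumptions the interim fix costs only performance: it may fall back to backtracking where the ideal check would prove it unnecessary, but it never skips backtracking where it is needed.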
