Bug #2166: Wrong result from fn:replace - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #2166

closed

Wrong result from fn:replace

Added by Michael Kay over 10 years ago. Updated about 9 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

XPath conformance

Sprint/Milestone:

9.6.0.2

Start date:

2014-10-07

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

9.6

Fix Committed on Branch:

9.6

Fixed in Maintenance Release:

9.6.0.2

Platforms:

Description

Reported by Gunther Rademacher:

This expression,

 replace("12-34", "^\d+(-(\d+))?$", "$2")

with Saxon 9.6, returns "3", while 9.5 returned the correct

result of "34".

Actions

Copy link

Updated by Michael Kay over 10 years ago

Investigation so far: if we reinstate the commented out code at Operation line 1033, it works. So we need to find out why it was commented out.

What's happening here is that we look for a match of group 2 (\d+) at position 3, and find one, setting the value of group 2 to "34". For some reason, the code then backtracks, looking for further matches; it finds another match of group 2 at "3" and sets this as the value of the group. The match at "3" is ultimately unsuccessful but by then the damage has been done.

In this particular case the end of the repetition is unambiguous (it's defined by the next match in sequence, "$") so it's not clear why we are backtracking at all.

Actions

Copy link

Updated by Michael Kay over 10 years ago

The test for unambiguous repeats is only handling very simple cases, in particular it's not handling any case where a sequence is repeated, such as (-\d+)?$. This is not relevant to the problem, however, as it needs to work in cases where the repeat is ambiguous as well.

The root cause of this problem is pretty deep. We're essentially searching a tree of possible matches, and each operator is deciding how best to conduct its search locally. But the search has side-effects, of recording the positions of captured groups, and the final result is therefore sensitive to the ordering. There's no logic, apart from controlling the order of searches, that ensures that a captured group isn't recorded for a local match that turns out to be a "cul-de-sac", ie. one that doesn't result in a match for the whole expression. A nicer design would avoid doing side-effecting updates on the list of captured groups, and instead return this information up the tree of operations as part of the match data (currently the only thing returned in the match data is a single integer, the match position). That's rather a radical redesign to contemplate just now; instead we should concentrate on getting the order of searching right such that the successful search is always the last one.

Actions

Copy link

Updated by Michael Kay over 10 years ago

I think the problem is in ForceProgressiterator, which does a one-step lookahead. The lookahead is calling next() on the iterator for a subexpression, and the call on next() has the side-effect of setting the value of the matching group. So the group is set to the characters that would be matched next if the successful match were unsuccessful.

ForceProgressiterator is a wrapping iterator that ensures that a zero-length match is only returned once at any position, in a construct like (A?)*. I confirmed that removing the ForceProgressIterator in this case solves the problem. (But of course for other regular expressions it is needed to prevent an infinite loop).

I've tried a rewritten ForceProgressiterator that doesn't do lookahead, and iI'm seeing two XSLT tests to fail: analyze-string-050 and regex-059. On investigation it turned out these failures could be attributed to an unrelated "improvement" to the optimization code that I made earlier while investigating this bug. With this "improvement" reverted, I get a clean run on both the QT3 and XSLT3 test suites, so I will commit the fix. I will add the test case to QT3 fn-replace.

Actions

Copy link