Wrong result from fn:replace
Reported by Gunther Rademacher:
replace("12-34", "^\d+(-(\d+))?$", "$2")
with Saxon 9.6, returns "3", while 9.5 returned the correct
result of "34".
#1 Updated by Michael Kay almost 5 years ago
Investigation so far: if we reinstate the commented out code at Operation line 1033, it works. So we need to find out why it was commented out.
What's happening here is that we look for a match of group 2 (\d+) at position 3, and find one, setting the value of group 2 to "34". For some reason, the code then backtracks, looking for further matches; it finds another match of group 2 at "3" and sets this as the value of the group. The match at "3" is ultimately unsuccessful but by then the damage has been done.
In this particular case the end of the repetition is unambiguous (it's defined by the next match in sequence, "$") so it's not clear why we are backtracking at all.
#2 Updated by Michael Kay almost 5 years ago
The test for unambiguous repeats is only handling very simple cases, in particular it's not handling any case where a sequence is repeated, such as (-\d+)?$. This is not relevant to the problem, however, as it needs to work in cases where the repeat is ambiguous as well.
The root cause of this problem is pretty deep. We're essentially searching a tree of possible matches, and each operator is deciding how best to conduct its search locally. But the search has side-effects, of recording the positions of captured groups, and the final result is therefore sensitive to the ordering. There's no logic, apart from controlling the order of searches, that ensures that a captured group isn't recorded for a local match that turns out to be a "cul-de-sac", ie. one that doesn't result in a match for the whole expression. A nicer design would avoid doing side-effecting updates on the list of captured groups, and instead return this information up the tree of operations as part of the match data (currently the only thing returned in the match data is a single integer, the match position). That's rather a radical redesign to contemplate just now; instead we should concentrate on getting the order of searching right such that the successful search is always the last one.
#3 Updated by Michael Kay almost 5 years ago
I think the problem is in ForceProgressiterator, which does a one-step lookahead. The lookahead is calling next() on the iterator for a subexpression, and the call on next() has the side-effect of setting the value of the matching group. So the group is set to the characters that would be matched next if the successful match were unsuccessful.
ForceProgressiterator is a wrapping iterator that ensures that a zero-length match is only returned once at any position, in a construct like (A?)*. I confirmed that removing the ForceProgressIterator in this case solves the problem. (But of course for other regular expressions it is needed to prevent an infinite loop).
I've tried a rewritten ForceProgressiterator that doesn't do lookahead, and iI'm seeing two XSLT tests to fail: analyze-string-050 and regex-059. On investigation it turned out these failures could be attributed to an unrelated "improvement" to the optimization code that I made earlier while investigating this bug. With this "improvement" reverted, I get a clean run on both the QT3 and XSLT3 test suites, so I will commit the fix. I will add the test case to QT3 fn-replace.
Please register to edit this issue