Bug #2419
closedSaxon transformation never ends when processing regular expressions
100%
Description
Hello,
An oXygen user reported a problem regarding Saxon transformation that never ends. I reproduced the problem and I created a simple XSL and XML that are attached to this issue.
The problem it seems to be when regular expressions are processed. Also the XML file contains a paragraph with some German characters.
I reproduced this with the latest Saxon 9.6.0.6., but it works fine with Saxon 9.5.1.7.
Best Regards,
Octavian
Files
Updated by Michael Kay over 9 years ago
- Status changed from New to In Progress
It does theoretically terminate, as can be established by trying it on a much shorter string. The problem is that the backtracking makes it exponential in the length of the string. That is, if matching a one-character string takes 1 microsecond, then matching a 137 character string takes 2^137 microseconds, which takes you well beyond the expected life of the solar system.
I'll have a look to see if there is an optimization that can be applied. An obvious one is to extract the precondition matches(., '\d{4}'), but that only helps for strings that don't match this precondition. On the other hand, a string that does match '\d{4}' will match quickly anyway, because the backtracking never kicks in.
The backtracking algorithm was rewritten in 9.6, largely to avoid exploding the amount of stack space needed rather than to reduce time. There's a space-time trade-off involved here and we may have swung too far the other way...
There's almost certainly a way of rewriting the regex that will perform much better. As far as I can see the leading pattern ^(\S+[\s,])+ matches every string that starts with a non-space character, that is, it's equivalent to ^\S. The problem is that a typical string matches the pattern in many different ways, and we are trying them all.
Another approach we could use here is to augment the "history" mechanism that we currently use to avoid looping when we try to match a zero-length string repeatedly at the same position. Again there's a bit of space-time tradeoff here, but if we were to remember the substrings that ^(\S+[\s,]*)+ has already matched, then we don't need to repeat the same match attempt during backtracking. But this might only work in cases where there's no need to remember the captured substrings.
Updated by Michael Kay over 9 years ago
The effect is described here:
Updated by Michael Kay over 9 years ago
And the classic article on the dangers of backtracking in regex engines is here:
https://swtch.com/~rsc/regexp/regexp1.html
(It talks about the benefits of the alternative Thompson NFA algorithm, but the problem is that this doesn't handle backreferences and captured groups).
And the use of "history" to reduce the cost of backtracking (actually to reduce the worst-case cost at some expense in average-case cost) is described here:
Updated by Michael Kay over 9 years ago
- Category set to Performance
- Assignee set to Michael Kay
- Found in version set to 9.6
For 9.7 I have added a system of preconditions to the regex matcher. In this particular case (regex='^(\S+[\s,]*)+\d{4}'), the pattern \d{4} is recognized as a precondition, and if this does not match somewhere in the input string, the entire regex returns false. This avoids the exponential backtracking that otherwise occurs in the non-matching case.
Updated by Michael Kay about 9 years ago
- Status changed from In Progress to Resolved
- Fixed in version set to 9.7
Marking this as "resolved" on the basis that improvements have been made in 9.7 which will benefit some cases including this particular one. No change for 9.6.
Updated by O'Neil Delpratt almost 9 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in version changed from 9.7 to 9.6.0.8
Updated by O'Neil Delpratt almost 9 years ago
- Fixed in version deleted (
9.6.0.8)
Bug fix went out in the 9.7 release only
Please register to edit this issue