Bug #1991
closedrecursion error, or stack overflow, on match() or analyze-string
0%
Description
I'm getting fatal errors and exceptions with anchored regular expressions when running 9.5.1-4 which I did not get with earlier versions; it applies to both HE and EE as shipped in Oxygen 15.2 and also to HE as downloaded today.
Exception in thread "main" java.lang.StackOverflowError
at net.sf.saxon.z.IntHashMap.hash(IntHashMap.java:180)
at net.sf.saxon.z.IntHashMap.indexOf(IntHashMap.java:184)
at net.sf.saxon.z.IntHashMap.get(IntHashMap.java:90)
at net.sf.saxon.regex.REMatcher.beenHereBefore(REMatcher.java:458)
at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:228)
at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)
at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)
at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)
at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)
at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)
at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)
at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)
and so on.
<xsl:value-of select="replace($boy, '.m[^,],.*', 'blue', 's')" />
where $boy is a few kilobytes in length.
I've enclosed a greatly stripped down example that still goes wrong.
Running with java -Xss512m does make the problem go away for this input.
Files
Related issues
Updated by Michael Kay almost 11 years ago
Thanks for reporting it. This is an inefficiency rather than a bug. In 9.5 Saxon introduced a new regex engine (a partially-rewritten version of the Apache Jakarta engine) customized to handle the XPath regex dialect, replacing the previous mechanism where XPath regular expressions were translated to Java regular expressions. Unfortunately the optimization in this engine is not very sophisticated. It handles backtracking using recursion and this can lead to deep recursion proportional to the length of the input string. Part of the work in improving the engine for Saxon was to reduce the use of recursion in cases where no backtracking will be needed because the regex is unambiguous, so you can usually avoid the problem by eliminating unnecessary ambiguities in the regex. In your example I think the regex could be rewritten as replace($boy, '[^m]m[^,],.*', 'blue', 's')" which would hopefully solve the problem.
Updated by Liam Quin almost 11 years ago
Michael Kay wrote:
In your example I think the regex could be rewritten as replace($boy, '[^m]m[^,],.*', 'blue', 's')" which would hopefully solve the problem.
It might (thank you), but unfortunately wouldn't solve the real-world example - this was a cut-down version.
Updated by Michael Kay almost 11 years ago
Clearly a crude backtracking implementation will fail on this unless there is some optimization that prevents the failure. I haven't found any description of an optimization that would prevent this failing in the literature, so the only way to find out why JDK doesn't fail is probably to study the source code, which isn't a very attractive prospect. There are always going to be some expressions that one engine optimizes and another doesn't. One possibility here is that the engine could detect that if the subexpression "m[^,]*," is present anywhere in the string, then the entire string will be replaced; detecting the presence of a substring that matches this subexpression should be very fast. That's an easy optimization but it seems very focused on this particular case (especially as it depends on the "s" flag for its correctness) and it's not clear that it will fire often enough to make it worth implementing.
Updated by Michael Kay almost 11 years ago
- Status changed from New to In Progress
I've taken a look at how the JDK regex engine handles this, and basically it does the backtracking without such deep recursion. Given the sequence ".m", it first swallows the whole input sequence with the ., then discovers that "m" doesn't match, then goes backwards one character at a time until if finds a position at which "m" (or rather, the rest of the regex) does match. So it's basically doing one recursive call per atom in the regex, rather than one per character in the input.
Updated by Michael Kay almost 11 years ago
- Category set to Performance
- Assignee set to Michael Kay
I've been taking a fresh look at the design of the Jakarta engine (which I had modified in many details when incorporating it into Saxon, without changing the basic design). The design makes it quite hard to fix this because the engine keeps very little state information: essentially only a stack of which instruction is executing and the current position in the input; it's a very pure finite state machine.
Updated by Michael Kay over 10 years ago
I've implemented changes to the regex engine to improve the stack usage: see http://dev.saxonica.com/blog/mike/2014/02/another-regex-rewrite.html. These have been running successfully in the 9.6 development branch for a while. I will consider retrofitting the redesigned code to the 9.5 branch.
Updated by Michael Kay over 10 years ago
- Status changed from In Progress to Resolved
Marking this resolved because the plan is now to do nothing further; the new code will appear in 9.6 and will not be retrofitted to 9.5.
Updated by O'Neil Delpratt almost 10 years ago
- Status changed from Resolved to Closed
Please register to edit this issue