Bug #1991: recursion error, or stack overflow, on match() or analyze-string - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #1991

closed

recursion error, or stack overflow, on match() or analyze-string

Added by Liam Quin over 10 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

Low

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2014-01-28

Due date:

% Done:

Estimated time:

Legacy ID:

Applies to branch:

Fix Committed on Branch:

Fixed in Maintenance Release:

Platforms:

Description

I'm getting fatal errors and exceptions with anchored regular expressions when running 9.5.1-4 which I did not get with earlier versions; it applies to both HE and EE as shipped in Oxygen 15.2 and also to HE as downloaded today.

Exception in thread "main" java.lang.StackOverflowError

    at net.sf.saxon.z.IntHashMap.hash(IntHashMap.java:180)

    at net.sf.saxon.z.IntHashMap.indexOf(IntHashMap.java:184)

    at net.sf.saxon.z.IntHashMap.get(IntHashMap.java:90)

    at net.sf.saxon.regex.REMatcher.beenHereBefore(REMatcher.java:458)

    at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:228)

    at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)

    at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)

    at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)

    at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)

    at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)

    at net.sf.saxon.regex.Operation$OpStar.exec(Operation.java:235)

    at net.sf.saxon.regex.REMatcher.matchNodes(REMatcher.java:413)

and so on.

<xsl:value-of select="replace($boy, '.m[^,],.*', 'blue', 's')" />

where $boy is a few kilobytes in length.

I've enclosed a greatly stripped down example that still goes wrong.

Running with java -Xss512m does make the problem go away for this input.

Files

saxonbug.xsl (7.79 KB) saxonbug.xsl

Liam Quin, 2014-01-29 02:29

Related issues

Actions

Copy link

Updated by Michael Kay over 10 years ago

Thanks for reporting it. This is an inefficiency rather than a bug. In 9.5 Saxon introduced a new regex engine (a partially-rewritten version of the Apache Jakarta engine) customized to handle the XPath regex dialect, replacing the previous mechanism where XPath regular expressions were translated to Java regular expressions. Unfortunately the optimization in this engine is not very sophisticated. It handles backtracking using recursion and this can lead to deep recursion proportional to the length of the input string. Part of the work in improving the engine for Saxon was to reduce the use of recursion in cases where no backtracking will be needed because the regex is unambiguous, so you can usually avoid the problem by eliminating unnecessary ambiguities in the regex. In your example I think the regex could be rewritten as replace($boy, '[^m]m[^,],.*', 'blue', 's')" which would hopefully solve the problem.

Actions

Copy link

Updated by Liam Quin over 10 years ago

Michael Kay wrote:

In your example I think the regex could be rewritten as replace($boy, '[^m]m[^,],.*', 'blue', 's')" which would hopefully solve the problem.

It might (thank you), but unfortunately wouldn't solve the real-world example - this was a cut-down version.

Actions

Copy link

Updated by Michael Kay over 10 years ago

Clearly a crude backtracking implementation will fail on this unless there is some optimization that prevents the failure. I haven't found any description of an optimization that would prevent this failing in the literature, so the only way to find out why JDK doesn't fail is probably to study the source code, which isn't a very attractive prospect. There are always going to be some expressions that one engine optimizes and another doesn't. One possibility here is that the engine could detect that if the subexpression "m[^,]*," is present anywhere in the string, then the entire string will be replaced; detecting the presence of a substring that matches this subexpression should be very fast. That's an easy optimization but it seems very focused on this particular case (especially as it depends on the "s" flag for its correctness) and it's not clear that it will fire often enough to make it worth implementing.

Actions

Copy link

Updated by Michael Kay over 10 years ago

Status changed from New to In Progress

I've taken a look at how the JDK regex engine handles this, and basically it does the backtracking without such deep recursion. Given the sequence ".m", it first swallows the whole input sequence with the ., then discovers that "m" doesn't match, then goes backwards one character at a time until if finds a position at which "m" (or rather, the rest of the regex) does match. So it's basically doing one recursive call per atom in the regex, rather than one per character in the input.

Actions

Copy link

Updated by Michael Kay over 10 years ago

Category set to Performance
Assignee set to Michael Kay

I've been taking a fresh look at the design of the Jakarta engine (which I had modified in many details when incorporating it into Saxon, without changing the basic design). The design makes it quite hard to fix this because the engine keeps very little state information: essentially only a stack of which instruction is executing and the current position in the input; it's a very pure finite state machine.

Actions

Copy link

Updated by Michael Kay over 10 years ago

I've implemented changes to the regex engine to improve the stack usage: see http://dev.saxonica.com/blog/mike/2014/02/another-regex-rewrite.html. These have been running successfully in the 9.6 development branch for a while. I will consider retrofitting the redesigned code to the 9.5 branch.

Actions

Copy link