Bug #2419: Saxon transformation never ends when processing regular expressions - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #2419

closed

Saxon transformation never ends when processing regular expressions

Added by Octavian Nadolu almost 9 years ago. Updated over 8 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2015-07-22

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

Fix Committed on Branch:

Fixed in Maintenance Release:

Platforms:

Description

Hello,

An oXygen user reported a problem regarding Saxon transformation that never ends. I reproduced the problem and I created a simple XSL and XML that are attached to this issue.

The problem it seems to be when regular expressions are processed. Also the XML file contains a paragraph with some German characters.

I reproduced this with the latest Saxon 9.6.0.6., but it works fine with Saxon 9.5.1.7.

Best Regards,

Octavian

Files

Download all files

quote.xsl (358 Bytes) quote.xsl		Octavian Nadolu, 2015-07-22 11:05
testQuote.xml (397 Bytes) testQuote.xml		Octavian Nadolu, 2015-07-22 11:05

Actions

Copy link

Updated by Michael Kay almost 9 years ago

Status changed from New to In Progress

It does theoretically terminate, as can be established by trying it on a much shorter string. The problem is that the backtracking makes it exponential in the length of the string. That is, if matching a one-character string takes 1 microsecond, then matching a 137 character string takes 2^137 microseconds, which takes you well beyond the expected life of the solar system.

I'll have a look to see if there is an optimization that can be applied. An obvious one is to extract the precondition matches(., '\d{4}'), but that only helps for strings that don't match this precondition. On the other hand, a string that does match '\d{4}' will match quickly anyway, because the backtracking never kicks in.

The backtracking algorithm was rewritten in 9.6, largely to avoid exploding the amount of stack space needed rather than to reduce time. There's a space-time trade-off involved here and we may have swung too far the other way...

There's almost certainly a way of rewriting the regex that will perform much better. As far as I can see the leading pattern ^(\S+[\s,])+ matches every string that starts with a non-space character, that is, it's equivalent to ^\S. The problem is that a typical string matches the pattern in many different ways, and we are trying them all.

Another approach we could use here is to augment the "history" mechanism that we currently use to avoid looping when we try to match a zero-length string repeatedly at the same position. Again there's a bit of space-time tradeoff here, but if we were to remember the substrings that ^(\S+[\s,]*)+ has already matched, then we don't need to repeat the same match attempt during backtracking. But this might only work in cases where there's no need to remember the captured substrings.

Actions

Copy link

Updated by Michael Kay almost 9 years ago

The effect is described here:

http://blog.codinghorror.com/regex-performance/

Actions

Copy link

Updated by Michael Kay almost 9 years ago

And the classic article on the dangers of backtracking in regex engines is here:

https://swtch.com/~rsc/regexp/regexp1.html

(It talks about the benefits of the alternative Thompson NFA algorithm, but the problem is that this doesn't handle backreferences and captured groups).

And the use of "history" to reduce the cost of backtracking (actually to reduce the worst-case cost at some expense in average-case cost) is described here:

http://perlmonks.org/index.pl?node_id=502408

Actions

Copy link

Updated by Michael Kay almost 9 years ago

Category set to Performance
Assignee set to Michael Kay
Found in version set to 9.6

For 9.7 I have added a system of preconditions to the regex matcher. In this particular case (regex='^(\S+[\s,]*)+\d{4}'), the pattern \d{4} is recognized as a precondition, and if this does not match somewhere in the input string, the entire regex returns false. This avoids the exponential backtracking that otherwise occurs in the non-matching case.

Actions

Copy link

Updated by Michael Kay over 8 years ago

Status changed from In Progress to Resolved
Fixed in version set to 9.7

Marking this as "resolved" on the basis that improvements have been made in 9.7 which will benefit some cases including this particular one. No change for 9.6.

Actions

Copy link