Bug #3494

closed

Bad analyze-string performance for regex with range quantifier

Added by Gerrit Imsieke over 6 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2017-10-23
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
9.8, trunk
Fix Committed on Branch:
trunk
Fixed in Maintenance Release:
Platforms:

Description

We recently saw a Schematron check that is supposed to detect typographical flaws run for half an hour instead of seconds. We tracked it down to an analyze-string with a weird regex:

(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}

The regex is supposed to extract text around a dash that is preceded by ordinary whitespace (rather than a non-breaking space, which is the typographical rule that should be enforced here).
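For context, the pathology is not specific to Saxon's engine; a minimal sketch using Python's re module (which backtracks over such patterns in a broadly similar way; note that re has no \p{Pd}, so the explicit dash alternation is kept) shows what the pattern is meant to find:

```python
import re

# The pattern from the report, verbatim; Python's re engine accepts the
# repeated (^|\w+) group and backtracks over it as well.
pattern = re.compile(r'(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}')

# On a short sentence with an en-dash preceded by ordinary whitespace,
# the match covers the text around the dash.
m = pattern.search('ein Wort – noch eins')
assert m is not None and '–' in m.group(0)
```

On short inputs this is instantaneous; the trouble described below only appears once the words before the dash get long.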

This regex is contained in the attached XSLT file as $regex1.

If you invoke the XSLT’s main template, you will see the computing time increase with each test sentence; the last sentence takes more than 10 minutes to analyze.

We reproduced this behaviour with Saxon PE 9.6.0.7, HE 9.7.0.8, and PE 9.8.0.5, while it finished within a second with Saxon HE 9.5.1.2.

The regex admittedly borders on the nonsensical. However, someone wrote it (while we were still using Saxon 9.5, by the way), and the poor performance went unnoticed until recently, when our customer needed to convert a document with 312 en-dashes preceded by whitespace.

If you change '(^|\w+){1,5}' to '(^|\w+)', performance improves dramatically, and replacing '(-|–|−)' with '\p{Pd}' speeds things up, too.

I believe '(^|\w+){1,5}' was intended to extract up to five words before the whitespace that precedes the dash. I changed the regex to what is included as $regex2 in the XSLT. I also reduced the maximum number of preceding words in $regex2 to three, which was significantly faster than '{1,5}'.
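Since the attachment is not reproduced here, the exact $regex2 is unknown; one plausible rewrite along the lines described moves the whitespace into the repeated group, so that every iteration must consume at least one space and there is no longer any ambiguity about where one word ends and the next begins:

```python
import re

# Hypothetical $regex2-style rewrite (the attached XSLT is not shown here):
# each (\w+\s+) iteration is anchored by mandatory whitespace, so the engine
# no longer explores every way of splitting one long word into 1-5 pieces.
slow = re.compile(r'(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}')
fast = re.compile(r'(\w+\s+){1,3}(-|–|−)(\s?|\w+).{1,5}')

sentence = 'ein Wort – noch eins'
assert slow.search(sentence) is not None
assert fast.search(sentence) is not None
```

Both patterns agree on this short sample; the rewritten one stays fast even when the words before the dash are very long.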

I hope that both regexes and the test sentences will be useful in analyzing this performance regression of the new regex engine. If you want to include a sentence in a test suite, maybe replace each letter with another one. Otherwise we might need to ask the publisher, author, or translator of the book, http://www.unionsverlag.com/info/title.asp?title_id=7793


Files

tei-sch.xsl (5.74 KB): -it:main; contains test document. Gerrit Imsieke, 2017-10-23 23:45
Actions #1

Updated by Michael Kay over 6 years ago

I think that worst-case regex performance is always going to be exponential in the input length; the only question is how to avoid this worst case as often as possible.

I'm surprised that replacing '(-|–|−)' with '\p{Pd}' should make any difference and I don't observe much effect from that change.

I'm not surprised that (^|\w+){1,5} should be inefficient and I'll see if there's any way the optimizer can be tweaked to recognise such cases. Reducing (\w+){1,5} to (\w+) is easy enough, but the "^" alternative complicates things a lot. I'm reluctant to put in an optimizer rule that will handle this special case and nothing else.
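The first part of that reduction is easy to check: ignoring captured groups, a run of word characters split into one to five chunks is still just a run of word characters, so the two patterns accept exactly the same strings. A quick sanity check (Python re used for illustration):

```python
import re

# (\w+){1,5} and \w+ accept the same strings when only match/no-match
# matters; they differ only in what the capturing group reports.
for s in ['x', 'Wort', 'Finanzstabilisierungsfazilität', 'two words', '']:
    assert (re.fullmatch(r'(\w+){1,5}', s) is None) == \
           (re.fullmatch(r'\w+', s) is None)
```

As the comment notes, it is precisely the capturing-group semantics (discussed below) that make this rewrite delicate, and the "^" alternative breaks the equivalence entirely.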

Between 9.5 and 9.6 we completely replaced the regex engine in Saxon, and it's not surprising that some cases optimised by one regex engine should not be optimised by another, and vice versa.

Actions #2

Updated by Michael Kay over 6 years ago

The capturing group also makes the rewrite more difficult: we don't know statically that regex-group() isn't being used and any rewrite of the regex therefore needs to preserve the semantics of capturing groups: that's a challenge. However, the spec gives implementations some flexibility: "For example given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero length string as the content of the captured subgroup."

Actions #3

Updated by Gerrit Imsieke over 6 years ago

Maybe, instead of trying to improve performance for pathological fringe cases, issue a warning when the algorithm detects that it’s taking “too many” steps in relation to the input length?

Actions #4

Updated by Michael Kay over 6 years ago

Just an observation: the performance here depends on the length of the longest word. In the last example appears the 30-character word Finanzstabilisierungsfazilität. The expression (^|\w+){1,5} is going to try and match this in something like 30 × 29 × 28 × 27 × 26 different ways; that is, the performance is O(N^5) where N is the length of the longest word. This is why changing it to {1,3} makes a big difference.

(Just realised that the way I wrote that is in German word order; clearly reading this text has activated my German-speaking brain circuits!)

The classic way that some regex engines avoid exponential performance in such cases is by performing a single scan of the input, keeping track of all the possible states in the FSA at each step. So after reading "F" the only possible counter state is (1,1) (one word, one character so far); after reading "Fi" they are (1,2) and (2,1); after "Fin" the possible combinations are (1,3), (2,1), (2,2), and (3,1). The number of possible states here is O(n^2) because there are two nested loops. The problem is that (a) although this approach avoids the worst-case performance problems, it's not better at handling typical cases, and (b) it's hard to see how to do it without a total rewrite.
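The single-scan idea can be sketched for (\w+){1,5} alone (a toy model, not Saxon's engine): track every reachable pair of (words matched so far, length of the current word) while reading one character at a time.

```python
# A minimal sketch of the state-set idea for (\w+){1,5}: scan the input once,
# tracking every reachable (words_matched, current_word_length) pair instead
# of exploring each decomposition separately. The state count stays O(n^2)
# even though the number of decompositions grows exponentially.
def scan(word, max_words=5):
    states = set()
    for ch in word:
        nxt = set()
        if not states:          # first character starts the first word
            nxt.add((1, 1))
        for words, length in states:
            nxt.add((words, length + 1))       # extend the current word
            if words < max_words:
                nxt.add((words + 1, 1))        # close it, start a new one
        states = nxt
    return states

# After "Fin": one word of 3 chars, 2+1, 1+2, or 1+1+1.
print(sorted(scan('Fin')))   # [(1, 3), (2, 1), (2, 2), (3, 1)]
```

Note that (2,1) stands for both a finished 2-character word followed by a 1-character word in progress, which is exactly why the state set stays small: many decompositions collapse into one state.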

Actions #5

Updated by Michael Kay over 6 years ago

  • Category set to Performance
  • Status changed from New to Resolved
  • Assignee set to Michael Kay
  • Priority changed from Low to Normal
  • Applies to branch trunk added
  • Applies to branch deleted (9.6, 9.7)
  • Fix Committed on Branch trunk added

I'm not going to produce a fix on the 9.8 branch - the code is working as designed.

For the next release, I've followed up the suggestion of detection and user advice. I've added a configurable option "regex backtracking limit" defaulting to 10 million (about 10 seconds running) and a hard failure if the limit is exceeded while evaluating a single regex: the user then has the options of changing the limit, removing the limit, or improving the regular expression.
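The limit mechanism can be illustrated with a toy step counter (the exception name, function, and numbers below are invented for illustration; in Saxon the real control is a configuration option, not this API):

```python
class BacktrackingLimitExceeded(Exception):
    pass

def count_splits(n, max_words=5, limit=10_000):
    """Count the ways to split an n-character word into 1..max_words chunks,
    failing hard once the step budget is spent (a toy analogue of the
    configurable regex backtracking limit)."""
    steps = 0
    def go(n, words_left):
        nonlocal steps
        steps += 1
        if steps > limit:
            raise BacktrackingLimitExceeded(f'more than {limit} steps')
        if n == 0:
            return 1
        if words_left == 0:
            return 0
        return sum(go(n - k, words_left - 1) for k in range(1, n + 1))
    return go(n, max_words)

count_splits(10)                   # small words stay within the budget
try:
    count_splits(200, limit=1000)  # a long word trips the limit
except BacktrackingLimitExceeded:
    pass
```

The trade-off is the one described above: a hard failure is noisy, but unlike a silent half-hour run it tells the user exactly which knob to turn or which regex to fix.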

Actions #6

Updated by O'Neil Delpratt over 5 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.9.0.1 added

Bug fix applied in the Saxon 9.9.0.1 major release.
