Bug #3494

Bad analyze-string performance for regex with range quantifier

Added by Gerrit Imsieke over 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Performance
Sprint/Milestone: -
Start date: 2017-10-23
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.8, trunk
Fix Committed on Branch: trunk
Fixed in Maintenance Release:
Platforms:

Description

We recently saw a Schematron check that is supposed to detect typographical flaws run for half an hour instead of seconds. We tracked it down to an analyze-string with a weird regex:

(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}

The regex is supposed to extract text around a dash that is preceded by whitespace (as opposed to non-breaking space, which is the typographical rule that should be enforced here).
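To make the behaviour concrete, here is a minimal sketch that runs the regex on a short sentence. It uses Python's `re` module rather than Saxon, so it says nothing about Saxon's timing; the sample sentence is made up, and the two non-ASCII dashes are assumed to be U+2013 (en dash) and U+2212 (minus sign).

```python
import re

# The regex from the report. The two non-ASCII dashes are written as
# escapes; the code points (U+2013, U+2212) are an assumption.
REGEX1 = re.compile(r"(^|\w+){1,5}\s+(-|\u2013|\u2212)(\s?|\w+).{1,5}")

# Hypothetical test sentence: an en dash preceded by an ordinary space.
sample = "one two \u2013 three four"

match = REGEX1.search(sample)
print(repr(match.group(0)))  # 'two – three'
print(repr(match.group(1)))  # 'two'
```

Only one preceding word ends up in the match: `\w+` cannot cross the space between "one" and "two", so the `{1,5}` repetition never collects several whitespace-separated words.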

This regex is included in the attached XSLT file as $regex1.

If you invoke the XSLT’s main template, you will see an increase in computing time. The last sentence takes more than 10 minutes to analyze.

We reproduced this behaviour with Saxon PE 9.6.0.7, HE 9.7.0.8, and PE 9.8.0.5, while it finished within a second with Saxon HE 9.5.1.2.

The regex admittedly borders on the nonsensical. However, someone wrote it (while we were still using Saxon 9.5, btw), and the poor performance went unnoticed until recently, when our customer needed to convert a document with 312 en dashes preceded by whitespace.

If you change '(^|\w+){1,5}' to '(^|\w+)', performance improves dramatically, and replacing '(-|–|−)' with '\p{Pd}' speeds things up, too.
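One caveat on the '\p{Pd}' replacement: it may not match exactly the same characters as the original alternation. The sketch below looks up the Unicode general category of each dash with Python's `unicodedata`. It assumes the third alternative in the regex is U+2212 MINUS SIGN; U+2014 EM DASH is added only for comparison.

```python
import unicodedata

# Characters from the alternation (code points assumed), plus EM DASH
# as an example of a dash the original alternation does not list.
dashes = ["-", "\u2013", "\u2212", "\u2014"]

categories = {ch: unicodedata.category(ch) for ch in dashes}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Under that assumption, the minus sign is in category Sm rather than Pd, so '\p{Pd}' would stop matching it, while it would start matching other dashes such as the em dash. If the minus sign must still be flagged, a character class such as '[\p{Pd}−]' may be closer to the original intent.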

I believe '(^|\w+){1,5}' was intended to extract up to five words before the whitespace that precedes the dash. I changed the regex to what is included as $regex2 in the XSLT. I also reduced the maximum number of preceding words in $regex2 to three, which was significantly faster than '{1,5}'.
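As a rough illustration of why the upper bound of the range quantifier could matter, the sketch below counts the ways a single word can be cut into consecutive non-empty `\w+` pieces. It assumes a backtracking engine that may try each such cut before giving up on a word that is not followed by whitespace and a dash. Whether Saxon's engine behaves this way is not established in this report, and the count ignores the empty `^` alternative and the different start positions. The function name `word_splits` is made up for this example.

```python
from math import comb

def word_splits(n: int, max_pieces: int) -> int:
    """Number of ways to cut an n-letter word into 1..max_pieces
    consecutive non-empty pieces (j pieces need j-1 cut points
    chosen among the n-1 gaps between letters)."""
    return sum(comb(n - 1, j - 1) for j in range(1, max_pieces + 1))

# Compare no repetition, {1,3} and {1,5} for a ten-letter word.
for k in (1, 3, 5):
    print(k, word_splits(10, k))
# 1 1
# 3 46
# 5 256
```

Under these assumptions, a ten-letter word has 256 candidate cuts with '{1,5}', 46 with '{1,3}', and a single one without the range quantifier.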

I hope that both regexes and the test sentences will be useful in analyzing this performance regression of the new regex engine. If you want to include a sentence in a test suite, maybe replace each letter with another one. Otherwise we might need to ask the publisher, author, or translator of the book, http://www.unionsverlag.com/info/title.asp?title_id=7793


Files

tei-sch.xsl (5.74 KB) -it:main; contains test document. Gerrit Imsieke, 2017-10-23 23:45
