Bug #3494
closed
Bad analyze-string performance for regex with range quantifier
100%
Description
We recently saw a Schematron check that is supposed to detect typographical flaws run for half an hour instead of seconds. We tracked it down to an analyze-string with a weird regex:
(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}
The regex is supposed to extract text around a dash that is preceded by whitespace (as opposed to non-breaking space, which is the typographical rule that should be enforced here).
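To make the intended extraction concrete, here is a minimal sketch that applies the regex above to short sample strings. It uses Python's `re` module purely as a stand-in: this is not Saxon's regex engine, Python's `\s` and `\w` are not identical to the XPath definitions, and the sample sentences and the helper name `context_around_dash` are made up for illustration.

```python
import re

# The regex from this report, copied verbatim. The three dash alternatives are
# hyphen-minus, en dash (U+2013) and minus sign (U+2212).
REGEX1 = re.compile(r"(^|\w+){1,5}\s+(-|–|−)(\s?|\w+).{1,5}")


def context_around_dash(text):
    """Return the first match of REGEX1 in text, or None if there is none."""
    match = REGEX1.search(text)
    return match.group(0) if match else None


# Hypothetical sample sentence: an en dash preceded by an ordinary space.
sample = "one two three \u2013 four"
print(context_around_dash(sample))  # prints: three – four
```

Note that only the single word directly before the whitespace is captured, even though the quantifier allows up to five repetitions: consecutive `\w+` matches are adjacent, so the group cannot step over the spaces between words.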
This regex is contained in the attached XSLT file as $regex1.
If you invoke the XSLT’s main template, you will see the computing time grow from one test sentence to the next; the last sentence takes more than 10 minutes to analyze.
We reproduced this behaviour with Saxon PE 9.6.0.7, HE 9.7.0.8, and PE 9.8.0.5, whereas the same transformation finished within a second with Saxon HE 9.5.1.2.
The regex admittedly borders on the nonsensical. However, someone wrote it (while we were still using Saxon 9.5, btw), and the poor performance went unnoticed until recently, when our customer needed to convert a document with 312 en-dashes preceded by whitespace.
If you change '(^|\w+){1,5}' to '(^|\w+)', the performance improves dramatically, and replacing '(-|–|−)' with '\p{Pd}' speeds things up, too.
I believe '(^|\w+){1,5}' was intended to extract up to five words before the whitespace that precedes the dash. I changed the regex to what is included as $regex2 in the XSLT. I also reduced the maximum number of preceding words in $regex2 to three, which was significantly faster than '{1,5}'.
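One possible explanation for why the quantifier matters so much: since '\w+' is repeated without any separator, '(^|\w+){1,5}' cannot span several words, but a backtracking matcher may try to cut a single word into up to five adjacent pieces. The sketch below counts those cuts. It is only a back-of-the-envelope illustration in Python; the helper name `split_count` is hypothetical, the '^' alternative is ignored, and whether Saxon's engine actually explores every such split is an assumption, not something verified here.

```python
from math import comb


def split_count(word_length, max_pieces):
    """Count the ways to cut a word of word_length characters into
    between 1 and max_pieces non-empty consecutive pieces."""
    # Cutting into k pieces means choosing k - 1 cut points
    # among the word_length - 1 gaps between characters.
    return sum(comb(word_length - 1, k - 1) for k in range(1, max_pieces + 1))


# Columns: word length, then counts for no quantifier, {1,3} and {1,5}.
for n in (5, 10, 20):
    print(n, split_count(n, 1), split_count(n, 3), split_count(n, 5))
# prints:
# 5 1 11 16
# 10 1 46 256
# 20 1 191 5036
```

Under this assumption, a failed match attempt at a 20-letter word could involve thousands of splits with '{1,5}', under two hundred with '{1,3}', and one without the quantifier, which is at least consistent with the timings described above.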
I hope that both regexes and the test sentences will be useful in analyzing this performance regression of the new regex engine. If you want to include a sentence in a test suite, please consider replacing each letter with another one; otherwise we might need to ask the publisher, author, or translator of the book for permission: http://www.unionsverlag.com/info/title.asp?title_id=7793
Files