Bug #3494: Bad analyze-string performance for regex with range quantifier - Saxon - Saxonica Developer Community

Bad analyze-string performance for regex with range quantifier

Added by Gerrit Imsieke over 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee: Michael Kay
Category: Performance
Sprint/Milestone:
Start date: 2017-10-23
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.8, trunk
Fix Committed on Branch: trunk
Fixed in Maintenance Release: 9.9.0.1
Platforms:

Description

We recently saw a Schematron check that is supposed to detect typographical flaws run for half an hour instead of seconds. We tracked it down to an analyze-string with a weird regex:

(^|\w+){1,5}\s+(-|&#x2013;|&#x2212;)(\s?|\w+).{1,5}

The regex is supposed to extract text around a dash that is preceded by whitespace (as opposed to non-breaking space, which is the typographical rule that should be enforced here).

This regex is contained in the attached XSLT file as $regex1.

If you invoke the XSLT’s main template, you will see the computing time increase from sentence to sentence; the last sentence takes more than 10 minutes to analyze.

We reproduced this behaviour with Saxon PE 9.6.0.7, HE 9.7.0.8, and PE 9.8.0.5, while it finished within a second with Saxon HE 9.5.1.2.

The regex admittedly borders on the nonsensical. However, someone wrote it (while we were still using Saxon 9.5, btw), and the poor performance went unnoticed until recently, when our customer needed to convert a document with 312 en-dashes preceded by whitespace.

If you change '(^|\w+){1,5}' to '(^|\w+)', the performance improves dramatically, and replacing '(-|–|−)' with '\p{Pd}' speeds things up, too.
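
One plausible reason the range quantifier hurts so much, assuming a backtracking matcher (the report does not say how Saxon's engine actually handles this case): '(\w+){1,5}' can cut a single word into one to five non-empty pieces in many different ways, and each cut is a candidate the matcher may have to try before giving up. The following Python sketch only counts those cuts; the function name is made up for illustration.

```python
from math import comb


def ways_to_split(word_length, max_parts=5):
    """Count the ways to cut a word of word_length characters
    into 1..max_parts non-empty consecutive pieces."""
    # Choosing k pieces means choosing k - 1 cut points
    # among the word_length - 1 gaps between characters.
    return sum(comb(word_length - 1, k - 1) for k in range(1, max_parts + 1))


print(ways_to_split(10, max_parts=1))  # 1: '(\w+)' without the range quantifier
print(ways_to_split(10))               # 256
print(ways_to_split(20))               # 5036
```

With a single '(\w+)' there is exactly one way to consume a word, which is consistent with the speed-up observed when the '{1,5}' is dropped.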

I believe '(^|\w+){1,5}' was intended to extract up to five words before the whitespace that precedes the dash. I changed the regex to what is included as $regex2 in the XSLT. I also reduced the maximum number of preceding words in $regex2 to three, which was significantly faster than '{1,5}'.
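
To illustrate the intent described above, here is a minimal Python sketch of a pattern that takes up to three whole words before the whitespace-plus-dash. It is not the $regex2 from the attached stylesheet, which is not reproduced here; the names WS, DASH, CONTEXT and dash_contexts are hypothetical, and unlike the original it drops the '^' alternative, so it requires at least one preceding word.

```python
import re

# XPath-style \s (space, tab, newline, carriage return).
# Python's own \s would also match the non-breaking space U+00A0,
# which is exactly the character the typographical rule asks for.
WS = r"[ \t\n\r]"
DASH = "[-\u2013\u2212]"  # hyphen-minus, en dash, minus sign

# One to three words, each followed by whitespace, then the dash,
# an optional space, and up to five characters of trailing context.
CONTEXT = re.compile(r"(?:\w+" + WS + r"+){1,3}" + DASH + WS + r"?.{1,5}")


def dash_contexts(text):
    """Return the text around every dash that is preceded by ordinary whitespace."""
    return [m.group(0) for m in CONTEXT.finditer(text)]


print(dash_contexts("Er kam spät nach Hause – und schlief."))
# ['spät nach Hause – und s']
print(dash_contexts("Er kam nach Hause\u00a0– und schlief."))
# []
```

Because a word and the whitespace after it cannot overlap, each repetition of the group can consume a given word in only one way, so the matcher has far fewer alternatives to explore than with '(^|\w+){1,5}'.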

I hope that both regexes and the test sentences will be useful in analyzing this performance regression of the new regex engine. If you want to include a sentence in a test suite, maybe replace each letter with another one. Otherwise we might need to ask the publisher, author, or translator of the book, http://www.unionsverlag.com/info/title.asp?title_id=7793

Files

tei-sch.xsl (5.74 KB)

-it:main; contains test document

Gerrit Imsieke, 2017-10-23 23:45

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Gerrit Imsieke over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by O'Neil Delpratt almost 6 years ago