Bug #4240: ssllooww matching of HUGE regexp - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #4240

closed

ssllooww matching of HUGE regexp

Added by Syd Bauman over 5 years ago. Updated about 5 years ago.

Status:

Won't fix

Priority:

Low

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2019-06-24

Due date:

% Done:

Estimated time:

Legacy ID:

Applies to branch:

Fix Committed on Branch:

Fixed in Maintenance Release:

Platforms:

Description

The attached XSLT 3.0 program (also valid XSLT 2.0) is designed to be run with itself as the input file. Its purpose is to state whether or not the value of any @selector attribute found is a valid CSS3 selector. (It differs mildly from the CSS3 spec, but that's not important here.) It does this by comparing each @selector against an enormous regular expression using the matches() function with the 'i' switch.

When I ran some tests (using Saxon-HE 9.8.0.11J, I think) Saxon took a very long time to validate even a small number of @selector attributes. (40 @selectors in ~03:48 according to time on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz). I did not use "-opt:e".

By contrast, jing processed a RELAX NG grammar that tested the same regular expression against the same 40 @selectors almost instantaneously (although it does not have an 'i' switch to set).

Sadly I did not record a lot of the details (like which 40 of the ~5900 test @selector attributes were processed). I am running some more tests now, but may not have the results available for awhile, so thought I should post this w/o waiting, as I told O'Neil I would submit this almost 2 weeks ago. (BTW, I have no idea if this should be assigned to him or not—I am just making sure to let him know it has been submitted. :-)

For further information see the paper at https://markupuk.org/webhelp/Syd_selector_regex.html?hl=bauman

Files

CSS3_selector_checker.xslt (558 KB) CSS3_selector_checker.xslt

XSLT pgm to run against itself while you go out for lunch

Syd Bauman, 2019-06-25 04:12

Actions

Copy link

Updated by Michael Kay over 5 years ago

Thanks for submitting this.

All regular expression evaluators "go exponential" in some worst cases involving extreme amounts of backtracking, but with some algorithms the "worst cases" arise more frequently than others. Some algorithms punt on evaluating multiple branches in parallel, which tends to use more memory in simple cases but make it less likely that evaluation will take exponential time in more complex cases.

We've put some effort over the years into eliminating some of these problems in the Saxon regex engine (which is derived from the Apache Jakarta engine) but we're fully aware it's not the world's best regex engine. For cases it doesn't handle well, it's worth trying the switch to use the standard JDK regex engine instead (add ";j" to the flags argument). Of course, that way, the regex semantics are no longer conformant with the XPath rules.

I'll take a look at this example, but it's so complex that it may be very difficult to work out exactly what's going on.

Very often it's not really the complexity of the regex that's the problem, but the size of the input. Even quite simple regular expressions can have performance that's exponential in the size of the input.

Actions

Copy link

Updated by Syd Bauman over 5 years ago

Thanks for the thoughts, Michael. Last night (after posting this) I tried to run the entire 5900 test file and (even with -Xmx6g) java (in particular, “OpenJDK Runtime Environment (8.0_212-b03) (build 1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03)” ran out of memory.

I can try changing the regex to use Java rules and using ':j', but probably shouldn’t spend time on that until some other projects are finished.

You will find it much easier to read and understand what the regex is doing if you read instead the source code that generated it: https://github.com/NEU-DSG/wwp-public-code-share/blob/master/tagsDecl/CSS3_selector_regex_generator.perl. But note that this is just a sketch of the actual intended program (which I have not written yet), so please read it in the spirit of dirty laundry, as it were.

As for the complexity of regex vs size of input issue I really don’t know, but it does look like the out-of-memory issue mentioned above occurred on one of the impossible long test cases (about 80% through 1,025 of .span, )

Actions

Copy link

Updated by Michael Kay over 5 years ago

Status changed from New to Won't fix
Assignee changed from O'Neil Delpratt to Michael Kay

I'm sorry Syd, but I don't think it's going to be a productive use of time to investigate this. We know that the algorithms used by the Saxon regex engine will "go exponential" in worst case scenarios due to excessive backtracking, and if we're going to improve the algorithms we need much simpler test cases to investigate, tests that demonstrate the problem but are still tractable in terms of examining the detailed evaluation strategy. This one is just too big for human comprehension. Regretfully, I'm therefore going to close it as unresolved.

Actions

Copy link