Bug #4240
closedssllooww matching of HUGE regexp
0%
Description
The attached XSLT 3.0 program (also valid XSLT 2.0) is designed to be run with itself as the input file. Its purpose is to state whether or not the value of any @selector attribute found is a valid CSS3 selector. (It differs mildly from the CSS3 spec, but that's not important here.) It does this by comparing each @selector against an enormous regular expression using the matches() function with the 'i' switch.
When I ran some tests (using Saxon-HE 9.8.0.11J, I think) Saxon took a very long time to validate even a small number of @selector attributes. (40 @selectors in ~03:48 according to time
on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz). I did not use "-opt:e".
By contrast, jing
processed a RELAX NG grammar that tested the same regular expression against the same 40 @selectors almost instantaneously (although it does not have an 'i' switch to set).
Sadly I did not record a lot of the details (like which 40 of the ~5900 test @selector attributes were processed). I am running some more tests now, but may not have the results available for awhile, so thought I should post this w/o waiting, as I told O'Neil I would submit this almost 2 weeks ago. (BTW, I have no idea if this should be assigned to him or not—I am just making sure to let him know it has been submitted. :-)
For further information see the paper at https://markupuk.org/webhelp/Syd_selector_regex.html?hl=bauman
Files
Updated by Michael Kay over 5 years ago
Thanks for submitting this.
All regular expression evaluators "go exponential" in some worst cases involving extreme amounts of backtracking, but with some algorithms the "worst cases" arise more frequently than others. Some algorithms punt on evaluating multiple branches in parallel, which tends to use more memory in simple cases but make it less likely that evaluation will take exponential time in more complex cases.
We've put some effort over the years into eliminating some of these problems in the Saxon regex engine (which is derived from the Apache Jakarta engine) but we're fully aware it's not the world's best regex engine. For cases it doesn't handle well, it's worth trying the switch to use the standard JDK regex engine instead (add ";j" to the flags argument). Of course, that way, the regex semantics are no longer conformant with the XPath rules.
I'll take a look at this example, but it's so complex that it may be very difficult to work out exactly what's going on.
Very often it's not really the complexity of the regex that's the problem, but the size of the input. Even quite simple regular expressions can have performance that's exponential in the size of the input.
Updated by Syd Bauman over 5 years ago
Thanks for the thoughts, Michael. Last night (after posting this) I tried to run the entire 5900 test file and (even with -Xmx6g) java (in particular, “OpenJDK Runtime Environment (8.0_212-b03) (build 1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03)” ran out of memory.
I can try changing the regex to use Java rules and using ':j', but probably shouldn’t spend time on that until some other projects are finished.
You will find it much easier to read and understand what the regex is doing if you read instead the source code that generated it: https://github.com/NEU-DSG/wwp-public-code-share/blob/master/tagsDecl/CSS3_selector_regex_generator.perl. But note that this is just a sketch of the actual intended program (which I have not written yet), so please read it in the spirit of dirty laundry, as it were.
As for the complexity of regex vs size of input issue I really don’t know, but it does look like the out-of-memory issue mentioned above occurred on one of the impossible long test cases (about 80% through 1,025 of .span,
)
Updated by Michael Kay over 5 years ago
- Status changed from New to Won't fix
- Assignee changed from O'Neil Delpratt to Michael Kay
I'm sorry Syd, but I don't think it's going to be a productive use of time to investigate this. We know that the algorithms used by the Saxon regex engine will "go exponential" in worst case scenarios due to excessive backtracking, and if we're going to improve the algorithms we need much simpler test cases to investigate, tests that demonstrate the problem but are still tractable in terms of examining the detailed evaluation strategy. This one is just too big for human comprehension. Regretfully, I'm therefore going to close it as unresolved.
Updated by Syd Bauman about 5 years ago
No worries. If I come across a regexp that “explodes” but is more reasonable for humans, will let you know.
Please register to edit this issue