Project

Profile

Help

Bug #3055

closed

Out of memory when applying a regular expression

Added by Radu Coravu almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2016-12-01
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
9.7
Fix Committed on Branch:
trunk
Fixed in Maintenance Release:
Platforms:

Description

Basically when the attached XML is transform using the attached XSLT (which seems to be a stylesheet created by Dimitre Novatchev) the regular expression engine used by Saxon seems to consume the entire Java memory and fails.


Files

novatchev-getLines-fails.xsl (2.71 KB) novatchev-getLines-fails.xsl Radu Coravu, 2016-12-01 08:30
testGetLinesdata.xml (766 Bytes) testGetLinesdata.xml Radu Coravu, 2016-12-01 08:30
Actions #1

Updated by Michael Kay almost 8 years ago

Reproduced; test case is in bugs/2016/novatchev

The tricky bit seems to be ((\W*\w*)*?) which can match a zero-length string in an infinite number of ways, though of course we try to deal with that...

Actions #2

Updated by Michael Kay almost 8 years ago

  • Category set to Performance
  • Status changed from New to Won't fix
  • Assignee set to Michael Kay

I don't think there is a bug here, merely a failure to optimize.

It seems a perverse regular expression: ^((\W*\w*)*?)(\W+\w*)$

Because \W and \w are complements of each other, the expression \W*\w* matches any sequence of characters. Adding another repeat operator to make it (\W*\w*)* means that it matches any sequence of sequences of characters, in many different ways. Without optimization, this is a large number of possibilities to test.

As far as I can see the regular expression ^.*?(\W+\w*)$ is equivalent, and this executes without problems.

The equivalent expression ^((\W+\w+)*?)(\W+\w*)$ also executes without problems.

Switching to the Java regex engine by using flags=";j" also works (and produces the same result).

The reason we are failing to optimize this is that we fail to recognize that \W and \w are disjoint. There seem to be a number of ways that the test for disjointness could be improved.

Actions #3

Updated by Michael Kay almost 8 years ago

  • Status changed from Won't fix to In Progress

I've done some investigation and I think this is probably too disruptive to attempt on the 9.7 branch but we could consider tackling it for 9.8.

Some of the changes needed to improve the test for whether two match operations are disjoint / unambiguous:

  • currently we only consider the initial character matched by a CharClass or an Atom. We should extend this to all operations, so there is a general method to determine the possible initial character matches for any term.

  • some of the CharClass categories are represented by custom IntPredicate implementations which makes it impossible to reason about them, e.g. to determine that \W is the complement of \w. To enable such reasoning, we should use IntSetPredicate or IntComplementPredicate wherever possible.

Actions #4

Updated by Michael Kay almost 8 years ago

  • Status changed from In Progress to Resolved

Fixed on the development branch for 9.8.

The fix involves changing the internal representation of regex terms such as \w from an IntPredicate to a CharacterClass; whereas an IntPredicate can only test for membership of a character in the set, CharacterClass supports additional operations notably determining whether two character classes are disjoint. This enables us to establish that in the sequence \W*\w* the first repeat is unambiguous because \W and \w are disjoint, so processing of the sequence will never have to backtrack (the sequence of \W characters unambiguously ends when a \w is encountered). This eliminates the backtracking which was the cause of the reported stack overflow.

Actions #5

Updated by O'Neil Delpratt almost 8 years ago

  • Applies to branch 9.8 added
  • Fix Committed on Branch 9.8 added
Actions #6

Updated by O'Neil Delpratt over 7 years ago

  • Fix Committed on Branch trunk added
  • Fix Committed on Branch deleted (9.8)
Actions #7

Updated by O'Neil Delpratt over 7 years ago

  • Applies to branch 9.7 added
  • Applies to branch deleted (9.8)
Actions #8

Updated by Michael Kay over 7 years ago

  • Status changed from Resolved to Closed

Please register to edit this issue

Also available in: Atom PDF