Project

Profile

Help

Bug #3787

Regex shouldn't match, but matches

Added by Stefan Pöschel about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
XPath conformance
Sprint/Milestone:
-
Start date:
2018-05-16
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
9.8, trunk
Fix Committed on Branch:
9.8, trunk
Fixed in Maintenance Release:

Description

I have a strange issue with an regex that matches - though it shouldn't. The regex is about matching a string that contains a certain amount of tokens, separated by a single space.

I tried to generalize/minimize the issue to the extent that it still occurs. In the example the token is just "x". The part regex that matches such a token could therefore also be "x". The original issue and possible tokens are a bit more complex, so I instead use "(\d*.)?x" to match a token. As neither digits nor full stops are contained in the checked string, the added optional part must not make a diference....but it does, in certain cases.

Please find attached an XSLT file that tries to match different versions of the regex. There are three pairs of regex, whereby always the first one shall not match and the second one shall match. The very first one of the six regexes shall not match as well - but it actually matches.

I checked with Saxon HE 9.8.0.12.

Possibly related/solved by #3782?

regex_test.xslt (2.39 KB) regex_test.xslt Stefan Pöschel, 2018-05-16 08:06

History

#1 Updated by Stefan Pöschel about 2 years ago

Used cmdline:

java -jar saxon9he.jar -s:regex_test.xslt -xsl:regex_test.xslt \!indent=true

#2 Updated by Michael Kay about 2 years ago

  • Category set to XPath conformance
  • Status changed from New to In Progress
  • Assignee set to Michael Kay

Reduced it to this query

matches(' x x x', '^( (\d*\.)?x){2}$')

which returns true when it should return false. Further simplification is elusive.

#3 Updated by Michael Kay about 2 years ago

Achieved a bit of extra simplification to:

matches(' x x x', '^(?: (?:a*b)?x){2}$')

(Non capturing groups, although more verbose, are simpler internally than capturing groups)

These things are very tough to debug. There's no obvious error in the static analysis of the regex. At run-time it appears to be doing some spurious backtracking. It correctly matches the ' x x', finds that the next thing isn't end-of-string, and then decides to try and match (a*b)?x a different (less greedy) way, and this second attempt for some reason is succeeding. Perhaps it's not resetting counters correctly when it backtracks.

#4 Updated by Michael Kay about 2 years ago

Here's the internal trace output, with commentary:

Analyzing query from {matches(' x x x', '^(?: (?:a*b)?x){2}$')}
Iterating over OpSequence ^ a*b?x{2,2}$\Z at position 0 returning  0
Iterating over OpSequence ^ a*b?x{2,2}$ at position 0 returning  1
Iterating over OpSequence ^ a*b?x{2,2} at position 0 returning  2
Iterating over OpBOL ^ at position 0 returning IntSingletonIterator 3
IntIterator 3 hasNext() = true
IntIterator 3 next() = 0
Iterating over OpSequence  a*b?x at position 0 returning  4
Iterating over OpSequence  a*b? at position 0 returning  5
Iterating over OpAtom   at position 0 returning IntSingletonIterator 6
IntIterator 6 hasNext() = true
IntIterator 6 next() = 1
Iterating over OpSequence a*b at position 1 returning  7
Iterating over OpAtom a at position 1 returning EmptyIntIterator 8
IntIterator 8 hasNext() = false
Iterating over OpGreedyFixed a* at position 1 returning IntStepIterator 9
IntIterator 9 hasNext() = true
IntIterator 9 next() = 1
Iterating over OpAtom b at position 1 returning EmptyIntIterator 10
IntIterator 10 hasNext() = false
IntIterator 9 hasNext() = false
IntIterator 7 hasNext() = false
Iterating over OpRepeat a*b? at position 1 returning ForceProgressIterator 11
Iterating over OpTrace a*b? at position 1 returning  12
IntIterator 11 hasNext() = true
IntIterator 12 hasNext() = true
IntIterator 11 next() = 1
IntIterator 12 next() = 1
IntIterator 5 hasNext() = true
IntIterator 5 next() = 1
Iterating over OpAtom x at position 1 returning IntSingletonIterator 13
IntIterator 13 hasNext() = true
IntIterator 13 next() = 2
IntIterator 4 hasNext() = true
IntIterator 4 next() = 2
Iterating over OpSequence  a*b?x at position 2 returning  14
Iterating over OpSequence  a*b? at position 2 returning  15
Iterating over OpAtom   at position 2 returning IntSingletonIterator 16
IntIterator 16 hasNext() = true
IntIterator 16 next() = 3

**** [Label L]

Iterating over OpSequence a*b at position 3 returning  17
Iterating over OpAtom a at position 3 returning EmptyIntIterator 18
IntIterator 18 hasNext() = false
Iterating over OpGreedyFixed a* at position 3 returning IntStepIterator 19
IntIterator 19 hasNext() = true
IntIterator 19 next() = 3
Iterating over OpAtom b at position 3 returning EmptyIntIterator 20
IntIterator 20 hasNext() = false
IntIterator 19 hasNext() = false
IntIterator 17 hasNext() = false
Iterating over OpRepeat a*b? at position 3 returning ForceProgressIterator 21
Iterating over OpTrace a*b? at position 3 returning  22
IntIterator 21 hasNext() = true
IntIterator 22 hasNext() = true
IntIterator 21 next() = 3
IntIterator 22 next() = 3
IntIterator 15 hasNext() = true
IntIterator 15 next() = 3
Iterating over OpAtom x at position 3 returning IntSingletonIterator 23
IntIterator 23 hasNext() = true
IntIterator 23 next() = 4
IntIterator 14 hasNext() = true
IntIterator 14 next() = 4
Iterating over OpRepeat  a*b?x{2,2} at position 0 returning ForceProgressIterator 24
Iterating over OpTrace  a*b?x{2,2} at position 0 returning  25
IntIterator 24 hasNext() = true
IntIterator 25 hasNext() = true
IntIterator 24 next() = 4
IntIterator 25 next() = 4
IntIterator 2 hasNext() = true
IntIterator 2 next() = 4
Iterating over OpEOL $ at position 4 returning EmptyIntIterator 26

**** at this point we've matched a*b?x{2,2} and find that it's not followed by $. Backtrack to where we were at [Label L]

IntIterator 26 hasNext() = false
IntIterator 23 hasNext() = false
Iterating over OpSequence a*b at position 3 returning  27
Iterating over OpAtom a at position 3 returning EmptyIntIterator 28
IntIterator 28 hasNext() = false
Iterating over OpGreedyFixed a* at position 3 returning IntStepIterator 29
IntIterator 29 hasNext() = true
IntIterator 29 next() = 3
Iterating over OpAtom b at position 3 returning EmptyIntIterator 30
IntIterator 30 hasNext() = false
IntIterator 29 hasNext() = false
IntIterator 27 hasNext() = false
IntIterator 21 hasNext() = true
IntIterator 22 hasNext() = true
IntIterator 21 next() = 3
IntIterator 22 next() = 3
IntIterator 15 hasNext() = true
IntIterator 15 next() = 3
Iterating over OpAtom x at position 3 returning IntSingletonIterator 31
IntIterator 31 hasNext() = true
IntIterator 31 next() = 4
IntIterator 14 hasNext() = true
IntIterator 14 next() = 4

**** at this point we've gone wrong, we should not be trying to match a*b?x at position 4

Iterating over OpSequence  a*b?x at position 4 returning  32
Iterating over OpSequence  a*b? at position 4 returning  33
Iterating over OpAtom   at position 4 returning IntSingletonIterator 34
IntIterator 34 hasNext() = true
IntIterator 34 next() = 5
Iterating over OpSequence a*b at position 5 returning  35
Iterating over OpAtom a at position 5 returning EmptyIntIterator 36
IntIterator 36 hasNext() = false
Iterating over OpGreedyFixed a* at position 5 returning IntStepIterator 37
IntIterator 37 hasNext() = true
IntIterator 37 next() = 5
Iterating over OpAtom b at position 5 returning EmptyIntIterator 38
IntIterator 38 hasNext() = false
IntIterator 37 hasNext() = false
IntIterator 35 hasNext() = false
Iterating over OpRepeat a*b? at position 5 returning ForceProgressIterator 39
Iterating over OpTrace a*b? at position 5 returning  40
IntIterator 39 hasNext() = true
IntIterator 40 hasNext() = true
IntIterator 39 next() = 5
IntIterator 40 next() = 5
IntIterator 33 hasNext() = true
IntIterator 33 next() = 5
Iterating over OpAtom x at position 5 returning IntSingletonIterator 41
IntIterator 41 hasNext() = true
IntIterator 41 next() = 6
IntIterator 32 hasNext() = true
IntIterator 32 next() = 6
IntIterator 24 hasNext() = true
IntIterator 25 hasNext() = true
IntIterator 24 next() = 6
IntIterator 25 next() = 6
IntIterator 2 hasNext() = true
IntIterator 2 next() = 6
Iterating over OpEOL $ at position 6 returning IntSingletonIterator 42
IntIterator 42 hasNext() = true
IntIterator 42 next() = 6
IntIterator 1 hasNext() = true
IntIterator 1 next() = 6
IntIterator 0 hasNext() = true
IntIterator 0 next() = 6

#5 Updated by Michael Kay about 2 years ago

Changing line 836 of Operation.java from

iterators.size() <= bound

to

iterators.size() < bound

fixes the problem; but it needs a lot more study to be sure that this change has no unintended side-effects.

#6 Updated by Michael Kay about 2 years ago

This change appears to cause no regression on the QT3 or XSLT3 test suites.

#7 Updated by Michael Kay about 2 years ago

  • Status changed from In Progress to Resolved
  • Applies to branch trunk added
  • Fix Committed on Branch 9.8, trunk added

I have examined the logic more carefully. It's not easy to understand its logic for all possible cases, but I've convinced myself that for this particular case, the change makes sense.

Applying the fix.

#8 Updated by Debbie Lockett about 2 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.8.0.14 added

Bug fix applied in the Saxon 9.8.0.14 maintenance release.

Please register to edit this issue

Also available in: Atom PDF