Bug #3787
closed
Regex shouldn't match, but matches
Category:
XPath conformance
Applies to branch:
9.8, trunk
Fix Committed on Branch:
9.8, trunk
Fixed in Maintenance Release:
Description
I have a strange issue with an regex that matches - though it shouldn't. The regex is about matching a string that contains a certain amount of tokens, separated by a single space.
I tried to generalize/minimize the issue to the extent that it still occurs. In the example the token is just "x". The part regex that matches such a token could therefore also be "x". The original issue and possible tokens are a bit more complex, so I instead use "(\d*.)?x" to match a token. As neither digits nor full stops are contained in the checked string, the added optional part must not make a diference....but it does, in certain cases.
Please find attached an XSLT file that tries to match different versions of the regex. There are three pairs of regex, whereby always the first one shall not match and the second one shall match. The very first one of the six regexes shall not match as well - but it actually matches.
I checked with Saxon HE 9.8.0.12.
Possibly related/solved by #3782?
Files
Used cmdline:
java -jar saxon9he.jar -s:regex_test.xslt -xsl:regex_test.xslt \!indent=true
- Category set to XPath conformance
- Status changed from New to In Progress
- Assignee set to Michael Kay
Reduced it to this query
matches(' x x x', '^( (\d*\.)?x){2}$')
which returns true when it should return false. Further simplification is elusive.
Achieved a bit of extra simplification to:
matches(' x x x', '^(?: (?:a*b)?x){2}$')
(Non capturing groups, although more verbose, are simpler internally than capturing groups)
These things are very tough to debug. There's no obvious error in the static analysis of the regex. At run-time it appears to be doing some spurious backtracking. It correctly matches the ' x x', finds that the next thing isn't end-of-string, and then decides to try and match (a*b)?x a different (less greedy) way, and this second attempt for some reason is succeeding. Perhaps it's not resetting counters correctly when it backtracks.
Here's the internal trace output, with commentary:
Analyzing query from {matches(' x x x', '^(?: (?:a*b)?x){2}$')}
Iterating over OpSequence ^ a*b?x{2,2}$\Z at position 0 returning 0
Iterating over OpSequence ^ a*b?x{2,2}$ at position 0 returning 1
Iterating over OpSequence ^ a*b?x{2,2} at position 0 returning 2
Iterating over OpBOL ^ at position 0 returning IntSingletonIterator 3
IntIterator 3 hasNext() = true
IntIterator 3 next() = 0
Iterating over OpSequence a*b?x at position 0 returning 4
Iterating over OpSequence a*b? at position 0 returning 5
Iterating over OpAtom at position 0 returning IntSingletonIterator 6
IntIterator 6 hasNext() = true
IntIterator 6 next() = 1
Iterating over OpSequence a*b at position 1 returning 7
Iterating over OpAtom a at position 1 returning EmptyIntIterator 8
IntIterator 8 hasNext() = false
Iterating over OpGreedyFixed a* at position 1 returning IntStepIterator 9
IntIterator 9 hasNext() = true
IntIterator 9 next() = 1
Iterating over OpAtom b at position 1 returning EmptyIntIterator 10
IntIterator 10 hasNext() = false
IntIterator 9 hasNext() = false
IntIterator 7 hasNext() = false
Iterating over OpRepeat a*b? at position 1 returning ForceProgressIterator 11
Iterating over OpTrace a*b? at position 1 returning 12
IntIterator 11 hasNext() = true
IntIterator 12 hasNext() = true
IntIterator 11 next() = 1
IntIterator 12 next() = 1
IntIterator 5 hasNext() = true
IntIterator 5 next() = 1
Iterating over OpAtom x at position 1 returning IntSingletonIterator 13
IntIterator 13 hasNext() = true
IntIterator 13 next() = 2
IntIterator 4 hasNext() = true
IntIterator 4 next() = 2
Iterating over OpSequence a*b?x at position 2 returning 14
Iterating over OpSequence a*b? at position 2 returning 15
Iterating over OpAtom at position 2 returning IntSingletonIterator 16
IntIterator 16 hasNext() = true
IntIterator 16 next() = 3
**** [Label L]
Iterating over OpSequence a*b at position 3 returning 17
Iterating over OpAtom a at position 3 returning EmptyIntIterator 18
IntIterator 18 hasNext() = false
Iterating over OpGreedyFixed a* at position 3 returning IntStepIterator 19
IntIterator 19 hasNext() = true
IntIterator 19 next() = 3
Iterating over OpAtom b at position 3 returning EmptyIntIterator 20
IntIterator 20 hasNext() = false
IntIterator 19 hasNext() = false
IntIterator 17 hasNext() = false
Iterating over OpRepeat a*b? at position 3 returning ForceProgressIterator 21
Iterating over OpTrace a*b? at position 3 returning 22
IntIterator 21 hasNext() = true
IntIterator 22 hasNext() = true
IntIterator 21 next() = 3
IntIterator 22 next() = 3
IntIterator 15 hasNext() = true
IntIterator 15 next() = 3
Iterating over OpAtom x at position 3 returning IntSingletonIterator 23
IntIterator 23 hasNext() = true
IntIterator 23 next() = 4
IntIterator 14 hasNext() = true
IntIterator 14 next() = 4
Iterating over OpRepeat a*b?x{2,2} at position 0 returning ForceProgressIterator 24
Iterating over OpTrace a*b?x{2,2} at position 0 returning 25
IntIterator 24 hasNext() = true
IntIterator 25 hasNext() = true
IntIterator 24 next() = 4
IntIterator 25 next() = 4
IntIterator 2 hasNext() = true
IntIterator 2 next() = 4
Iterating over OpEOL $ at position 4 returning EmptyIntIterator 26
**** at this point we've matched a*b?x{2,2} and find that it's not followed by $. Backtrack to where we were at [Label L]
IntIterator 26 hasNext() = false
IntIterator 23 hasNext() = false
Iterating over OpSequence a*b at position 3 returning 27
Iterating over OpAtom a at position 3 returning EmptyIntIterator 28
IntIterator 28 hasNext() = false
Iterating over OpGreedyFixed a* at position 3 returning IntStepIterator 29
IntIterator 29 hasNext() = true
IntIterator 29 next() = 3
Iterating over OpAtom b at position 3 returning EmptyIntIterator 30
IntIterator 30 hasNext() = false
IntIterator 29 hasNext() = false
IntIterator 27 hasNext() = false
IntIterator 21 hasNext() = true
IntIterator 22 hasNext() = true
IntIterator 21 next() = 3
IntIterator 22 next() = 3
IntIterator 15 hasNext() = true
IntIterator 15 next() = 3
Iterating over OpAtom x at position 3 returning IntSingletonIterator 31
IntIterator 31 hasNext() = true
IntIterator 31 next() = 4
IntIterator 14 hasNext() = true
IntIterator 14 next() = 4
**** at this point we've gone wrong, we should not be trying to match a*b?x at position 4
Iterating over OpSequence a*b?x at position 4 returning 32
Iterating over OpSequence a*b? at position 4 returning 33
Iterating over OpAtom at position 4 returning IntSingletonIterator 34
IntIterator 34 hasNext() = true
IntIterator 34 next() = 5
Iterating over OpSequence a*b at position 5 returning 35
Iterating over OpAtom a at position 5 returning EmptyIntIterator 36
IntIterator 36 hasNext() = false
Iterating over OpGreedyFixed a* at position 5 returning IntStepIterator 37
IntIterator 37 hasNext() = true
IntIterator 37 next() = 5
Iterating over OpAtom b at position 5 returning EmptyIntIterator 38
IntIterator 38 hasNext() = false
IntIterator 37 hasNext() = false
IntIterator 35 hasNext() = false
Iterating over OpRepeat a*b? at position 5 returning ForceProgressIterator 39
Iterating over OpTrace a*b? at position 5 returning 40
IntIterator 39 hasNext() = true
IntIterator 40 hasNext() = true
IntIterator 39 next() = 5
IntIterator 40 next() = 5
IntIterator 33 hasNext() = true
IntIterator 33 next() = 5
Iterating over OpAtom x at position 5 returning IntSingletonIterator 41
IntIterator 41 hasNext() = true
IntIterator 41 next() = 6
IntIterator 32 hasNext() = true
IntIterator 32 next() = 6
IntIterator 24 hasNext() = true
IntIterator 25 hasNext() = true
IntIterator 24 next() = 6
IntIterator 25 next() = 6
IntIterator 2 hasNext() = true
IntIterator 2 next() = 6
Iterating over OpEOL $ at position 6 returning IntSingletonIterator 42
IntIterator 42 hasNext() = true
IntIterator 42 next() = 6
IntIterator 1 hasNext() = true
IntIterator 1 next() = 6
IntIterator 0 hasNext() = true
IntIterator 0 next() = 6
Changing line 836 of Operation.java from
iterators.size() <= bound
to
iterators.size() < bound
fixes the problem; but it needs a lot more study to be sure that this change has no unintended side-effects.
This change appears to cause no regression on the QT3 or XSLT3 test suites.
- Status changed from In Progress to Resolved
- Applies to branch trunk added
- Fix Committed on Branch 9.8, trunk added
I have examined the logic more carefully. It's not easy to understand its logic for all possible cases, but I've convinced myself that for this particular case, the change makes sense.
Applying the fix.
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 9.8.0.14 added
Bug fix applied in the Saxon 9.8.0.14 maintenance release.
Please register to edit this issue
Also available in: Atom
PDF