Project

Profile

Help

Bug #6426

closed

fn:matches surprisingly returns false for fn:matches("AB", "^(.*)+B")

Added by Michael Kay 4 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
XPath conformance
Sprint/Milestone:
-
Start date:
2024-05-15
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
12, trunk
Fix Committed on Branch:
12, trunk
Fixed in Maintenance Release:
Platforms:
.NET, Java

Description

From: https://github.com/w3c/qt3tests/issues/59

With Saxon, fn:matches("AB", "^(.*)+B") returns false, and I don't understand why. I've ported Saxon's regex engine to Rust, and I'm also getting a false here (once I fixed another bug), which surprised me, so I verified that Saxon does the same.

Looking at it, the .* should consume "A", and then "B" should be matched. But it doesn't do that.

On regex101 all the flavors have it match https://regex101.com/r/JfmqWf/1

I'm trying to understand why it wouldn't match for the XPath regex flavor. And whether it should match or not, I think it's worthwhile adding a test case to the test suite to nail this behavior down.

This behavior is implemented inside of the Repeat op, for greedy matching. I suspect it's in advance() (in the Java code). I see a bug was fixed previously (3787).

Actions #1

Updated by Michael Kay 4 months ago

Well, just on the off-chance, I tried undoing the change made by bug 3787, and it made no difference.

Looking at bug #3787, I clearly wasn't 100% comfortable with it.

Repetitions of things that can be empty (as here (.*)+ are a bit of a nightmare, there is a lot of code to make sure we don't get non-termination. The '+' in this regex is obviously totally redundant (and dropping it fixes the problem), but that doesn't help much. One approach might be to optimize the + away at compile time, but the semantics are complicated by the fact that it's a capturing group.

I imagine it's the nested backtracking that causes the trouble. But I'm not sure why. What should happen is that .* first greedily captures the whole input, we then fail because the iteration isn't terminated by a B. So we try backtracking the outer loop, but we've only been round it once and that's the minimum, so we should try rolling the inner loop back one character, and that should work.

Actions #2

Updated by Michael Kay 4 months ago

Switched on debug trace, and it's reasonably compact:

Iterating over OpSequence ^([....]*)B\Z at position 0 returning OpSequence$1 0
Iterating over OpSequence ^([....]*)B at position 0 returning OpSequence$1 1
Iterating over OpSequence ^([....]*) at position 0 returning OpSequence$1 2
Iterating over OpBOL ^ at position 0 returning IntSingletonIterator 3
IntIterator 3 hasNext() = true
IntIterator 3 next() = 0
Iterating over OpCharClass [....] at position 0 returning IntSingletonIterator 4
IntIterator 4 hasNext() = true
IntIterator 4 next() = 1
Iterating over OpCharClass [....] at position 1 returning IntSingletonIterator 5
IntIterator 5 hasNext() = true
IntIterator 5 next() = 2
Iterating over OpCharClass [....] at position 2 returning EmptyIntIterator 6
IntIterator 6 hasNext() = false
Iterating over OpGreedyFixed [....]* at position 0 returning IntStepIterator 7
IntIterator 7 hasNext() = true
IntIterator 7 next() = 2
IntIterator 2 hasNext() = true
IntIterator 2 next() = 2
Iterating over OpAtom B at position 2 returning EmptyIntIterator 8
IntIterator 8 hasNext() = false
IntIterator 7 hasNext() = true
IntIterator 7 next() = 1
IntIterator 2 hasNext() = true
IntIterator 2 next() = 1
Iterating over OpAtom B at position 1 returning IntSingletonIterator 9
IntIterator 9 hasNext() = true
IntIterator 9 next() = 2
IntIterator 1 hasNext() = true
IntIterator 1 next() = 2
IntIterator 0 hasNext() = true
IntIterator 0 next() = 2

The noticeable thing here is:

Iterating over OpAtom B at position 1 returning IntSingletonIterator 9
IntIterator 9 hasNext() = true

which suggests to me that the B was matched. How can that happen without the regex as a whole succeeding?

Actions #3

Updated by Michael Kay 4 months ago

I failed to spot that with tracing enabled, the query returns true. That's because I was running with the modified regex without the +.

When I reinstate the +, the query returns true with tracing enabled, and false with tracing switched off. Ouch - a Heisenbug.

This is the trace of the working version:

Iterating over OpSequence ^([....]*)*B\Z at position 0 returning OpSequence$1 0
Iterating over OpSequence ^([....]*)*B at position 0 returning OpSequence$1 1
Iterating over OpSequence ^([....]*)* at position 0 returning OpSequence$1 2
Iterating over OpBOL ^ at position 0 returning IntSingletonIterator 3
IntIterator 3 hasNext() = true
IntIterator 3 next() = 0
Iterating over OpCharClass [....] at position 0 returning IntSingletonIterator 4
IntIterator 4 hasNext() = true
IntIterator 4 next() = 1
Iterating over OpCharClass [....] at position 1 returning IntSingletonIterator 5
IntIterator 5 hasNext() = true
IntIterator 5 next() = 2
Iterating over OpCharClass [....] at position 2 returning EmptyIntIterator 6
IntIterator 6 hasNext() = false
Iterating over OpGreedyFixed [....]* at position 0 returning IntStepIterator 7
IntIterator 7 hasNext() = true
IntIterator 7 next() = 2
Iterating over OpCharClass [....] at position 2 returning EmptyIntIterator 8
IntIterator 8 hasNext() = false
Iterating over OpGreedyFixed [....]* at position 2 returning IntStepIterator 9
IntIterator 9 hasNext() = true
IntIterator 9 next() = 2
Iterating over OpCharClass [....] at position 2 returning EmptyIntIterator 10
IntIterator 10 hasNext() = false
Iterating over OpGreedyFixed [....]* at position 2 returning IntStepIterator 11
IntIterator 11 hasNext() = true
IntIterator 11 next() = 2
Iterating over OpRepeat ([....]*)* at position 0 returning Operation$ForceProgressIterator 12
IntIterator 12 hasNext() = true
IntIterator 12 next() = 2
IntIterator 2 hasNext() = true
IntIterator 2 next() = 2
Iterating over OpAtom B at position 2 returning EmptyIntIterator 13
IntIterator 13 hasNext() = false
IntIterator 11 hasNext() = false
IntIterator 12 hasNext() = true
IntIterator 12 next() = 2
IntIterator 2 hasNext() = true
IntIterator 2 next() = 2
Iterating over OpAtom B at position 2 returning EmptyIntIterator 14
IntIterator 14 hasNext() = false
IntIterator 9 hasNext() = false
IntIterator 12 hasNext() = true
IntIterator 12 next() = 2
IntIterator 2 hasNext() = true
IntIterator 2 next() = 2
Iterating over OpAtom B at position 2 returning EmptyIntIterator 15
IntIterator 15 hasNext() = false
IntIterator 7 hasNext() = true
IntIterator 7 next() = 1
Iterating over OpCharClass [....] at position 1 returning IntSingletonIterator 16
IntIterator 16 hasNext() = true
IntIterator 16 next() = 2
Iterating over OpCharClass [....] at position 2 returning EmptyIntIterator 17
IntIterator 17 hasNext() = false
Iterating over OpGreedyFixed [....]* at position 1 returning IntStepIterator 18
IntIterator 18 hasNext() = true
IntIterator 18 next() = 2
IntIterator 12 hasNext() = true
IntIterator 12 next() = 2
IntIterator 2 hasNext() = true
IntIterator 2 next() = 2
Iterating over OpAtom B at position 2 returning EmptyIntIterator 19
IntIterator 19 hasNext() = false
IntIterator 18 hasNext() = true
IntIterator 18 next() = 1
IntIterator 12 hasNext() = true
IntIterator 12 next() = 1
IntIterator 2 hasNext() = true
IntIterator 2 next() = 1
Iterating over OpAtom B at position 1 returning IntSingletonIterator 20
IntIterator 20 hasNext() = true
IntIterator 20 next() = 2
IntIterator 1 hasNext() = true
IntIterator 1 next() = 2
IntIterator 0 hasNext() = true
IntIterator 0 next() = 2

There are 4 unsuccessful attempts to match B at position 2 before it finally backtracks both nested loops to try at position 1, at which point it succeeds.

With tracing switched off and a debugger breakpoint at OpAtom.iterateMatches(), I'm seeing 4 attempts to match at position 2, and no attempt to match at position 1.

Actions #4

Updated by Michael Kay 4 months ago

When there's a construct that allows an infinite number of zero-length matches, we prevent non-termination by injecting a ForceProgressIterator. This gives up after a certain number of zero-length matches at a given position. The threshold is currently set to 3 (and there's a comment that points out a regex that can defeat this). If we increase the threshold to 8, the match succeeds. Obviously that's fine for this case but doesn't solve the general problem.

Is there a smarter way to detect that we aren't making progress and need to give up?

Actions #5

Updated by Michael Kay 4 months ago

Perhaps we should set an upper bound on the number of different ways a given position in the input string can be reached -- like, the position raised to the power of the maximum nesting depth of repetition constructs.

The lowest value that works for this case (AB) is 5. If we change the string to AAB, it is 9. For AAAB, it is 14. For AAAAB, 20. For AAAAAB, 27.

Definitely a pattern, but what's the formula? OmniCalculator tells me y = 0.5x^2 + 1.5x. Does that make any sense? And how would it change with 3 nested loops?

That's equivalent to (n+1) x (n+2) / 2 - 1 which seems more plausible as the number of different ways of reaching position n. Perhaps we can use (n+2)^p / 2 as a reasonable upper bound, where p is the maximum nesting depth of loops, and n is the current position?

Actions #6

Updated by Michael Kay 4 months ago

  • Status changed from New to Resolved
  • Applies to branch 12, trunk added
  • Fix Committed on Branch 12, trunk added
  • Platforms .NET, Java added

Fixed on 12 and main. Bug also present in earlier versions.

Actions #7

Updated by O'Neil Delpratt 3 months ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 12.5 added

Bug fix applied in the Saxon 12.5 Maintenance release.

Please register to edit this issue

Also available in: Atom PDF