Bug #3712
closedSimple regular expression not matching when it should
100%
Description
The regular expression ([A-Z])\1*
Doesn't match the string 'A'
, but does match 'AA'
. In all other Regex implementations I've tried, (Ruby, Python, JS), it works as expected.
Seen in Oxygen using 9.7 EE and 9.8 PE, and using XSpec with 9.7 EE and the Ant-based runner in Oxygen.
Files
Updated by Matt Patterson over 6 years ago
Apologies for the weird backslash-wrapped stuff in the description - I'd assumed Markdown without checking.
The regular expression ([A-Z])\1*
doesn't match the string 'A'
, but does match 'AA'
. It should match both.
Updated by Michael Kay over 6 years ago
- Category set to XPath conformance
- Status changed from New to In Progress
- Assignee set to Michael Kay
- Priority changed from Low to Normal
- Applies to branch trunk added
- Applies to branch deleted (
9.7)
Reproduced. Test case fn-matches-53 added to QT3 test suite.
Updated by Michael Kay over 6 years ago
In Operation.OpRepeat, we turn (X)*
into (X)+
if X is capable of matching a zero-length string, to reduce the problem of repeatedly matching a zero-length string without advancing the current position. We classify a back-reference as being capable of matching a zero-length string. But the rewrite is incorrect in this case, because this particular back-reference is not capable of matching a zero-length string. (Clearly, ([A-Z])\1+
does not match the input string @A@.)
Updated by Michael Kay over 6 years ago
Tricky one. I think the back-reference is the only construct where we can't tell statically whether it matches a zero-length-string or not.
If we change OpBackReference.matchesEmptyString() to return false, then this test case works, but it's hard to convince oneself that this change is safe.
The trickiest case is where the input string to matches() is zero-length, we then assume that we have been able to work out statically whether the regular expression matches empty strings or not. But for an expression like (A?)\1
we don't know this statically, because we don't do any static matching-up of the back-reference \1 to the expression (A?). Perhaps we should: if we did, then matchesEmptyString() would be true for a back-reference iff it's true for the expression that it refers to.
Updated by Michael Kay over 6 years ago
Looking at the usages of matchesEmptyString(), I think changing OpBackReference.matchesEmptyString() to return false is safe for all usages except the top-level RegularExpression.isNullable(), which relies solely on the static analysis to decide whether the expression matches a zero-length string in its entirety (i.e. with implicit anchoring). This is used in particular (a) when evaluating the pattern facet against an empty string in XSD, and (b) when assessing the rule in various XPath expressions (e.g. tokenize()) that "the regular expression must not be one that matches a zero-length string".
The XSD case is not affected because XSD doesn't use back-references, but case (b) is certainly relevant.
Given a regular expression that matches a sequence (OpSequence), e.g. ABC*, we return matchesEmptyString()=true only if each subexpression returns matchesEmptyString()=true. This means that if one of the subexpressions is a back-reference, we will now return false, even for an expression such as (A?)\1, which means we will not prevent such an expression being used in functions such as tokenize().
Sure enough, tokenize('ABCD', '(A?)\1')
now goes into an infinite loop, rather than raising an error.
I think the answer to this is that a positive response from matchesEmptyString() is sufficient to determine that a regex matches "", but a negative response is not sufficient to establish that it doesn't. So the method ARegularExpression.matches(CharSequence), which is the only place that calls regex.isNullable(), needs to change from
if (StringValue.isEmpty(input)) { return regex.isNullable(); }
to
if (StringValue.isEmpty(input) && regex.isNullable() { return true; }
The other remaining doubt is whether we can go into an infinite loop if we don't do the optimization from A* to A+ where A can match an empty string. Testing this with
matches('ABCD', '(A?)\1*')
suggests there's no problem. There are other mechanisms in place to ensure that we don't try to match a zero-length string an infinite number of times without advancing the position.
Updated by Michael Kay over 6 years ago
- Status changed from In Progress to Resolved
- Fix Committed on Branch 9.8, trunk added
Resolved on the 9.8 and trunk branches.
-
For back-references, matchesEmptyString() returns false
-
We clarify that matchesEmptyString() returning false means only that it can't be statically determined whether the regex matches an empty string
-
paths that call this method are adjusted accordingly.
Updated by O'Neil Delpratt over 6 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 9.8.0.10 added
Bug fix applied in the Saxon 9.8.0.10 maintenance release.
Please register to edit this issue