Project

Profile

Help

Bug #3712

closed

Simple regular expression not matching when it should

Added by Matt Patterson about 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
XPath conformance
Sprint/Milestone:
-
Start date:
2018-03-06
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
9.8, trunk
Fix Committed on Branch:
9.8, trunk
Fixed in Maintenance Release:
Platforms:

Description

The regular expression ([A-Z])\1* Doesn't match the string 'A', but does match 'AA'. In all other Regex implementations I've tried, (Ruby, Python, JS), it works as expected.

Seen in Oxygen using 9.7 EE and 9.8 PE, and using XSpec with 9.7 EE and the Ant-based runner in Oxygen.


Files

saxon-regex-bug.xsl (370 Bytes) saxon-regex-bug.xsl Matt Patterson, 2018-03-06 17:26
Actions #1

Updated by Matt Patterson about 6 years ago

Apologies for the weird backslash-wrapped stuff in the description - I'd assumed Markdown without checking.

The regular expression ([A-Z])\1* doesn't match the string 'A', but does match 'AA'. It should match both.

Actions #2

Updated by Michael Kay about 6 years ago

  • Category set to XPath conformance
  • Status changed from New to In Progress
  • Assignee set to Michael Kay
  • Priority changed from Low to Normal
  • Applies to branch trunk added
  • Applies to branch deleted (9.7)

Reproduced. Test case fn-matches-53 added to QT3 test suite.

Actions #3

Updated by Michael Kay about 6 years ago

In Operation.OpRepeat, we turn (X)* into (X)+ if X is capable of matching a zero-length string, to reduce the problem of repeatedly matching a zero-length string without advancing the current position. We classify a back-reference as being capable of matching a zero-length string. But the rewrite is incorrect in this case, because this particular back-reference is not capable of matching a zero-length string. (Clearly, ([A-Z])\1+ does not match the input string @A@.)

Actions #4

Updated by Michael Kay about 6 years ago

Tricky one. I think the back-reference is the only construct where we can't tell statically whether it matches a zero-length-string or not.

If we change OpBackReference.matchesEmptyString() to return false, then this test case works, but it's hard to convince oneself that this change is safe.

The trickiest case is where the input string to matches() is zero-length, we then assume that we have been able to work out statically whether the regular expression matches empty strings or not. But for an expression like (A?)\1 we don't know this statically, because we don't do any static matching-up of the back-reference \1 to the expression (A?). Perhaps we should: if we did, then matchesEmptyString() would be true for a back-reference iff it's true for the expression that it refers to.

Actions #5

Updated by Michael Kay about 6 years ago

Looking at the usages of matchesEmptyString(), I think changing OpBackReference.matchesEmptyString() to return false is safe for all usages except the top-level RegularExpression.isNullable(), which relies solely on the static analysis to decide whether the expression matches a zero-length string in its entirety (i.e. with implicit anchoring). This is used in particular (a) when evaluating the pattern facet against an empty string in XSD, and (b) when assessing the rule in various XPath expressions (e.g. tokenize()) that "the regular expression must not be one that matches a zero-length string".

The XSD case is not affected because XSD doesn't use back-references, but case (b) is certainly relevant.

Given a regular expression that matches a sequence (OpSequence), e.g. ABC*, we return matchesEmptyString()=true only if each subexpression returns matchesEmptyString()=true. This means that if one of the subexpressions is a back-reference, we will now return false, even for an expression such as (A?)\1, which means we will not prevent such an expression being used in functions such as tokenize().

Sure enough, tokenize('ABCD', '(A?)\1') now goes into an infinite loop, rather than raising an error.

I think the answer to this is that a positive response from matchesEmptyString() is sufficient to determine that a regex matches "", but a negative response is not sufficient to establish that it doesn't. So the method ARegularExpression.matches(CharSequence), which is the only place that calls regex.isNullable(), needs to change from

if (StringValue.isEmpty(input)) { return regex.isNullable(); }

to

if (StringValue.isEmpty(input) && regex.isNullable() { return true; }

The other remaining doubt is whether we can go into an infinite loop if we don't do the optimization from A* to A+ where A can match an empty string. Testing this with

matches('ABCD', '(A?)\1*')

suggests there's no problem. There are other mechanisms in place to ensure that we don't try to match a zero-length string an infinite number of times without advancing the position.

Actions #6

Updated by Michael Kay about 6 years ago

  • Status changed from In Progress to Resolved
  • Fix Committed on Branch 9.8, trunk added

Resolved on the 9.8 and trunk branches.

  • For back-references, matchesEmptyString() returns false

  • We clarify that matchesEmptyString() returning false means only that it can't be statically determined whether the regex matches an empty string

  • paths that call this method are adjusted accordingly.

Actions #7

Updated by O'Neil Delpratt about 6 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.8.0.10 added

Bug fix applied in the Saxon 9.8.0.10 maintenance release.

Please register to edit this issue

Also available in: Atom PDF