Bug #3712: Simple regular expression not matching when it should - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #3712

closed

Simple regular expression not matching when it should

Added by Matt Patterson over 6 years ago. Updated over 6 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

XPath conformance

Sprint/Milestone:

Start date:

2018-03-06

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

9.8, trunk

Fix Committed on Branch:

9.8, trunk

Fixed in Maintenance Release:

9.8.0.10

Platforms:

Description

The regular expression ([A-Z])\1* Doesn't match the string 'A', but does match 'AA'. In all other Regex implementations I've tried, (Ruby, Python, JS), it works as expected.

Seen in Oxygen using 9.7 EE and 9.8 PE, and using XSpec with 9.7 EE and the Ant-based runner in Oxygen.

Files

saxon-regex-bug.xsl (370 Bytes) saxon-regex-bug.xsl

Matt Patterson, 2018-03-06 17:26

Actions

Copy link

Updated by Matt Patterson over 6 years ago

Apologies for the weird backslash-wrapped stuff in the description - I'd assumed Markdown without checking.

The regular expression ([A-Z])\1* doesn't match the string 'A', but does match 'AA'. It should match both.

Actions

Copy link

Updated by Michael Kay over 6 years ago

Category set to XPath conformance
Status changed from New to In Progress
Assignee set to Michael Kay
Priority changed from Low to Normal
Applies to branch trunk added
Applies to branch deleted (~~9.7~~)

Reproduced. Test case fn-matches-53 added to QT3 test suite.

Actions

Copy link

Updated by Michael Kay over 6 years ago

In Operation.OpRepeat, we turn (X)* into (X)+ if X is capable of matching a zero-length string, to reduce the problem of repeatedly matching a zero-length string without advancing the current position. We classify a back-reference as being capable of matching a zero-length string. But the rewrite is incorrect in this case, because this particular back-reference is not capable of matching a zero-length string. (Clearly, ([A-Z])\1+ does not match the input string @A@.)

Actions

Copy link

Updated by Michael Kay over 6 years ago

Tricky one. I think the back-reference is the only construct where we can't tell statically whether it matches a zero-length-string or not.

If we change OpBackReference.matchesEmptyString() to return false, then this test case works, but it's hard to convince oneself that this change is safe.

The trickiest case is where the input string to matches() is zero-length, we then assume that we have been able to work out statically whether the regular expression matches empty strings or not. But for an expression like (A?)\1 we don't know this statically, because we don't do any static matching-up of the back-reference \1 to the expression (A?). Perhaps we should: if we did, then matchesEmptyString() would be true for a back-reference iff it's true for the expression that it refers to.

Actions

Copy link

Updated by Michael Kay over 6 years ago

Looking at the usages of matchesEmptyString(), I think changing OpBackReference.matchesEmptyString() to return false is safe for all usages except the top-level RegularExpression.isNullable(), which relies solely on the static analysis to decide whether the expression matches a zero-length string in its entirety (i.e. with implicit anchoring). This is used in particular (a) when evaluating the pattern facet against an empty string in XSD, and (b) when assessing the rule in various XPath expressions (e.g. tokenize()) that "the regular expression must not be one that matches a zero-length string".

The XSD case is not affected because XSD doesn't use back-references, but case (b) is certainly relevant.

Given a regular expression that matches a sequence (OpSequence), e.g. ABC*, we return matchesEmptyString()=true only if each subexpression returns matchesEmptyString()=true. This means that if one of the subexpressions is a back-reference, we will now return false, even for an expression such as (A?)\1, which means we will not prevent such an expression being used in functions such as tokenize().

Sure enough, tokenize('ABCD', '(A?)\1') now goes into an infinite loop, rather than raising an error.

I think the answer to this is that a positive response from matchesEmptyString() is sufficient to determine that a regex matches "", but a negative response is not sufficient to establish that it doesn't. So the method ARegularExpression.matches(CharSequence), which is the only place that calls regex.isNullable(), needs to change from

if (StringValue.isEmpty(input)) { return regex.isNullable(); }

if (StringValue.isEmpty(input) && regex.isNullable() { return true; }

The other remaining doubt is whether we can go into an infinite loop if we don't do the optimization from A* to A+ where A can match an empty string. Testing this with

matches('ABCD', '(A?)\1*')

suggests there's no problem. There are other mechanisms in place to ensure that we don't try to match a zero-length string an infinite number of times without advancing the position.

Actions

Copy link

Updated by Michael Kay over 6 years ago

Status changed from In Progress to Resolved
Fix Committed on Branch 9.8, trunk added

Resolved on the 9.8 and trunk branches.

For back-references, matchesEmptyString() returns false
We clarify that matchesEmptyString() returning false means only that it can't be statically determined whether the regex matches an empty string
paths that call this method are adjusted accordingly.

Actions

Copy link

Updated by O'Neil Delpratt over 6 years ago

Status changed from Resolved to Closed
% Done changed from 0 to 100
Fixed in Maintenance Release 9.8.0.10 added

Bug fix applied in the Saxon 9.8.0.10 maintenance release.

Please register to edit this issue

Actions

Send by e-mail Copy link

Also available in: Atom PDF

Project

Profile

Help

Saxon

Bug #3712

Simple regular expression not matching when it should

Updated by Matt Patterson over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by Michael Kay over 6 years ago

Updated by O'Neil Delpratt over 6 years ago