Bug #2487
closed
Incorrect value returned by match-substring/regex-group()
100%
Description
I'm debugging someone else's code that previously worked correctly. The problem is illustrated in the following XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
exclude-result-prefixes="xs xd"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:analyze-string select="'1949'" regex="([0-9]{{1,2}})?\s?([A-Z]{{1}}[a-z]{{1,8}}\.?)?\s?([0-9]{{4}})">
<xsl:matching-substring>
<xsl:text>regex-group(1)=</xsl:text><xsl:value-of select="regex-group(1)"/><xsl:text>
</xsl:text>
<xsl:text>regex-group(2)=</xsl:text><xsl:value-of select="regex-group(2)"/><xsl:text>
</xsl:text>
<xsl:text>regex-group(3)=</xsl:text><xsl:value-of select="regex-group(3)"/><xsl:text>
</xsl:text>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
I would expect regex-group(1) to be empty, but instead it gets the value '1'. This happens in both Saxon 9.6.0.5 and 9.6.0.7 (the versions bundled with oXygen).
See also http://stackoverflow.com/questions/33342200/why-does-this-return-a-value-for-regex-group1
Updated by Tomos Hillman about 9 years ago
I asked a colleague to test with an older version of Saxon (9.5.1.7); she gets the expected result (empty regex-group(1)).
Updated by Tomos Hillman about 9 years ago
I also get it with the replace() function:
replace('1949', '([0-9]{1,2})?\s?([A-Z]{1}[a-z]{1,8}\.?)?\s?([0-9]{4})', '$1')
Updated by Michael Kay about 9 years ago
- Category set to XPath conformance
- Status changed from New to In Progress
- Assignee set to Michael Kay
- Priority changed from Low to Normal
Added (slightly adapted) to W3C XSLT3 test suite as analyze-string-095.
Current results from Saxon 9.7:
regex-group(1)=1
regex-group(2)=
regex-group(3)=1949
Updated by Michael Kay about 9 years ago
This is one of those areas where the specifications for regular expressions are frustratingly unhelpful. Most of them say much the same as the F+O spec (which is the normative one here): "When parentheses are used in a part of the regular expression that is matched more than once (because it is within a construct that allows repetition), then only the last substring that it matched will be captured." What the spec doesn't say, but probably should, is that the only matches that count are those that are ultimately successful: if you end up abandoning a match of part of the regex as a result of backtracking, then the contents of any subgroups captured during that attempt should also be discarded. This seems to be what other regex engines do, and getting regular expression evaluation right often seems to be guided more by the actual behaviour of other engines than by any specification.
Saxon is doing what the F+O spec says, and not what it should say.
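As a cross-check of what another backtracking engine does with the pattern from the report, here is a minimal sketch using Python's re module. Python is used purely as an example of "other regex engines"; it is not Saxon, and its regex dialect is not identical to the XPath one, although this particular pattern uses only constructs common to both. The braces are single here because no attribute value template escaping is involved.

```python
import re

# Pattern from the report, written with single braces (no AVT escaping here).
PATTERN = r"([0-9]{1,2})?\s?([A-Z]{1}[a-z]{1,8}\.?)?\s?([0-9]{4})"

# Match the whole input, as in the analyze-string example.
match = re.fullmatch(PATTERN, "1949")
groups = match.groups()
print(groups)          # (None, None, '1949')

# Equivalent of replace('1949', ..., '$1'): an unmatched group
# contributes the empty string (Python 3.5 or later).
replaced = re.sub(PATTERN, r"\1", "1949")
print(repr(replaced))  # ''
```

Group 1 comes back unset: the engine first tries '19' and then '1' for the optional first group, abandons both attempts because four digits no longer remain, and does not keep the captures from those abandoned attempts.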
The backtracking logic was extensively rewritten for Saxon 9.6 to prevent stack overflows, and it is no doubt this rewrite that led to the regression.
There is no logic in the regex engine that attempts to clear captured subgroups during backtracking: a subgroup gets overwritten each time the capturing group matches something, but it doesn't get discarded if the backtracking unwinds beyond that part of the expression. So fixing this isn't going to be particularly easy. But I'm thinking about it. (I always had an uneasy feeling there might be a problem here, but didn't worry too much about it as all tests were passing.)
Updated by Michael Kay about 9 years ago
Turns out to be not as bad as I thought. The regex engine DOES have logic to clear "provisional" captured groups during backtracking; it just isn't invoking that logic on this particular path.
My first attempt at a patch, however, causes a regression in one XQuery test case, which evaluates:
analyze-string("how now brown cow", "(.*?ow\s+)+", "")
Updated by Michael Kay about 9 years ago
- Status changed from In Progress to Resolved
Now fixed: patch committed to Subversion on the 9.6 and 9.7 branches (module net.sf.saxon.regex.Operation).
The patch adds a call to matcher.clearCapturedSubgroupsBeyond(position) when the OpSequence.advance() operator returns -1, indicating that the attempt to match a sequence of terms has failed. The effect is that all subgroups captured at or beyond the position at which this matching attempt started are cleared.
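The idea behind the fix can be illustrated with a toy backtracking matcher. This is a sketch only, not Saxon's code: match_seq, TERMS, groups and the clear_on_failure flag are hypothetical names, the pattern is simplified to ([0-9]{1,2})?([0-9]{4}), and "beyond" is assumed to mean captures that start at or after the position where the failed attempt began.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical term: (group number, char test, min repeats, max repeats, optional?)
Term = Tuple[int, Callable[[str], bool], int, int, bool]
Captures = Dict[int, Tuple[int, int]]


def match_seq(terms: List[Term], text: str, pos: int,
              caps: Captures, clear_on_failure: bool) -> bool:
    """Match all of text[pos:] against terms, recording captures in caps."""
    if not terms:
        return pos == len(text)
    group, test, lo, hi, optional = terms[0]
    # Count how many characters this term could consume, up to hi.
    n = 0
    while n < hi and pos + n < len(text) and test(text[pos + n]):
        n += 1
    # Greedy: try the longest repetition first, then back off.
    for k in range(n, lo - 1, -1):
        caps[group] = (pos, pos + k)  # provisional capture
        if match_seq(terms[1:], text, pos + k, caps, clear_on_failure):
            return True
        if clear_on_failure:
            # The rest of the sequence failed: drop captures starting at or beyond pos.
            for g in [g for g, (start, _) in caps.items() if start >= pos]:
                del caps[g]
    # Last resort for an optional group: match nothing and capture nothing.
    return optional and match_seq(terms[1:], text, pos, caps, clear_on_failure)


# Simplified stand-in for the reported regex: ([0-9]{1,2})?([0-9]{4})
TERMS: List[Term] = [
    (1, str.isdigit, 1, 2, True),
    (2, str.isdigit, 4, 4, False),
]


def groups(text: str, clear_on_failure: bool) -> Dict[int, str]:
    """Return the captured substrings by group number, or {} if there is no match."""
    caps: Captures = {}
    matched = match_seq(TERMS, text, 0, caps, clear_on_failure)
    return {g: text[s:e] for g, (s, e) in caps.items()} if matched else {}


print(groups("1949", clear_on_failure=False))  # {1: '1', 2: '1949'}
print(groups("1949", clear_on_failure=True))   # {2: '1949'}
```

Without the clearing step, the provisional capture '1' left behind by the last abandoned attempt is still present when the match finally succeeds by skipping group 1, which is the symptom reported above. With the clearing step, group 1 ends up unset.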
All regex test cases in XSLT3TS and QT3 passing.
Updated by Michael Kay about 9 years ago
- Found in version changed from 9.6.0.7 to 9.6
Updated by O'Neil Delpratt about 9 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in version set to 9.6.0.8
Bug fix applied in the Saxon 9.6.0.8 maintenance release
Updated by O'Neil Delpratt about 9 years ago
- Applies to branch 9.6 added
- Fix Committed on Branch 9.6 added
- Fixed in Maintenance Release 9.6.0.8 added