Bug #3387
closed
Inverse character ranges in regular expressions
100%
Description
I think the logic for handling inverse character ranges such as \P{L} may be incorrect.
For two-letter categories the logic is sound. The categories.json file gives a definition of Ll as
[["61","7A"],["B5","B5"],["DF","F6"],...
and to invert this we form the ranges corresponding to the gaps:
["0","60"],["7B","B4"],["B6","DE"],...
This only works if the ranges are in ascending order. For single-letter categories such as L, we concatenate the ranges of the subcategories, so the combined list is no longer in ascending order. The result is therefore the union of the gaps in the subcategories, when it should be the gaps in the union of the subcategories.
Oddly, I can't put my finger on any tests that are failing as a result.
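The difference between the two results can be illustrated with a small sketch. It uses toy ranges over code points 0..20 rather than real Unicode data, and the helper names (gapsOf, contains) are hypothetical. It models the effect described above by inverting each subcategory separately; it is not the actual Saxon-JS code path.

```javascript
// Hypothetical helper: gaps between inclusive [lo, hi] ranges.
// Assumes the ranges are already in ascending order.
function gapsOf(ranges, max) {
    const gaps = [];
    let next = 0; // lowest code point not yet accounted for
    for (const [lo, hi] of ranges) {
        if (lo > next) gaps.push([next, lo - 1]);
        next = Math.max(next, hi + 1);
    }
    if (next <= max) gaps.push([next, max]);
    return gaps;
}

// True if code point cp falls inside any of the ranges.
const contains = (ranges, cp) => ranges.some(([lo, hi]) => cp >= lo && cp <= hi);

// Two toy "subcategories" (not real Unicode data).
const subA = [[2, 4], [10, 12]];
const subB = [[6, 8]];

// Faulty result: the union of the gaps in each subcategory.
const unionOfGaps = [...gapsOf(subA, 20), ...gapsOf(subB, 20)];

// Intended result: the gaps in the union of the subcategories.
const sortedUnion = [...subA, ...subB].sort((a, b) => a[0] - b[0]);
const gapsOfUnion = gapsOf(sortedUnion, 20);

console.log(contains(unionOfGaps, 3)); // true: 3 is in subA but sits in a gap of subB
console.log(contains(gapsOfUnion, 3)); // false: 3 is correctly excluded
```

Code point 3 belongs to subA, so it should never appear in the inverse, yet the union of gaps admits it because it lies in a gap of subB.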
Updated by Michael Kay over 7 years ago
The problem can be demonstrated by making a variant of test regex-syntax-0191 as follows:
<test-case name="regex-syntax-0191b">
   <description>see regex-syntax-0001</description>
   <created by="Michael Kay" on="2012-11-07"/>
   <environment ref="regex-syntax"/>
   <test>
      <param name="regex" as="xs:string" select="'\P{P}*'"/>
      <param name="match" as="xs:string" select="'²'"/>
      <param name="nonmatch" as="xs:string" select="'«'"/>
      <param name="delimiter" as="xs:string" select="','"/>
      <initial-template name="go"/>
   </test>
   <result>
      <assert>/true</assert>
   </result>
</test-case>
The character xAB («) belongs to category Pi and therefore to P, so it should not match \P{P}. But it does match, because it falls into one of the gaps in the Pc subcategory.
Updated by Michael Kay about 7 years ago
- Status changed from New to Resolved
- Fix Committed on JS Branch Trunk added
Fixed on the trunk source. We now manage sets of integer ranges as a nested array [[1,3], [5,6]] rather than a flattened array [1,3,5,6]; this makes it easy to sort the ranges into order before computing the inverse ranges.
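A minimal sketch of the sort-then-invert approach on nested ranges, for illustration only. The function names (normalizeRanges, invertRanges) and the upper bound constant are assumptions, not the identifiers used in the Saxon-JS source, and merging overlapping or adjacent ranges is an extra step added here for robustness.

```javascript
// Assumed upper bound of the code point space.
const MAX_CODEPOINT = 0x10FFFF;

// Sort nested [lo, hi] ranges into ascending order and merge any
// that overlap or are adjacent. The input array is not modified.
function normalizeRanges(ranges) {
    const sorted = ranges.map(([lo, hi]) => [lo, hi]).sort((a, b) => a[0] - b[0]);
    const out = [];
    for (const [lo, hi] of sorted) {
        const last = out[out.length - 1];
        if (last && lo <= last[1] + 1) {
            last[1] = Math.max(last[1], hi); // extend the previous range
        } else {
            out.push([lo, hi]);
        }
    }
    return out;
}

// Compute the inverse: the gaps between the normalized ranges.
function invertRanges(ranges) {
    const inverse = [];
    let next = 0; // lowest code point not yet covered
    for (const [lo, hi] of normalizeRanges(ranges)) {
        if (lo > next) inverse.push([next, lo - 1]);
        next = hi + 1;
    }
    if (next <= MAX_CODEPOINT) inverse.push([next, MAX_CODEPOINT]);
    return inverse;
}

// Ranges supplied out of order, as happens when subcategories are concatenated.
// U+00AB is in Pi and U+005F is in Pc; the rest of P is omitted for brevity.
const inverted = invertRanges([[0xAB, 0xAB], [0x5F, 0x5F]]);
console.log(inverted.some(([lo, hi]) => 0xAB >= lo && 0xAB <= hi)); // false
```

With the nested form, each range is a single array element, so an ordinary sort with a comparator on the lower bound is enough; a flattened array would first have to be regrouped into pairs.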
Note, from a performance perspective, having computed the expansion of something like \W once, we should probably remember it for future use: at present we are repeating the work every time we parse a regular expression containing this construct.
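One way the caching suggested here might look is sketched below. This is an assumption about a possible design, not a description of what Saxon-JS does; memoizeExpansion and slowExpand are hypothetical names, and the ranges returned by slowExpand are placeholders rather than the real expansion of \W.

```javascript
// Wrap an expansion function so that each escape is expanded at most once.
function memoizeExpansion(expand) {
    const cache = new Map(); // escape string -> array of [lo, hi] ranges
    return function (escape) {
        if (!cache.has(escape)) {
            cache.set(escape, expand(escape));
        }
        return cache.get(escape);
    };
}

// Stand-in for the expensive computation; counts how often it runs.
let expansionCalls = 0;
function slowExpand(escape) {
    expansionCalls++;
    return [[0x00, 0x2F], [0x3A, 0x40]]; // placeholder ranges only
}

const cachedExpand = memoizeExpansion(slowExpand);
cachedExpand("\\W");
cachedExpand("\\W");
console.log(expansionCalls); // 1
```

The cached array is shared between callers, so in a design like this callers would need to treat it as read-only or copy it before modifying it.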
Updated by Debbie Lockett over 4 years ago
- Description updated (diff)
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in JS Release set to Saxon-JS 2.0