How is tokenize($string, '\W+') evaluated?
Added by Martin Honnen over 4 years ago
The Saxon-JS 2 docs about regular expressions state that \w will be evaluated the JavaScript way and not the XPath way in Saxon-JS. Does that also hold for \W?
When I run the JavaScript '¿Cómo se escribe España?'.split(/\W+/g) in Chrome, I get the array ["", "C", "mo", "se", "escribe", "Espa", "a", ""].
When I run SaxonJS.XPath.evaluate("tokenize('¿Cómo se escribe España?', '\\W+')"), I get ["", "Cómo", "se", "escribe", "Esp", "ñ", ""].
On the other hand, SaxonJS.XPath.evaluate("matches('España', '^\\w+$')") gives true inside Chrome.
Somehow the Saxon-JS results seem inconsistent. What is it doing with the word España?
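For comparison, here is a minimal JavaScript sketch of what the two interpretations would give for this string. It assumes the XPath (XML Schema) definition of \w as every character except those in the Unicode categories P, Z and C, so that \W corresponds to [\p{P}\p{Z}\p{C}]. The variable names are only illustrative; this is plain JavaScript, not what Saxon-JS does internally, and it needs an engine that supports Unicode property escapes.

```javascript
const text = '¿Cómo se escribe España?';

// JavaScript meaning: \W is everything outside [A-Za-z0-9_],
// so accented letters such as ó and ñ act as separators.
const asciiTokens = text.split(/\W+/);

// Assumed XPath meaning: \W is punctuation, separators and "other"
// characters only, so accented letters stay inside the tokens.
const unicodeTokens = text.split(/[\p{P}\p{Z}\p{C}]+/u);

console.log(asciiTokens);   // ["", "C", "mo", "se", "escribe", "Espa", "a", ""]
console.log(unicodeTokens); // ["", "Cómo", "se", "escribe", "España", ""]
```

Under that assumption, the tokenize() call would be expected to return the second array, which matches neither result reported above.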
Replies (4)
RE: How is tokenize($string, '\W+') evaluated? - Added by Martin Honnen over 4 years ago
It seems the letter a is simply matched wrongly (I first thought some non-ASCII letters were causing this, but it seems to be unrelated to that). For instance,
SaxonJS.XPath.evaluate('analyze-string("abcdefghijklmnopqrstuvwxyz", "\\W+")')
gives
<analyze-string-result><match>a</match><non-match>bcdefghijklmnopqrstuvwxyz</non-match></analyze-string-result>
It doesn't happen with an upper-case A. Nor does SaxonJS.XPath.evaluate('analyze-string("abcdefghijklmnopqrstuvwxyz", "\\w+")') show any anomaly.
RE: How is tokenize($string, '\W+') evaluated? - Added by Martin Honnen over 4 years ago
It seems that, to match \W, Saxon-JS 2 builds a regular expression including \u0061, which is a. I haven't found anything defining \W; it seems \w is defined and the negation is computed from that, but I haven't figured out where or how that is done, or why a (\u0061) ends up in the regular expression.
RE: How is tokenize($string, '\W+') evaluated? - Added by Michael Kay over 4 years ago
I think the statement "\w has the JavaScript meaning rather than the XPath meaning." is out of date. Certainly the code attempts to construct an expansion for \w.
\W should be the complement. However, this doesn't seem to be working correctly. With the stylesheet
<out>
<a>{matches('a', '\w')}|{matches('a', '\W')}</a>
<b>{matches('b', '\w')}|{matches('b', '\W')}</b>
<c>{matches('c', '\w')}|{matches('c', '\W')}</c>
<A>{matches('A', '\w')}|{matches('A', '\W')}</A>
<B>{matches('B', '\w')}|{matches('B', '\W')}</B>
<C>{matches('C', '\w')}|{matches('C', '\W')}</C>
<ã>{matches('ã', '\w')}|{matches('ã', '\W')}</ã>
<Ã>{matches('Ã', '\w')}|{matches('Ã', '\W')}</Ã>
</out>
I get the output:
<out>
<a>true|true</a>
<b>true|false</b>
<c>true|false</c>
<A>true|false</A>
<B>true|false</B>
<C>true|false</C>
<ã>true|false</ã>
<Ã>true|false</Ã>
</out>
This is clearly incorrect: a is reported as matching both \w and \W.
Saxon-JS builds the data from a file categories.json which contains the data as a sequence of ranges; there's logic to form the complement of a set of ranges, for example if \x is (10-20, 30-40) then \X becomes (0-9, 21-29, 41-1114111). It looks from this evidence as if there might be an off-by-one bug in forming this complement.
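To make the intended behaviour concrete, here is a minimal sketch of complementing a sorted list of inclusive, non-overlapping code point ranges. It is not the Saxon-JS code: the function names complementRanges and inRanges, and the [lo, hi] array layout, are assumptions made for illustration, and nothing here is taken from categories.json.

```javascript
const MAX_CODE_POINT = 0x10FFFF; // 1114111, the highest Unicode code point

// Hypothetical helper: ranges is a sorted array of inclusive [lo, hi]
// pairs; the result covers every code point from 0 to max not in ranges.
function complementRanges(ranges, max = MAX_CODE_POINT) {
  const result = [];
  let next = 0; // lowest code point not yet accounted for
  for (const [lo, hi] of ranges) {
    // The gap before this range ends at lo - 1, not at lo.
    if (lo > next) result.push([next, lo - 1]);
    next = Math.max(next, hi + 1);
  }
  if (next <= max) result.push([next, max]);
  return result;
}

// Hypothetical helper: true if code point cp lies in any of the ranges.
function inRanges(ranges, cp) {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

const x = [[10, 20], [30, 40]];
const notX = complementRanges(x);
console.log(notX); // [[0, 9], [21, 29], [41, 1114111]]
```

A correct complement puts every code point in exactly one of the two sets, so inRanges(x, cp) and inRanges(notX, cp) should never both be true. The output above for a, where both \w and \W match, is the kind of overlap a boundary error at lo - 1 or hi + 1 could produce, though the sketch does not establish that this is the actual cause.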
RE: How is tokenize($string, '\W+') evaluated? - Added by Michael Kay over 4 years ago
Raised the bug at https://saxonica.plan.io/issues/4634