How is tokenize($string, '\W+') evaluated?

The Saxon-JS 2 docs about regular expressions state that \w will be evaluated the JavaScript way and not the XPath way in Saxon-JS.

Does that also hold for \W?

When I run the JavaScript '¿Cómo se escribe España?'.split(/\W+/g) in Chrome, I get the array ["", "C", "mo", "se", "escribe", "Espa", "a", ""].

When I run SaxonJS.XPath.evaluate("tokenize('¿Cómo se escribe España?', '\\W+')") I get ["", "Cómo", "se", "escribe", "Esp", "ñ", ""].

On the other hand, SaxonJS.XPath.evaluate("matches('España', '^\\w+$')") gives true in Chrome.

Somehow the Saxon-JS results seem inconsistent. What is it doing with the word España?
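
For comparison, here is a small sketch of what the two definitions of \W would give in plain JavaScript. It assumes the XSD/XPath definition of \w (every character except those in Unicode categories P, Z and C) and approximates its complement with Unicode property escapes. The variable names are only illustrative, and the Unicode version used by the browser may differ from the one a given XPath processor uses.

```javascript
// XSD/XPath define \w as all characters except Unicode categories
// P (punctuation), Z (separators) and C (other), so \W is those three.
const input = '¿Cómo se escribe España?';

// Plain JavaScript \W is ASCII-only: ó and ñ count as non-word characters.
const jsTokens = input.split(/\W+/);

// Approximation of the XPath \W using property escapes (requires the u flag).
const xpathLikeTokens = input.split(/[\p{P}\p{Z}\p{C}]+/u);

console.log(jsTokens);        // ["", "C", "mo", "se", "escribe", "Espa", "a", ""]
console.log(xpathLikeTokens); // ["", "Cómo", "se", "escribe", "España", ""]
```

Neither result matches what Saxon-JS returns above, which suggests the problem is not simply a choice between the two definitions.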

Replies (4)

RE: How is tokenize($string, '\W+') evaluated? - Added by Martin Honnen over 4 years ago

It seems the letter a is simply matched wrongly (I first thought some non-ASCII letters were causing this, but it seems to be unrelated to that). For instance,

SaxonJS.XPath.evaluate('analyze-string("abcdefghijklmnopqrstuvwxyz", "\\W+")')

gives

<analyze-string-result><match>a</match><non-match>bcdefghijklmnopqrstuvwxyz</non-match></analyze-string-result>

It doesn't happen with an upper-case A. Nor does SaxonJS.XPath.evaluate('analyze-string("abcdefghijklmnopqrstuvwxyz", "\\w+")') show any anomaly.

RE: How is tokenize($string, '\W+') evaluated? - Added by Martin Honnen over 4 years ago

It seems that, to match \W, Saxon-JS 2 builds a regular expression that includes \u0061, which is a. I haven't found anything that defines \W directly; it seems \w is defined and the negation is computed from that, but I haven't figured out where or how that is done, or why a (\u0061) ends up in the regular expression.

RE: How is tokenize($string, '\W+') evaluated? - Added by Michael Kay over 4 years ago

I think the statement "\w has the JavaScript meaning rather than the XPath meaning." is out of date. Certainly the code attempts to construct an expansion for \w.

\W should be the complement. However, this doesn't seem to be working correctly. With the stylesheet

<out>
   <a>{matches('a', '\w')}|{matches('a', '\W')}</a>
   <b>{matches('b', '\w')}|{matches('b', '\W')}</b>
   <c>{matches('c', '\w')}|{matches('c', '\W')}</c>
   <A>{matches('A', '\w')}|{matches('A', '\W')}</A>
   <B>{matches('B', '\w')}|{matches('B', '\W')}</B>
   <C>{matches('C', '\w')}|{matches('C', '\W')}</C>
   <ã>{matches('ã', '\w')}|{matches('ã', '\W')}</ã>
   <Ã>{matches('Ã', '\w')}|{matches('Ã', '\W')}</Ã>
</out>

I get the output:

<out>
   <a>true|true</a>
   <b>true|false</b>
   <c>true|false</c>
   <A>true|false</A>
   <B>true|false</B>
   <C>true|false</C>
   <ã>true|false</ã>
   <Ã>true|false</Ã>
</out>

This is clearly incorrect.

Saxon-JS builds the data from a file categories.json which contains the data as a sequence of ranges; there's logic to form the complement of a set of ranges, for example if \x is (10-20, 30-40) then \X becomes (0-9, 21-29, 41-1114111). It looks from this evidence as if there might be an off-by-one bug in forming this complement.
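
To illustrate the kind of logic involved, here is a minimal sketch of complementing a sorted list of inclusive code point ranges. This is not the Saxon-JS code; the function names and the [start, end] representation are assumptions made for the example. An off-by-one slip in either the "start - 1" or the "end + 1" step would leak a boundary character such as a (code point 97) into both the set and its complement.

```javascript
const MAX_CODEPOINT = 0x10FFFF; // 1114111, the highest Unicode code point

// ranges: sorted, non-overlapping [start, end] pairs, both ends inclusive.
// Returns the ranges covering every code point NOT in the input.
function complementRanges(ranges) {
  const result = [];
  let next = 0; // lowest code point not yet accounted for
  for (const [start, end] of ranges) {
    if (start > next) {
      result.push([next, start - 1]); // gap before this range
    }
    next = end + 1; // resume just past this range
  }
  if (next <= MAX_CODEPOINT) {
    result.push([next, MAX_CODEPOINT]); // tail after the last range
  }
  return result;
}

// True if code point cp falls inside any of the ranges.
function inRanges(ranges, cp) {
  return ranges.some(([start, end]) => cp >= start && cp <= end);
}

console.log(complementRanges([[10, 20], [30, 40]]));
// [[0, 9], [21, 29], [41, 1114111]]
```

A quick sanity check for any such implementation is that no code point is ever in both a set and its complement, for example inRanges(set, 97) and inRanges(complementRanges(set), 97) should never both be true.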

RE: How is tokenize($string, '\W+') evaluated? - Added by Michael Kay over 4 years ago

Raised the bug at https://saxonica.plan.io/issues/4634
