regex \b not implemented?
Added by Anonymous over 19 years ago
Legacy ID: #3185483 Legacy Poster: Pat Cappelaere (cappelaere)
I am trying to match whole words using the match function and a regular expression containing \bword\b. This returns an "invalid escape sequence" error. It looks like \b is not implemented? Thanks, Pat.
Replies (3)
RE: regex \b not implemented? - Added by Anonymous over 19 years ago
Legacy ID: #3185668 Legacy Poster: Michael Kay (mhkay)
\b isn't available in the regex language defined in XML Schema or XPath (so Saxon is implementing the spec correctly). I imagine the reason that XML Schema excluded this construct (which in Perl is defined as "word boundary") is that the concept of a word boundary depends very much on what natural language you are using, and they didn't want the meaning of a regular expression to be context-dependent. Use something like \s+ instead. Michael Kay http://www.saxonica.com/
RE: regex \b not implemented? - Added by Anonymous over 19 years ago
Legacy ID: #3186681 Legacy Poster: Pat Cappelaere (cappelaere)
True. It does look like XML Schema does not specify \b as a word boundary. Hmm! XML Spy XQuery seems to support it, and it is described in the XQuery Kick Start book (J. McGovern et al.). \s+ is really not quite the same thing, as it requires whitespace. This will fail in many instances... Would this be a big deal to add? Just wondering :) Thanks again. Pat.
RE: regex \b not implemented? - Added by Anonymous over 19 years ago
Legacy ID: #3186703 Legacy Poster: Michael Kay (mhkay)
As I said before, I suspect the reason for the omission of \b is that the meaning of "word break" varies from one natural language to another, and neither the Schema nor the XSLT/XQuery groups wanted to make regular expressions depend on the language of the text (or of the user). There's little point raising comments on the language specification here: you could raise it as a public comment on the spec, but I'm pretty sure what response you would get. There are three possible definitions:
(a) define it as matching a word break appropriate to English-language text
(b) define it as matching a word break appropriate to the language that the text is written in
(c) define it as matching a word break appropriate to the current user's locale.
I don't think any of these definitions has any chance of being acceptable to the WGs. I can't comment on errors in other vendors' products or other authors' books. My own product, and my own XPath book, handle this correctly according to the W3C specifications.
Michael Kay
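One possible workaround for the missing \b, not proposed by either poster, is to spell out the boundary with constructs the XPath regex dialect does provide: the ^ and $ anchors and the \W escape. In XPath that might look like matches($text, '(^|\W)word(\W|$)'). The sketch below tries the same pattern in Python purely for illustration. The function name contains_word is hypothetical, and Python's idea of a "word character" differs slightly from XML Schema's (the underscore, for example), so treat it as an approximation rather than an exact equivalent.

```python
import re


def contains_word(text: str, word: str) -> bool:
    """Return True if `word` occurs in `text` as a whole word.

    Hypothetical helper: emulates \\bword\\b without using \\b, by requiring
    either a string edge or a non-word character on each side.
    """
    # re.escape is a Python-side convenience so that `word` is taken literally;
    # it is not part of the XPath pattern shown in the text above.
    pattern = r"(^|\W)" + re.escape(word) + r"(\W|$)"
    return re.search(pattern, text) is not None


print(contains_word("a word here", "word"))  # delimited by spaces
print(contains_word("swordfish", "word"))    # embedded in a longer word
```

Unlike \b, the \W alternatives consume the neighbouring character, which matters if the pattern is used for replacement rather than a simple yes/no test.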