Saxon 9.1 and regex

Added by Anonymous over 16 years ago

Legacy ID: #5162272 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Hello Mr Kay! I've found a feature of Saxon that I think could be considered a non-optimal implementation. I have a rather complex regex in xsl:analyze-string, which I put into the regex attribute in the hope that Saxon would prepare it statically. Consider an example:

<xsl:analyze-string regex="\p{{Alphabetic}}" flags="imx" select="$expression">

It happens that the code regex = makeAttributeValueTemplate(regexAtt); in XSLAnalyzeString converts a literal string into a concat(...) of literals, as if it were a non-trivial AVT. This prevents early regex compilation, and as a result getRegexIterator() in AnalyzeString recompiles that regex too often. Thanks.


Replies (6)


RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago

Legacy ID: #5162350 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

A correction: the example should read <xsl:analyze-string regex="\p{{L}}" flags="imx" select="$expression">
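
For readers less familiar with attribute value templates: in an AVT, doubled braces ({{ and }}) stand for literal braces, so the regex attribute above contains no embedded expressions and denotes the constant regex \p{L}. Below is a minimal Java sketch of that unescaping. The class and method names are hypothetical and are not Saxon's API, and java.util.regex merely stands in for Saxon's regex engine.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Saxon: illustrates AVT brace escaping only.
final class AvtLiteral {

    // Returns the constant string denoted by an AVT that contains only
    // doubled braces, or null if a single (unescaped) brace is found,
    // which means the AVT has an embedded expression or is malformed.
    static String constantValue(String avt) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < avt.length(); i++) {
            char c = avt.charAt(i);
            if (c == '{' || c == '}') {
                boolean doubled = i + 1 < avt.length() && avt.charAt(i + 1) == c;
                if (!doubled) {
                    return null;
                }
                i++; // skip the second brace of the pair
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The attribute value from the post: \p{{L}}
        String regex = constantValue("\\p{{L}}");
        System.out.println(regex);
        // The resulting constant can be compiled once, ahead of time.
        System.out.println(Pattern.compile(regex).matcher("x").matches());
        // An AVT with a real expression is not a constant.
        System.out.println(constantValue("{$regex}"));
    }
}
```

Since the value is a constant, nothing in principle prevents the regex from being compiled at stylesheet compile time.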

RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago

Legacy ID: #5163906 Legacy Poster: Michael Kay (mhkay)

Could you please illustrate a complete stylesheet with -explain output? Don't be misled by the fact that the initial code generated contains things like concat() and string-join(). These should be optimized in subsequent phases by the compiler. Michael Kay Saxonica

RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago

Legacy ID: #5164775 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

This is probably too large for an example, but it illustrates what's happening. Please see: http://www.nesterovsky-bros.com/download/xslt/expression-parser.xslt in particular p:tokenize-expression() function. Thanks.

RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago

Legacy ID: #5164862 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I've rechecked the code. concat() is indeed optimized into a literal string; however, by then the decision has already been made: AnalyzeString is constructed without a precompiled pattern. Thus, in the following code, Saxon repeatedly recompiles the regex:

    private RegexIterator getRegexIterator(XPathContext context) throws XPathException {
        CharSequence input = select.evaluateAsString(context);
        RegularExpression re = pattern;
        if (re == null) {
            CharSequence flagstr = flags.evaluateAsString(context);
            final Platform platform = Configuration.getPlatform();
            final int xmlVersion = context.getConfiguration().getXMLVersion();
            re = platform.compileRegularExpression(
                    regex.evaluateAsString(context), xmlVersion,
                    RegularExpression.XPATH_SYNTAX, flagstr);
            if (re.matches("")) {
                dynamicError("The regular expression must not be one that matches a zero-length string",
                        "XTDE1150", context);
            }
        }
        return re.analyze(input);
    }

RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago

Legacy ID: #5166699 Legacy Poster: Michael Kay (mhkay)

Thanks, I will look at this on my return from vacation. Michael Kay http://www.saxonica.com/

RE: Saxon 9.1 and regex - Added by Anonymous about 16 years ago

Legacy ID: #5201172 Legacy Poster: Michael Kay (mhkay)

Thanks, you are quite right. I don't normally issue patches for performance improvements but I decided to make an exception in this case. See https://sourceforge.net/tracker/index.php?func=detail&aid=2079053&group_id=29872&atid=397617.

The patch affects module AnalyzeString.java - as well as handling the case that you identified where the regex contains doubled curly braces, it handles any other case where the AVT has been reduced to a constant string by the end of the optimization phase.

I'm a bit puzzled by this one because I distinctly remember looking at the case where the user writes <xsl:variable name="regex">xxxxxx</xsl:variable> <xsl:analyze-string regex="{$regex}"> and ensuring that this was optimized correctly. But I can't find any trace of the code to do that. I've checked however that with the patch this case is now handled.
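
The idea behind the fix can be sketched as follows: once optimization has finished, check whether the regex operand is a constant and, if so, compile the pattern once rather than on every evaluation. This is an illustration only, not the actual patch to AnalyzeString.java. The Operand interface, the Analyzer class, and the compilation counter are hypothetical; java.util.regex.Pattern stands in for Saxon's RegularExpression, and flag handling is omitted for brevity.

```java
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins, not Saxon classes.
final class PrecompileSketch {

    // An operand is either a constant string or a value computed at run time.
    interface Operand {
        boolean isConstant();
        String evaluate();
    }

    static Operand constant(String value) {
        return new Operand() {
            public boolean isConstant() { return true; }
            public String evaluate() { return value; }
        };
    }

    static Operand dynamic(Supplier<String> source) {
        return new Operand() {
            public boolean isConstant() { return false; }
            public String evaluate() { return source.get(); }
        };
    }

    static final class Analyzer {
        private final Operand regex;
        private Pattern pattern;   // stays null unless precompiled
        int compilations = 0;      // counts compilations, for illustration only

        Analyzer(Operand regex) {
            this.regex = regex;
        }

        // Intended to run after optimization, when an AVT such as \p{{L}}
        // has already been reduced to a constant string.
        void precompileIfConstant() {
            if (pattern == null && regex.isConstant()) {
                pattern = compile(regex.evaluate());
            }
        }

        private Pattern compile(String source) {
            compilations++;
            Pattern p = Pattern.compile(source);
            if (p.matcher("").matches()) {
                throw new IllegalArgumentException(
                        "XTDE1150: regex must not match a zero-length string");
            }
            return p;
        }

        // Uses the precompiled pattern if there is one; otherwise compiles
        // on each call, as in the getRegexIterator() code quoted earlier.
        Matcher analyze(String input) {
            Pattern p = (pattern != null) ? pattern : compile(regex.evaluate());
            return p.matcher(input);
        }
    }

    public static void main(String[] args) {
        Analyzer a = new Analyzer(constant("\\p{L}+"));
        a.precompileIfConstant();
        a.analyze("abc");
        a.analyze("def");
        System.out.println("constant operand, compilations: " + a.compilations);

        Analyzer b = new Analyzer(dynamic(() -> "\\p{L}+"));
        b.precompileIfConstant();
        b.analyze("abc");
        b.analyze("def");
        System.out.println("dynamic operand, compilations: " + b.compilations);
    }
}
```

With a constant operand the pattern is compiled once no matter how many times analyze() is called; with a dynamic operand it is compiled on every call, which is the behaviour reported in this thread.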
