Saxon 9.1 and regex
Added by Anonymous over 16 years ago
Legacy ID: #5162272 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)
Hello Mr Kay! I've found a behaviour in Saxon that I think could be considered a suboptimal implementation. I have a rather complex regex in xsl:analyze-string, which I put into the regex attribute in the hope that Saxon would prepare it statically. Consider an example:

    <xsl:analyze-string regex="\p{{Alphabetic}}" flags="imx" select="$expression">

It happens that the code

    regex = makeAttributeValueTemplate(regexAtt);

in XSLAnalyzeString converts a literal string into a concat(...) of literals, as if it were a nontrivial AVT. This prevents early regex compilation, and as a result getRegexIterator() in AnalyzeString recompiles that regex too often. Thanks.
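For readers unfamiliar with why the doubled braces matter: in an attribute value template, "{{" and "}}" stand for literal braces, so an attribute with only doubled braces is still a constant string. The following is a minimal Java sketch of that check. The class and method names are hypothetical and are not part of Saxon.

```java
import java.util.Optional;

/** Hypothetical helper, not Saxon code: detects AVTs that are really constants. */
final class AvtConstants {

    /**
     * Returns the literal value of an attribute value template if it contains
     * no embedded expressions, or Optional.empty() if it contains one.
     */
    static Optional<String> constantValue(String avt) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < avt.length(); i++) {
            char c = avt.charAt(i);
            if (c == '{' || c == '}') {
                boolean doubled = i + 1 < avt.length() && avt.charAt(i + 1) == c;
                if (doubled) {
                    out.append(c);   // "{{" or "}}" is an escaped literal brace
                    i++;             // skip the second brace of the pair
                } else if (c == '{') {
                    return Optional.empty();   // a single "{" opens an expression
                } else {
                    throw new IllegalArgumentException(
                            "Unmatched '}' in attribute value template");
                }
            } else {
                out.append(c);
            }
        }
        return Optional.of(out.toString());
    }
}
```

With this sketch, constantValue("\\p{{Alphabetic}}") yields the constant \p{Alphabetic}, while constantValue("{$regex}") yields an empty result because the value depends on an expression.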
Replies (6)
RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago
Legacy ID: #5162350 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)
A correction: <xsl:analyze-string regex="\p{{L}}" flags="imx" select="$expression">
RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago
Legacy ID: #5163906 Legacy Poster: Michael Kay (mhkay)
Could you please illustrate a complete stylesheet with -explain output? Don't be misled by the fact that the initial code generated contains things like concat() and string-join(). These should be optimized in subsequent phases by the compiler. Michael Kay Saxonica
RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago
Legacy ID: #5164775 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)
This is probably too large for an example, but it illustrates what's happening. Please see: http://www.nesterovsky-bros.com/download/xslt/expression-parser.xslt in particular p:tokenize-expression() function. Thanks.
RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago
Legacy ID: #5164862 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)
I've rechecked the code. concat() is indeed optimized into a literal string; however, by then the decision has already been made: AnalyzeString is constructed without a precompiled pattern. Thus, in the following code, Saxon repeatedly recompiles the regex:

    private RegexIterator getRegexIterator(XPathContext context) throws XPathException {
        CharSequence input = select.evaluateAsString(context);
        RegularExpression re = pattern;
        if (re == null) {
            CharSequence flagstr = flags.evaluateAsString(context);
            final Platform platform = Configuration.getPlatform();
            final int xmlVersion = context.getConfiguration().getXMLVersion();
            re = platform.compileRegularExpression(
                    regex.evaluateAsString(context), xmlVersion,
                    RegularExpression.XPATH_SYNTAX, flagstr);
            if (re.matches("")) {
                dynamicError("The regular expression must not be one that matches a zero-length string",
                        "XTDE1150", context);
            }
        }
        return re.analyze(input);
    }
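To make the cost concrete, here is a small self-contained Java sketch of the same pattern using java.util.regex rather than Saxon's own classes. RegexInstruction and its compileCount field are hypothetical, introduced only for illustration: when no pattern was compiled at construction time, every evaluation compiles the regex again.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for an analyze-string instruction; not Saxon's class. */
final class RegexInstruction {
    private final String regexSource;  // the regex text
    private final Pattern precompiled; // non-null only if compiled at construction
    int compileCount = 0;              // instrumentation for the demonstration

    RegexInstruction(String regexSource, boolean knownConstant) {
        this.regexSource = regexSource;
        // Compile once up front only when the regex is known to be constant.
        this.precompiled = knownConstant ? compile(regexSource) : null;
    }

    private Pattern compile(String source) {
        compileCount++;
        return Pattern.compile(source);
    }

    /** Mirrors the shape of getRegexIterator(): fall back to compiling per call. */
    boolean matches(String input) {
        Pattern p = (precompiled != null) ? precompiled : compile(regexSource);
        return p.matcher(input).find();
    }
}
```

Calling matches() five times on an instance built with knownConstant set to false compiles the regex five times; with it set to true, the regex is compiled once in the constructor and reused.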
RE: Saxon 9.1 and regex - Added by Anonymous over 16 years ago
Legacy ID: #5166699 Legacy Poster: Michael Kay (mhkay)
Thanks, I will look at this on my return from vacation. Michael Kay http://www.saxonica.com/
RE: Saxon 9.1 and regex - Added by Anonymous about 16 years ago
Legacy ID: #5201172 Legacy Poster: Michael Kay (mhkay)
Thanks, you are quite right. I don't normally issue patches for performance improvements but I decided to make an exception in this case. See https://sourceforge.net/tracker/index.php?func=detail&aid=2079053&group_id=29872&atid=397617.

The patch affects module AnalyzeString.java. As well as handling the case that you identified where the regex contains doubled curly braces, it handles any other case where the AVT has been reduced to a constant string by the end of the optimization phase.

I'm a bit puzzled by this one because I distinctly remember looking at the case where the user writes

    <xsl:variable name="regex">xxxxxx</xsl:variable>
    <xsl:analyze-string regex="{$regex}">

and ensuring that this was optimized correctly. But I can't find any trace of the code to do that. I've checked, however, that with the patch this case is now handled.
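The idea behind the fix, as described above, is to decide whether the regex is constant after optimization rather than when the instruction is first built. The Java sketch below illustrates that ordering with a toy expression tree. Every type and method in it is hypothetical and it is not the actual patch to AnalyzeString.java.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Toy model only: the names here are assumptions, not Saxon's API. */
final class LateRegexBinding {
    interface Expr { }

    static final class Literal implements Expr {
        final String value;
        Literal(String value) { this.value = value; }
    }

    static final class VariableRef implements Expr {
        final String name;
        VariableRef(String name) { this.name = name; }
    }

    static final class Concat implements Expr {
        final List<Expr> parts;
        Concat(List<Expr> parts) { this.parts = parts; }
    }

    /** Constant folding: a concat whose parts are all literals becomes one literal. */
    static Expr optimize(Expr e) {
        if (!(e instanceof Concat)) {
            return e;
        }
        StringBuilder sb = new StringBuilder();
        for (Expr part : ((Concat) e).parts) {
            Expr folded = optimize(part);
            if (!(folded instanceof Literal)) {
                return e; // some part is not constant, so leave the concat alone
            }
            sb.append(((Literal) folded).value);
        }
        return new Literal(sb.toString());
    }

    /** Compiles the pattern only if the regex is a literal once optimization is done. */
    static Pattern precompileIfConstant(Expr regex) {
        Expr optimized = optimize(regex);
        return (optimized instanceof Literal)
                ? Pattern.compile(((Literal) optimized).value)
                : null; // caller must compile at run time
    }
}
```

In this toy model a concat of the literals "\p{" and "L}+" folds to a single literal and is compiled once, whereas a concat that still contains a VariableRef returns null and has to be compiled at run time.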