Dynamic Regex Performance in Saxon
Added by Anonymous over 15 years ago
Legacy ID: #7386618 Legacy Poster: Lea Hayes (numberkruncher)
Hi, I am working on an XPointer implementation in XSLT and to split schemes for a URI I am using the following regular expression "\s*(.?)((((|.?[^^]))|.)*?)". This expression can only handle a maximum of 2 levels of parenthesis nesting. At the moment I am dynamically generating a regular expression which offers at least one level per opening parenthesis in the input URI. Whilst this is working I am concerned about performance problems because: - A URI which contains the fragment: "#scheme(a(b)(c)(d))scheme(d)" only needs a maximum parenthesis depth of 2, but it is getting a regex for 5. This is probably fine for a short fragment like this, but a larger one could end up having lots of opening parenthesis. - I also fear that if Saxon caches regular expressions that this approach could lead to some serious problems. If I could determine the maximum parenthesis depth using the powers of XSLT then this would ease my concerns because the length of the generated regex would not be excessive. The only alternatives I can think of are to either a) limit the maximum parenthesis depth, or b) find another way to split the XPointer fragment into an array of schemes. Many thanks, Lea Hayes
Replies (3)
Please register to reply
RE: Dynamic Regex Performance in Saxon - Added by Anonymous over 15 years ago
Legacy ID: #7387182 Legacy Poster: Michael Kay (mhkay)
Using regular expressions in this way to parse a recursive grammar is just plain wrong. You're asking how to optimize incorrect code. You need to handle this input properly by writing a recursive top-down parser. Dimitre Novatchev has some examples of how to write parsers in XSLT.
RE: Dynamic Regex Performance in Saxon - Added by Anonymous over 15 years ago
Legacy ID: #7388225 Legacy Poster: Lea Hayes (numberkruncher)
I was not happy with this approach, but it was the only way that I could think of splitting this string. The idea of writing a simple recursive descent parser using XSLT had not crossed my mind. I really should have thought about doing this from the beginning. I have not fully tested this yet, but I think that something like below should do the trick. This simply splits an XPointer fragment into separate schemes which is compatible with my previous XPointer code. Thanks! Lea Hayes <xsl:function name="doc:xptr-fragment"> <xsl:param name="fragment"/> <xsl:sequence select="doc:xptr-scheme($fragment)"/> </xsl:function> <xsl:function name="doc:xptr-scheme-name"> <xsl:param name="fragment"/> <xsl:analyze-string select="$fragment" regex="^\s*([^(]+)("> <xsl:matching-substring> <xsl:value-of select="regex-group(1)"/> </xsl:matching-substring> </xsl:analyze-string> </xsl:function> <xsl:function name="doc:xptr-scheme"> <xsl:param name="fragment"/> <!-- Read scheme name, only proceed if a valid one is found! --> <xsl:variable name="scheme-name" select="doc:xptr-scheme-name($fragment)"/> <xsl:if test="string-length($scheme-name)>0"> <xsl:variable name="remain-content" select="substring($fragment,string-length($scheme-name)+2)"/> <xsl:variable name="scheme-content" select="string-join(doc:xptr-scheme-data($remain-content),'')"/> <doc:scheme type="{$scheme-name}"> <xsl:value-of select="$scheme-content"/> </doc:scheme> <xsl:sequence select="doc:xptr-scheme(substring($remain-content,string-length($scheme-content)+2))"/> </xsl:if> </xsl:function> <xsl:function name="doc:xptr-scheme-data"> <xsl:param name="fragment"/> <!-- Read scheme data... --> <xsl:choose> <xsl:when test="starts-with($fragment,'(')"> <xsl:variable name="content" select="string-join(doc:xptr-scheme-data(substring($fragment,2)),'')"/> <xsl:text>(</xsl:text> <xsl:value-of select="$content"/> <xsl:text>)</xsl:text> <xsl:value-of select="doc:xptr-scheme-data(substring($fragment,string-length($content)+2))"/> </xsl:when> <xsl:when test="not(starts-with($fragment,')'))"> <xsl:variable name="content"> <xsl:analyze-string select="$fragment" regex="^([^()^]|^.)+"> <xsl:matching-substring> <xsl:value-of select="."/> </xsl:matching-substring> </xsl:analyze-string> </xsl:variable> <xsl:value-of select="$content"/> <xsl:variable name="remain-fragment" select="substring($fragment,string-length($content)+1)"/> <xsl:if test="not(starts-with($remain-fragment,')'))"> <xsl:value-of select="doc:xptr-scheme-data($remain-fragment)"/> </xsl:if> </xsl:when> </xsl:choose> </xsl:function> <!-- A simple test element --> <xsl:template match="doc:TEST"> <xsl:sequence select="doc:xptr-fragment('scheme1(a(b^)))scheme2(b(d(e^)f)))')"/> </xsl:template>
RE: Dynamic Regex Performance in Saxon - Added by Anonymous over 15 years ago
Legacy ID: #7388238 Legacy Poster: Michael Kay (mhkay)
Another interesting way to do this would be to replace "(" and ")" by "<a>" and "</a>" respectively, put the result through saxon:parse(), and then process the resulting tree using XPath. (Not quite as easy as that of course, because of special characters).
Please register to reply