Problem with analyze-string

When I use the @analyze-string@ function with Saxon (Saxon-EE 9.7.0.15J), I get a different result from the one BaseX or Altova returns. The input sample is:




	/OPDH/FLOWING SOLUTION/SGDE/Number0983713/EKPH/Sample test/some other keys/
	/some other keys/afdsf/SGDE/Number0983713/some other keys/PIHSAGA/OPDH/FLOWING SOLUTION/some other keys/No exception/EKPH/Sample test/some other keys/

A sample XQuery is:

xquery version "3.1";
//DATA!analyze-string(., '(/OPDH/|/EKPH/|/SGDE/|/some other keys/)(.*?)(/OPDH/|/EKPH/|/SGDE/|/some other keys/)((.*?)(/OPDH/|/EKPH/|/SGDE/|/some other keys/))*')

The result I get with BaseX is:



  
    /OPDH/
    FLOWING SOLUTION
    /SGDE/Number0983713/EKPH/
      Sample test
      /some other keys/
    
  


  
    /some other keys/
    afdsf
    /SGDE/Number0983713/some other keys/PIHSAGA/OPDH/FLOWING SOLUTION/some other keys/No exception/EKPH/
      Sample test
      /some other keys/

while Saxon returns a @group nr="5"@ after a @group nr="6"@:



   
      /OPDH/
      FLOWING SOLUTION
      /SGDE/Number0983713/EKPH/Sample test/some other keys/
         
      
   


   
      /some other keys/
      afdsf
      /SGDE/Number0983713/some other keys/PIHSAGA/OPDH/FLOWING SOLUTION/some other keys/No exception/EKPH/Sample test/some other keys/

The Saxon result seems wrong to me.

Replies (2)

RE: Problem with analyze-string - Added by Michael Kay over 7 years ago

Yes, the Saxon result certainly feels wrong, though the spec isn't very precise in this area (and in fact we deliberately left it imprecise because we found that different regex libraries were handling edge cases differently and we didn't want to constrain implementations excessively in such cases.)

The spec says this:

When a ·capturing sub-expression· is matched more than once (because it is within a construct that allows repetition), then only the last substring that it matched will be captured. Note that this rule is not sufficient in all cases to ensure an unambiguous result, especially in cases where (a) the regular expression contains nested repeating constructs, and/or (b) the repeating construct matches a zero-length string. In such cases it is implementation-dependent which substring is captured. For example given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero length string as the content of the captured subgroup.

But what it fails to say is that "the last substring that it matched" should be interpreted as "the last substring that it matched in the course of a successful match of all enclosing subexpressions". In other words, if you're attempting a match of group 4, and during the attempt you find a match for group 5, but the attempt to match group 4 subsequently fails, then you need to roll back the value of group 5 to what it was before. Saxon isn't doing that.

Saxon does have some logic to clear captured groups that were speculatively matched, but it's not effective in this case, for two reasons: (a) when the attempt to match group 4 at character 75 (the end of the string) fails, it only tries to clear group 5 if it lies beyond character 75, but group 5 is actually the zero-length string AT position 75, so it doesn't meet the condition; and (b) clearing the group (setting it to null) isn't enough: it needs to be reset to its previous value.
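To make the difference between leaving, clearing and restoring a speculative capture concrete, here is a minimal Python sketch. It is not Saxon's regex engine and does not reflect its code: @match_iteration@, @match_tail@ and the three policy names are hypothetical, and the matcher is hand-written for the repeated tail @((.*?)(sep))*@ only. It is applied to the part of the first sample string that follows group 3.

```python
# Toy model of the repeated tail ((.*?)(sep))* from the query above.
# Hypothetical names throughout; an illustration, not Saxon's implementation.
SEPARATORS = ("/OPDH/", "/EKPH/", "/SGDE/", "/some other keys/")


def match_iteration(text, pos, groups):
    """Attempt one pass of ((.*?)(sep)) starting at pos.

    Group 5 is written speculatively before we know whether a separator
    follows, which is what a backtracking matcher does. Returns the new
    position on success, or None on failure (groups may be left dirty).
    Assumes the text contains no newlines, so '.' can match any character.
    """
    for end in range(pos, len(text) + 1):
        groups[5] = text[pos:end]            # speculative capture for (.*?)
        for sep in SEPARATORS:
            if text.startswith(sep, end):
                groups[6] = sep
                groups[4] = text[pos:end + len(sep)]
                return end + len(sep)
    return None


def match_tail(text, pos, policy):
    """Repeat match_iteration until it fails, then apply a cleanup policy.

    policy "keep"    : leave whatever the failed attempt wrote
    policy "clear"   : set group 5 to None after the failed attempt
    policy "restore" : put all groups back to their pre-attempt snapshot
    """
    groups = {4: None, 5: None, 6: None}
    while True:
        snapshot = dict(groups)              # state before this attempt
        new_pos = match_iteration(text, pos, groups)
        if new_pos is None:
            if policy == "restore":
                groups = snapshot
            elif policy == "clear":
                groups[5] = None
            return pos, groups
        pos = new_pos                        # each success consumes input


TAIL = "Number0983713/EKPH/Sample test/some other keys/"

if __name__ == "__main__":
    for policy in ("keep", "clear", "restore"):
        end_pos, groups = match_tail(TAIL, 0, policy)
        print(policy, end_pos, repr(groups[5]), repr(groups[6]))
```

In this sketch only the "restore" policy leaves group 5 as "Sample test" alongside group 6 as "/some other keys/". "keep" leaves a zero-length group 5 positioned at the end of the string, and "clear" loses the earlier value altogether.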

RE: Problem with analyze-string - Added by Michael Kay over 7 years ago

Logged as a bug here:

https://saxonica.plan.io/issues/3162

Please use the bug tracker for any further correspondence.
