Slow template in 9.6 is three times slower in 9.7.0.1

Added by Gioele Barabucci over 8 years ago

Hello,

I am investigating what causes a certain XSLT stylesheet to be very slow when evaluated with Saxon 9.6.0.8. I noticed that the same stylesheet is two to three times slower when run with Saxon-HE 9.7.0.1.

Transforming with 9.6 takes about three minutes, with 9.7 about ten minutes. Profiling shows that most of the templates take exactly the same time in 9.6 and 9.7, except for the three most expensive templates (ranked by average net time).

For 9.6.0.8 (all times in milliseconds):

instruction  count  avg time (gross)  total time (gross)  avg time (net)  total time (net)
template H1   6364            31.069          197723.249          18.042        114817.847
template H2   4891            14.700           71897.247          13.619         66610.369
template H3   7154             1.976           14138.150           1.857         13283.643

For 9.7.0.1:

instruction  count  avg time (gross)  total time (gross)  avg time (net)  total time (net)
template H1   6364            99.613          633935.928          58.235        370609.074
template H2   4891            49.567          242434.252          47.473        232192.494
template H3   7154             3.955           28296.606           3.785         27077.704

This is the code of these three templates (the listing did not survive in this copy of the thread; the pattern they follow is sketched below).
What these templates are trying to achieve is to nest all the H1A, H1B and H1C elements that follow a certain H1 element into the element generated for that H1. Similarly for H2 and H3. (Plus additional rules to deal with the weirdness of the original XML format.)
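The original listing is not reproduced above. Judging from this description and from the expressions quoted later in the thread (the exists(self::X) predicates and preceding-sibling::H1[1] is current()), each of the three templates presumably follows roughly this pattern; the output element name, the mode name and the exact list of element tests are guesses:

    <xsl:template match="H1">
      <h1>
        <xsl:apply-templates/>
        <!-- pull in every following sibling whose nearest preceding H1 is this one;
             the element tests and the mode name are assumptions -->
        <xsl:apply-templates mode="nested"
            select="following-sibling::*
                      [exists(self::H1A) or exists(self::H1B) or exists(self::H1C) or exists(self::H2)]
                      [preceding-sibling::H1[1] is current()]"/>
      </h1>
    </xsl:template>

    <!-- the H2 and H3 templates follow the same pattern one level further down -->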

The data has this format (the element markup did not survive here; only the text content is shown):

    foo foo
    quuz1 quuz2
    bar bar
    qaaz5 qaaz6
    buz buz
    quuz8 quuz9

and the end result should be this:

    foo foo
    quuz1
    quuz2
        bar bar
        qaaz5
        qaaz6
    buz buz
    quuz8
    quuz9
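For concreteness, here is a purely hypothetical input/output pair consistent with the description above; only H1, H1A and H2 are element names actually mentioned in this thread, while doc, H2A, h1, h2 and p are invented:

    <doc>
      <H1>foo foo</H1>
      <H1A>quuz1</H1A>
      <H1A>quuz2</H1A>
      <H2>bar bar</H2>
      <H2A>qaaz5</H2A>
      <H2A>qaaz6</H2A>
      <H1>buz buz</H1>
      <H1A>quuz8</H1A>
      <H1A>quuz9</H1A>
    </doc>

would become

    <doc>
      <h1>foo foo
        <p>quuz1</p>
        <p>quuz2</p>
        <h2>bar bar
          <p>qaaz5</p>
          <p>qaaz6</p>
        </h2>
      </h1>
      <h1>buz buz
        <p>quuz8</p>
        <p>quuz9</p>
      </h1>
    </doc>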

Profiling inside oXygen 17.1 (unfortunately with Saxon 9.6.0.7) shows that most of the time is spent calling the current() function, so the culprit of the slowness may be expressions like preceding-sibling::H1[1] is current().

Are there better ways to express the same equivalence or achieve the same grouping?

How come 9.7.0.1 is so much slower than 9.6 on that code?


Replies (9)


RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Martin Honnen over 8 years ago

I think the problem can be solved using xsl:for-each-group with group-starting-with. I have not tested whether it performs any better. Here is a sample stylesheet showing a recursive function that uses the grouping instruction.
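That stylesheet is not reproduced here; below is a simplified, non-recursive sketch of the same group-starting-with idea. The root and output element names are invented, only the H1 and H2 levels are shown (H3 would follow the same pattern), and it assumes the flat sequence of siblings starts with an H1:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output indent="yes"/>

      <xsl:template match="/*">
        <doc>
          <xsl:for-each-group select="*" group-starting-with="H1">
            <h1>
              <!-- the H1 heading itself -->
              <xsl:apply-templates select="."/>
              <!-- everything up to the next H1, grouped again by H2 -->
              <xsl:for-each-group select="current-group()[position() gt 1]"
                                  group-starting-with="H2">
                <xsl:choose>
                  <xsl:when test="self::H2">
                    <h2>
                      <xsl:apply-templates select="current-group()"/>
                    </h2>
                  </xsl:when>
                  <xsl:otherwise>
                    <!-- H1A/H1B/H1C elements before the first H2 stay directly under the h1 -->
                    <xsl:apply-templates select="current-group()"/>
                  </xsl:otherwise>
                </xsl:choose>
              </xsl:for-each-group>
            </h1>
          </xsl:for-each-group>
        </doc>
      </xsl:template>

    </xsl:stylesheet>

Each xsl:for-each-group makes a single pass over its input, so the grouping cost is linear in the number of siblings rather than quadratic.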

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

Thanks. We will investigate - though it might not be till after the holiday season.

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

I suspect (like Martin Honnen) that there are better ways of writing this code, but before we explore that avenue, I would like to try and understand why the performance regression occurs between 9.6 and 9.7. It's probably some quirk of the optimizer, but optimization is very sensitive to minor variations in the input, so I think it would help us greatly to have access to the full stylesheet, plus a source document that is large enough to demonstrate the problem.

At first sight there is no obvious difference in the -explain output between 9.6 and 9.7 that would explain the difference. The first thing we need to do is to check that we can reproduce your observations.

The suggestion that "most of the time is spent executing the current() function" seems rather strange, since the current() function is not actually executable: it is always converted to a synthetic local variable. What evidence made you come to this conclusion?
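To illustrate that point: a predicate such as

    following-sibling::*[preceding-sibling::H1[1] is current()]

behaves, conceptually, as if it had been written

    for $cur in . return following-sibling::*[preceding-sibling::H1[1] is $cur]

so any time a profiler attributes to current() is really being spent elsewhere, for example in the axis navigation and the is comparison.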

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Gioele Barabucci over 8 years ago

Michael Kay wrote:

I suspect (like Martin Honnen) that there are better ways of writing this code, but before we explore that avenue, I would like to try and understand why the performance regression occurs between 9.6 and 9.7.

Thank you (and Martin) for the help!

However, I would like to point out that this is not exactly a regression. I would call it an exacerbation: 9.7 is slower, but 9.6 is also too slow and requires too much memory to perform that transformation on the original set of data. I had to split the XML into several pieces and recompose it later (with manual adjustments) to make the transformation doable. My hope was (and still is) that using 9.7 plus some adjustments to the XSLT will make it possible to run the whole transformation in one go.

It's probably some quirk of the optimizer, but optimization is very sensitive to minor variations in the input, so I think it would help us greatly to have access to the full stylesheet, plus a source document that is large enough to demonstrate the problem.

I sent the full stylesheet and all the source documents via email.

The suggestion that "most of the time is spent executing the current() function" seems rather strange, since the current() function is not actually executable: it is always converted to a synthetic local variable. What evidence made you come to this conclusion?

It seems strange to me as well, but this is all I have been able to find out up to now.

First, using the profiler of oXygen 17.1 (with Saxon 9.6.0.7) I saw that most of the time was spent in the three templates for H1, H2 and H3, with very little time spent in the templates for the nested elements. This is in line with what I see using the Saxon profiler for 9.6.0.8 and 9.7.0.1.

Second, the oXygen profiler shows that the time spent in H1, H2 and H3 goes mostly to function-call (name="current"). The number of invocations reported is in line with the number of elements selected by the outermost XPath (e.g. the number of following-sibling::H2 elements of each H1).

I read these two clues as "Most of the time is spent finding out which of the following siblings should be transformed rather than doing the transformations. This is slow because of the many comparisons between preceding-sibling::H1[1] and current()".

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

I've reproduced the problem: in my initial tests (running Saxon-EE with bytecode enabled) I get an execution time of 2m 15s under 9.6, compared with 7m 39s under 9.7.

I then halved the size of the input file and this reduced the execution times to 22s and 1m 14s respectively: almost a factor of 8 in each case, suggesting that the algorithm used is probably O(n^3). That is worse than I expected: I would have expected O(n^2) from the code given above. (Halving the input should cut an O(n^2) cost by a factor of 4 and an O(n^3) cost by a factor of 8.)

(I noticed afterwards that the 9.6 figures were using Java 6, the 9.7 figures were using Java 8. Subsequent measurements are under Java 6.)

I then ran Java profiling for both releases. Most of the time was spent where one would expect: evaluating the following-sibling and preceding-sibling axes, and the boolean predicates used to filter them. The main difference between 9.6 and 9.7 is that 9.7 is spending a lot of time in ExpressionTool.eagerEvaluate(), which is a method used to decide between alternative evaluation strategies, and which should normally only be called at stylesheet compile time. I will now investigate why this is being called at run-time.

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

It seems that 9.6 is doing a much more efficient job of evaluating exists(self::X) than 9.7. Within a filter expression, the matches() method calls predicate.effectiveBooleanValue(). In 9.6 "predicate" is an instance of Exists, which has an effectiveBooleanValue() method that in this case calls arg[0].iterate(context).hasNext(). In 9.7, classes such as Exists represent the function rather than the function call (a consequence of the redesign to accommodate higher-order functions), so the call on effectiveBooleanValue is no longer as streamlined.

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

I made a change to the code for Exists in 9.7 to restore the custom implementation of effectiveBooleanValue(), and this reduces the execution time to 26s, which is pretty well the same as the 9.6 execution time. Need to do the same for the empty() function, and also to look at whether there are other functions that merit similar treatment. This basically fixes the regression between 9.6 and 9.7: I'll take a look now at how to improve the algorithm.

For follow-up, see this bug entry:

https://saxonica.plan.io/issues/2565

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay over 8 years ago

Let's start with the expression used to select the nested elements, which, from the discussion above, is essentially of this form:

    following-sibling::*[exists(self::H1A) or exists(self::H1B) or exists(self::H1C)]
                        [preceding-sibling::H1[1] is current()]

The problem here is obviously that all 10000 following siblings are being considered, when those after the next H1 are never going to satisfy the predicate.

In theory Saxon could recognize this pattern and optimize it. But there's a law of diminishing returns as optimizations become more and more specialised: in particular, every new optimization multiplies the number of tests needed to ensure that it's bug-free. So let's focus on how it can be rewritten.

One way is to use a recursive function.
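That function is not reproduced here; a sketch of what it might look like follows. The name f:siblings-until is taken from Gioele's reply further down; the namespace bindings, parameters and stop-name handling are assumptions:

    <!-- assumes xmlns:f="urn:example:f" and xmlns:xs="http://www.w3.org/2001/XMLSchema"
         are declared on the stylesheet element -->
    <xsl:function name="f:siblings-until" as="element()*">
      <!-- returns the elements that follow $e, up to but not including
           the next element whose name is in $stop-names -->
      <xsl:param name="e" as="element()"/>
      <xsl:param name="stop-names" as="xs:string*"/>
      <xsl:variable name="next" select="$e/following-sibling::*[1]"/>
      <xsl:if test="exists($next) and not(local-name($next) = $stop-names)">
        <xsl:sequence select="$next, f:siblings-until($next, $stop-names)"/>
      </xsl:if>
    </xsl:function>

The walk stops at the next element named in $stop-names, so each heading only looks as far ahead as its own section instead of filtering every following sibling in the document.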
But it's much easier, and probably more efficient, to use the positional grouping constructs built into the language (that is, xsl:for-each-group with group-starting-with). Martin Honnen has already given you a good steer on how to approach this. His solution should have linear performance (i.e. elapsed time proportional to input size), which is far better than you are achieving now.

RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Gioele Barabucci over 8 years ago

Thank you for all the suggestions.

I didn't write the original code and there are only a few tests, so I adapted the f:siblings-until function and made only minimal changes to the logic. The transformation is now quite fast even using 9.6 (theoretically O(n^2) because the lookahead is unbounded, but in practice O(n)).
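Presumably the H1 template then ends up looking something like this (hypothetical, with the element names and mode from the sketches above):

    <xsl:template match="H1">
      <h1>
        <xsl:apply-templates/>
        <!-- walk forward only as far as the next H1, then keep the elements
             that belong directly under this heading -->
        <xsl:apply-templates mode="nested"
            select="f:siblings-until(., ('H1'))
                      [self::H1A or self::H1B or self::H1C or self::H2]"/>
      </h1>
    </xsl:template>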
