Slow template in 9.6 is three times slower in 9.7.0.1
Added by Gioele Barabucci about 9 years ago
Hello,
I am investigating what causes a certain XSLT to be very slow when evaluated with 9.6.0.8. I noticed that the same XSLT is two to three times slower when applied by Saxon-HE 9.7.0.1.
Transforming with 9.6 takes about three minutes, with 9.7 about ten minutes. Profiling shows that most of the templates take exactly the same time in 9.6 and 9.7, except the three most expensive templates (average net time).
For 9.6.0.8:
instruction     count   avg time (gross)   total time (gross)   avg time (net)   total time (net)
template H1      6364             31.069          197723.249           18.042          114817.847
template H2      4891             14.700           71897.247           13.619           66610.369
template H3      7154              1.976           14138.150            1.857           13283.643
For 9.7.0.1:
instruction     count   avg time (gross)   total time (gross)   avg time (net)   total time (net)
template H1      6364             99.613          633935.928           58.235          370609.074
template H2      4891             49.567          242434.252           47.473          232192.494
template H3      7154              3.955           28296.606            3.785           27077.704
This is the code of these three templates:
What this template is trying to achieve is to nest all the H1A, H1B and H1C elements that follow a certain H1 element into the element generated for that H1. Similarly for H2 and H3. (Plus additional rules to deal with the weirdness of the original XML format.)
The data has this format:
foo foo¶
quuz1 quuz2
bar bar¶
qaaz5 qaaz6
buz buz¶
quuz8 quuz9
and the end result should be this:
foo foo quuz1 quuz2 bar bar qaaz5 qaaz6 buz buz quuz8 quuz9
Profiling inside oXygen 17.1 (unfortunately with Saxon 9.6.0.7) shows that most of the time is spent calling the current() function, so the culprit of the slowness may be expressions like preceding-sibling::H1[1] is current().
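For illustration, a template built around that kind of test might look roughly like this (a simplified sketch of the pattern with an assumed output wrapper, not the actual template):

<!-- Sketch only: assumes each H1 and its trailing H1A/H1B/H1C siblings
     are to be wrapped in one output element. -->
<xsl:template match="H1">
  <section>
    <xsl:apply-templates/>
    <!-- Select every later H1A/H1B/H1C whose nearest preceding H1 is the
         current H1; the "is current()" node comparison is what the profiler
         seems to attribute the time to. -->
    <xsl:apply-templates select="following-sibling::*[self::H1A or self::H1B or self::H1C]
                                                     [preceding-sibling::H1[1] is current()]"/>
  </section>
</xsl:template>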
Are there better ways to express the same equivalence or achieve the same grouping?
How come 9.7.0.1 is so much slower than 9.6 on that code?
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Martin Honnen about 9 years ago
I think the problem can be solved using xsl:for-each-group with group-starting-with. I have not tested whether it performs any better; here is a sample stylesheet showing a recursive function using the grouping instruction:
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
Thanks. We will investigate - though it might not be till after the holiday season.
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
I suspect (like Martin Honnen) that there are better ways of writing this code, but before we explore that avenue, I would like to try and understand why the performance regression occurs between 9.6 and 9.7. It's probably some quirk of the optimizer, but optimization is very sensitive to minor variations in the input, so I think it would help us greatly to have access to the full stylesheet, plus a source document that is large enough to demonstrate the problem.
At first sight there is no obvious difference in the -explain output between 9.6 and 9.7 that would explain the difference. The first thing we need to do is to check that we can reproduce your observations.
The suggestion that "most of the time is spent executing the current() function" seems rather strange, since the current() function is not actually executable: it is always converted to a synthetic local variable. What evidence made you come to this conclusion?
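To illustrate what that conversion amounts to, an expression such as the one quoted earlier behaves as if the context node had first been bound to an ordinary variable (a conceptual rewrite only, not Saxon's internal representation):

<xsl:variable name="this" select="."/>
<xsl:apply-templates select="following-sibling::H1A[preceding-sibling::H1[1] is $this]"/>

so there is no current() call left to execute at run time.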
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Gioele Barabucci almost 9 years ago
Michael Kay wrote:
I suspect (like Martin Honnen) that there are better ways of writing this code, but before we explore that avenue, I would like to try and understand why the performance regression occurs between 9.6 and 9.7.
Thank you (and Martin) for the help!
However, I would like to point out that this is not exactly a regression. I would call it an exacerbation: 9.7 is slower, but 9.6 is also too slow and requires too much memory to perform that transformation on the original set of data. I had to split the XML into various pieces and recompose it later (with manual adjustments) to make the transformation doable. My hope was (and still is) that using 9.7 plus some adjustments to the XSLT will make it possible to run the whole transformation in one go.
It's probably some quirk of the optimizer, but optimization is very sensitive to minor variations in the input, so I think it would help us greatly to have access to the full stylesheet, plus a source document that is large enough to demonstrate the problem.
I sent the full stylesheet and all the source documents via email.
The suggestion that "most of the time is spent executing the current() function" seems rather strange, since the current() function is not actually executable: it is always converted to a synthetic local variable. What evidence made you come to this conclusion?
It seems strange to me as well, but this is all I have been able to find out up to now.
First, using the profiler of oXygen 17.1 (with Saxon 9.6.0.7) I saw that most of the time was spent in the three templates for H1
, H2
an H3
with very little time spent in the templates for the nested elements. This is in line with what I see using the Saxon profiler for 9.6.0.8 and 9.7.0.1.
Second, the oXygen profiler shows that the time spent in H1, H2 and H3 is spent mostly on function-call (name="current"). The number of invocations reported is in line with the number of elements selected by the outermost XPath (e.g. the number of following-sibling::H2 in H1).
I read these two clues as "Most of the time is spent finding out which of the following siblings should be transformed rather than doing the transformations. This is slow because of the many comparisons between preceding-sibling::H1[1] and current()".
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
I've reproduced the problem: in my initial tests (running Saxon-EE with bytecode enabled) I get an execution time of 2m 15s under 9.6, compared with 7m 39s under 9.7.
I then halved the size of the input file, which reduced the execution times to 22s and 1m 14s respectively: almost a factor of 8 in each case, suggesting that the algorithm used is probably O(n^3). That is worse than I expected: I would have expected O(n^2) from the code given above.
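(For comparison: if the running time grew as n^3, halving the input would divide it by 2^3 = 8, whereas quadratic growth would only give a factor of 2^2 = 4.)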
(I noticed afterwards that the 9.6 figures were using Java 6, the 9.7 figures were using Java 8. Subsequent measurements are under Java 6.)
I then ran Java profiling for both releases. Most of the time was spent where one would expect: evaluating the following-sibling and preceding-sibling axes, and the boolean predicates used to filter them. The main difference between 9.6 and 9.7 is that 9.7 is spending a lot of time in ExpressionTool.eagerEvaluate(), which is a method used to decide between alternative evaluation strategies, and which should normally only be called at stylesheet compile time. I will now investigate why this is being called at run-time.
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
It seems that 9.6 is doing a much more efficient job of evaluating exists(self::X) than 9.7. Within a filter expression, the matches() method is calling predicate.effectiveBooleanValue(). In 9.6 "predicate" is an instance of Exists, which has an effectiveBooleanValue() method which in this case calls arg[0].iterate(context).hasNext(). In 9.7 classes such as Exists represent the function rather than the function call (a consequence of redesign to accommodate higher order functions), so the call on effectiveBooleanValue is no longer as streamlined.
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
I made a change to the code for Exists in 9.7 to restore the custom implementation of effectiveBooleanValue() and this reduces the execution time to 26s, which is pretty-well the same as the 9.6 execution time. Need to do the same for the empty() function, and also to look at whether there are other functions that merit similar treatment. This basically fixes the regression between 9.6 and 9.7: I'll take a look now at how to improve the algorithm.
For follow-up, see this bug entry:
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Michael Kay almost 9 years ago
Let's start with this expression:
The problem here is obviously that all 10000 following-siblings are being considered, when those after the next H1 are never going to satisfy the predicate.
In theory Saxon could recognize this pattern and optimize it. But there's a law of diminishing returns as optimizations become more and more specialised: in particular, every new optimization multiplies the number of tests needed to ensure that it's bug-free. So let's focus on how it can be rewritten.
One way is to use a recursive function:
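In outline, such a function might look like this (an illustrative sketch only; the exact body may differ, and the f and xs namespace prefixes are assumed to be declared on the stylesheet):

<!-- Returns the following siblings of $node up to, but not including,
     the next element whose name appears in $stop. -->
<xsl:function name="f:siblings-until" as="element()*">
  <xsl:param name="node" as="element()"/>
  <xsl:param name="stop" as="xs:string*"/>
  <xsl:variable name="next" select="$node/following-sibling::*[1]"/>
  <xsl:if test="$next and not(local-name($next) = $stop)">
    <xsl:sequence select="$next, f:siblings-until($next, $stop)"/>
  </xsl:if>
</xsl:function>

Called as f:siblings-until(., 'H1') from the H1 template, it stops at the next H1 instead of examining all 10000 following siblings.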
But it's much easier, and probably more efficient, to use the positional grouping constructs built into the language (that is, xsl:for-each-group with group-starting-with). Martin Honnen has already given you a good steer on how to approach this. His solution should have linear performance (i.e. elapsed time proportional to input size), which is far better than you are achieving now.
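In outline (a sketch only; the wrapper element and the match pattern are placeholders):

<xsl:template match="*[H1]">
  <!-- Each group starts at an H1 and runs up to, but not including, the
       next H1, so no searching along the sibling axes is needed. -->
  <xsl:for-each-group select="*" group-starting-with="H1">
    <section>
      <xsl:apply-templates select="current-group()"/>
    </section>
  </xsl:for-each-group>
</xsl:template>

The same grouping can then be applied to current-group() (for example from a recursive function, as in Martin's sample) to nest the H2 and H3 levels.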
RE: Slow template in 9.6 is three times slower in 9.7.0.1 - Added by Gioele Barabucci almost 9 years ago
Thank you for all the suggestions.
I didn't write the code and there are only a few tests, so I adapted the f:siblings-until function and made only minimal changes to the logic. The transformation is now quite fast even using 9.6 (theoretically O(n^2) because the lookahead is unbounded, but in practice O(n)).