Saxon 9.9 performs slower compared to Saxon 9.7 on nested loops
Hi, we are currently using Saxon 9.9 PE. Previously, we used Saxon 9.7 PE. We have an XQuery with 4 levels of loops (3 nested), with where conditions on the for expressions. We have noticed that execution takes much longer in Saxon 9.9 than in Saxon 9.7, and the gap grows as more loops are nested.
In the sample XQuery attached, with 1000 rows of input:
- Saxon 9.7 took an average of 138665ms.
- Saxon 9.9 took an average of 237548ms.
When tested with only 3 levels, it was much faster:
- 9.7 at 3335ms
- 9.9 at 5135ms
According to the profiler, 9.9 spends more time on the where clause than 9.7 does.
#1 Updated by Michael Kay 7 months ago
Execution time for this query in Saxon-EE (not very carefully measured):
- 9.7 - 86.25s
- 9.9 - 46.26s
- 10.3 - 55.23s
For this kind of query the Saxon-EE optimiser really makes a big difference, as you can see.
There's always going to be some variation between releases: optimisation of complex queries involves a lot of guesswork, and changes will generally benefit some queries at the expense of others. We have a reference set of queries that we study very carefully looking for regression between releases, but it's a very small sample compared with what is encountered "in the wild".
#2 Updated by Michael Kay 6 months ago
My timings for Saxon-HE:
- 9.7 - 92.052s
- 9.9 - 164.553s
- 10.3 - 127.545s
I'm a little surprised that the join optimization in Saxon-EE isn't making a bigger difference, but I guess we should focus on the difference between releases rather than the difference between editions.
#3 Updated by Michael Kay 6 months ago
Examining the -explain output, I notice:
(a) 9.7 is rewriting the expression count(X)=0 (which occurs repeatedly) to empty(X); this optimisation does not seem to be working in 10.3.
(b) at line 82, both releases are sorting the result of the expression $out_root/sa/a into document order, which is not necessary for the argument of
I can't see any other obvious differences in the execution plans, apart from the rewrite of
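The count(X)=0 versus empty(X) rewrite mentioned in (a) can be illustrated with a minimal sketch. The data and element names below are hypothetical and are not taken from the attached query. The point is only that the two predicates select the same nodes, while empty() does not need to count every item.

```xquery
xquery version "3.1";

(: Hypothetical sample data, not from the attached query. :)
declare variable $rows :=
  <rows><row id="1"><item/><item/></row><row id="2"/></rows>;

(: As written in the query: count all items, then compare with zero. :)
let $by-count := $rows/row[count(item) = 0]/@id/string()

(: Equivalent form: true as soon as no first item is found. :)
let $by-empty := $rows/row[empty(item)]/@id/string()

(: Both variables select the row with id 2. :)
return ($by-count, $by-empty)
```

Writing empty(X) by hand removes the dependence on whether a given release or edition performs the rewrite.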
#5 Updated by Michael Kay 6 months ago
I looked at both the 9.7 and 10.3 runs in JProfiler and there weren't any blatantly obvious differences in the profiles.
In both profiles, Closure.saveContext() shows up rather more than we would usually expect: this is spending its time saving a copy of local variables to support lazy evaluation.
#6 Updated by Michael Kay 6 months ago
I ran both 9.7 HE and 10.3 HE with the -TP option. Oddly, under these conditions 9.7 was slower. The execution counts for the different functions were identical, and the relative time in each function was very similar in both cases. The cost is dominated by the function local:out_rowd on line 3, which is executed 98,010 times.
I note that neither release is eliminating the unused position variable $out_rowd_i, which might inhibit further optimisations.
#7 Updated by Michael Kay 6 months ago
A minor but possibly significant difference between the execution plans is that in 9.7, the evaluation mode for variable $c on line 14 is MAKE_CLOSURE, while in 10.3, it is MAKE_MEMO_CLOSURE.
I'm surprised that neither 9.7 nor 10.3 loop-lifts the let expression at line 14 out of the containing FLWOR expression, since it seems to have no dependencies on the outer clauses.
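As a rough sketch of what loop-lifting means here, the let clause can be moved by hand in front of the for clauses it does not depend on. The data, element names, and function names below are hypothetical; the real declaration of $c on line 14 of the attached query is not reproduced.

```xquery
xquery version "3.1";

(: Hypothetical sample data; element names are illustrative only. :)
declare variable $doc :=
  <doc>
    <a k="1"/><a k="2"/>
    <b k="1"/><b k="3"/>
    <lookup><entry k="1"/></lookup>
  </doc>;

(: Before: $c sits inside both loops although it depends on neither
   $a nor $b, so it is bound again for every ($a, $b) pair unless the
   optimizer lifts it. :)
declare function local:inner-let() as xs:string* {
  for $a in $doc/a
  for $b in $doc/b
  let $c := $doc/lookup/entry/@k
  where $a/@k = $b/@k and $a/@k = $c
  return string($a/@k)
};

(: After: the same binding hoisted manually, so it is made once. :)
declare function local:hoisted-let() as xs:string* {
  let $c := $doc/lookup/entry/@k
  for $a in $doc/a
  for $b in $doc/b
  where $a/@k = $b/@k and $a/@k = $c
  return string($a/@k)
};

(local:inner-let(), local:hoisted-let())
```

Both functions return the same result; only the position of the let clause differs. Hoisting by hand is safe only when the expression really has no dependency on the loop variables.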
#8 Updated by Michael Kay 6 months ago
If we drop the unused variable at $out_rowd_i, the execution time in 10.3 HE comes down to 48.17s. This doesn't answer the question of why the performance regression occurred, but it does illustrate how sensitive the performance can be to tiny details of the optimization plan. Frankly, I'm more interested in identifying optimisation opportunities like this than in explaining differences between releases.
Looking at the explain output, the removal of this variable results in the declaration of $c being loop-lifted.
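The change to the query amounts to deleting the "at" clause when the position is never read. A minimal sketch follows, with hypothetical data and function names standing in for local:out_rowd and $out_rowd_i:

```xquery
xquery version "3.1";

(: Hypothetical sample data, not from the attached query. :)
declare variable $rows := <rows><row>a</row><row/><row>b</row></rows>;

(: Before: $i is bound on every iteration but never referenced. :)
declare function local:with-position($in as element(row)*) as xs:string* {
  for $r at $i in $in
  where string($r) ne ''
  return string($r)
};

(: After: the positional variable is dropped; the result is unchanged. :)
declare function local:without-position($in as element(row)*) as xs:string* {
  for $r in $in
  where string($r) ne ''
  return string($r)
};

(local:with-position($rows/row), local:without-position($rows/row))
```

On releases without the fix, removing the variable in the query source has the same effect as the optimizer dropping it.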
I'm looking now at the optimisation paths in the debugger. There has been some reorganisation of optimiser logic between releases, and some optimisations are now present in the EE optimiser only: this is why count(X)=0 is no longer rewritten as empty(X).
#9 Updated by Michael Kay 6 months ago
I'm going to test and commit the code to drop the position variable if it is unused, and then close this bug.
The change appears to reduce the elapsed time of the query to around 22s.
We sometimes get a query like this where there are so many optimisation opportunities, it's almost inevitable that different releases will handle the optimisation differently, and there's no real way of knowing which optimisations are going to be the most useful. It's a great discovery that removing an unused positional variable can make such a difference, but it's not going to affect many users, because unused variables don't arise that often.
Almost certainly the query could be speeded up significantly by adding type declarations to the parameters of functions. There are also expressions in the query that are clearly wrong, like @xsi:nil['true'] where almost certainly @xsi:nil[.='true'] was intended. So there's probably not that much benefit in further analysis of the 9.7/10.3 differences.
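The difference between the two predicates, together with a typed function signature, can be sketched as follows. The data and the function local:is-nil are hypothetical. A predicate consisting of the string literal 'true' has an effective boolean value of true for every node, so it filters nothing.

```xquery
xquery version "3.1";

declare namespace xsi = "http://www.w3.org/2001/XMLSchema-instance";

(: Hypothetical sample data, not from the attached query. :)
declare variable $rows :=
  <rows xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <row xsi:nil="true"/>
    <row xsi:nil="false"/>
    <row/>
  </rows>;

(: Hypothetical helper with declared parameter and result types, so the
   processor knows statically that $e is exactly one element. :)
declare function local:is-nil($e as element()) as xs:boolean {
  exists($e/@xsi:nil[. = 'true'])
};

(
  (: The literal 'true' is always true, so this matches any row that has
     an xsi:nil attribute, whatever its value: 2 rows. :)
  count($rows/row[@xsi:nil['true']]),

  (: Compares the attribute value with 'true': 1 row. :)
  count($rows/row[@xsi:nil[. = 'true']]),

  (: The same test through the typed helper: 1 row. :)
  count($rows/row[local:is-nil(.)])
)
```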
#10 Updated by Michael Kay 6 months ago
- Tracker changed from Support to Bug
- Subject changed from Saxon 99 performs slower compared to Saxon 97 on nested loops to Saxon 9.9 performs slower compared to Saxon 9.7 on nested loops
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch 10, trunk added
- Fix Committed on Branch 10, trunk added