Bug #3398
closedStylesheet runs in 25s on EE-9.5, takes 10hrs+ on EE-9.7 or EE-9.8
Added by Michael Kay over 7 years ago. Updated almost 7 years ago.
100%
Description
A customer has submitted a stylesheet that shows a pathological performance regression between 9.5 and 9.7.
A -TP profile shows that nearly all the time is spent in one template. This is doing a join. Using -explain suggests that the join is being properly optimized to use indexes in both releases, though the detail of the execution plan is slightly different.
However, outputting timed xsl:message output indicates that there is a significant slowdown even before that template is reached: in 9.5 the first message (after the starting message) is output after 2s, on 9.8 it takes 3m.
Updated by Michael Kay over 7 years ago
I modified the stylesheet to stop after the first phase, in which variable groupAllItemsByParentID is created (this phase takes 2s on 9.5 and 3m on 9.8).
The 9.8 Java profile is dominated by calls on GeneralComparisonEE.compareOneToOne, called from FilterIterator.next(), which in turn is called from a DocumentOrderIterator, implying that something is being sorted into document order.
This appears to be the expression <xsl:for-each select="$allItems/*[@PARENT_PRODUCTGROUP_ID = 'null']"> on line 49.
$allItems is a global variable declared as [xsl:variable name="allItems" select="*"/], so we wouldn't expect a sort into document order to be necessary here.
Because the template is called recursively, this sort will be done repeatedly, so it could be significant. (However, the profile doesn't indicate that we are spending a lot of time sorting).
More significantly, 9.5 uses indexing for this filter expression, 9.8 does not. I think the reason for this is that we're searching for a literal, and 9.8 is assuming that there's no point building an index if we're searching for the same value every time: in fact, the whole expression $allItems/*[@PARENT_PRODUCTGROUP_ID = 'null'] should be promoted to a global variable since it has no local dependencies, and evaluated once. We need to see why this isn't happening.
Updated by Michael Kay over 7 years ago
But now we hit a complication. If we set -jit:off to suppress just-in-time compilation of template rules, execution time under 9.8 comes down to a few seconds. But this is a 9.8-specific feature, this can't have any relevance under 9.7.
It's highly relevant for 9.8, however, because with JIT enabled, expressions in the body of a template rule are never extracted as global variables, because of the complexity of creating new global variables on the fly. Perhaps we should put the 9.8 investigation to one side and focus first on 9.7.
Updated by Michael Kay over 7 years ago
With 9.7, we still haven't completed the first phase after 18 minutes: much worse than 9.8.
What on earth can be taking so long? Running it in the debugger to find out...
There appears to be a problem in the template
<xsl:template name="createHierarchyNamesFromReport">
<xsl:param name="self"/>
<xsl:for-each select="$pgrLangItems/items/*[@PRODUCTGROUP_ID = $self]">
<name lang="{@PRODUCTGROUP_LANGUAGE}">
<xsl:value-of select="@PRODUCTGROUP_NAME"/>
</name>
</xsl:for-each>
</xsl:template>
The expression $pgrLangItems/items/* has been promoted to a global variable, but for some reason it isn't indexed. Rather perversely, however, $self is indexed (the predicate is evaluated as an IndexedLookupExpression). This isn't as perverse as it seems, because the parameter $self has no declared type (in fact, nothing in this stylesheet has a declared type) so the compiler has to assume the worst.
But the IndexedLookupExpression isn't working as it should: it's taking the recovery path where it discovers that the value of $self is not an indexedValue, and so it creates an IndexedValue. If it were using the parameter $self directly, it would then put the IndexedValue back in the variable slot, but it's actually using a synthetic variable holding data(self), and it doesn't do so in this case. Fixing this, however, doesn't solve the problem.
I have established that by declaring the types of the template parameters on lines 74 and 88 using as="xs:string", execution time (of phase 1) comes down from 15+ minutes to 3 seconds.
I think what's happening is that we first optimize the filter predicate [@ATT = $self] as an IndexedLookupExpression, in the belief that $self might in the worst case be a large node-set. Having done this, the predicate is no longer a Comparison, which means that the outer filter expression is no longer a candidate for indexing. So we've done a low-value optimization that inhibited a subsequent high-value optimization. Yet another case where the optimizer is thwarted by a lack of type information.
Updated by Michael Kay over 7 years ago
By avoiding the conversion of a GeneralComparisonEE to an IndexedLookupExpression if its parent is a FilterExpression, we bring the elapsed time for the transformation under 9.7 to 1m 55s.
For 9.8 the effect is very similar: 2m 2s.
I will do a bit more probing though before closing this, to see if we can do better.
Updated by Michael Kay over 7 years ago
- Status changed from New to In Progress
Under 9.8, the Java trace again shows a significant amount of time in GeneralComparison.compare(), called from FilterIterator.next() -- though it's nothing like as dominant as it was.
I tried it again with -TP to get a Saxon-level profile and it ran "for ever". Unfortunately -TP injects code into the expression tree and this can inhibit some optimizations. But there's only one gcEE in the compiled stylesheet, in the template createHierarchyFromReport. Why is this still not being indexed? The very similar expression in createHierarchyNamesFromReport is converted into a call on key().
The difference is that $pgrLangItems has an inferred type of document-node(), whereas $allItems is a global variable with select="*", so we know nothing about its type. In particular, we don't know that the global context item will be a single document item, so we can't handle this using key(). It's hard to see how we can do better than this without changing the stylesheet - and if we're going to change the stylesheet, then we can solve the problem easily by using an expliclt key.
Going back to 9.5 I'm now seeing a time of 1m 15s which isn't quite the 25s I quoted in the title, but is still better than the 9.7 and 9.8 times. Looking at the 9.5 -explain output, for the createHierarchyFromReport template, Saxon 9.5 has extracted a global variable holding $allItems/*, and has defined an index on this with use=@PARENT_PRODUCT_GROUP, while 9.7 and 9.8 haven't achieved this optimization.
Updated by Michael Kay over 7 years ago
Looking at 9.8, the reason an IndexedFilterExpression doesn't get created is that we only do this if the base B of the filter expression B[P] is a variable reference. The expression $allItems/* won't be turned into a local variable reference because it's not in a loop, and it won't be turned into a global variable reference until a very late stage of optimization.
However, the actual problem seems to be that the expression $allItems/* is being treated by the GlobalExtractor as ineligible for conversion to a global variable. This is happening because its estimated cost is 2, and the threshold for creating a global variable is 3. It seems we are underestimating the cost. The cost metric used by the GlobalExtractor seems to have been simplified in order to reduce the cost of estimating the cost...
If we fix the cost metric, there is another reason the expression $allItems/* doesn't get promoted to global: the logic quite incorrectly decides that because child::element is dependent on the focus, neither it nor any of its ancestor expressions can be promoted to global.
When we fix this, the expression does get promoted, and an IndexedFilterExpression does get created. However, the performance doesn't improve very much: only to 1m 51s.
Updated by Michael Kay over 7 years ago
I took a more careful measurement, after suspending background tasks, and it finished in 54s. But under the same conditions, 9.5 is finishing in 32s.
The Java profile now shows nothing untoward: performance is dominated by XML parsing, serialization, whitespace stripping, indexing. So I think we'll close it with these changes (which still need to be implemented at production quality).
I don't propose to make any changes on the 9.7 branch: there's too much risk of causing regression to existing workloads.
Updated by Michael Kay over 7 years ago
- Status changed from In Progress to Resolved
- Applies to branch 9.7, 9.8 added
- Fix Committed on Branch 9.8 added
Patch applied on 9.8 branch.
No action for 9.7.
Updated by O'Neil Delpratt about 7 years ago
- Status changed from Resolved to Closed
- Sprint/Milestone set to 9.8.0.5
- % Done changed from 0 to 100
Big fix applied in the Saxon 9.8.0.5 maintenance release
Updated by Michael Kay almost 7 years ago
- Fix Committed on Branch 9.7 added
Since the change proved successful for 9.8, and we're planning a 9.7 maintenance release, I decided to make the commit there as well.
Please register to edit this issue