Bug #2707: Performance regression compiling very large stylesheets - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #2707

closed

Performance regression compiling very large stylesheets

Added by Michael Kay over 8 years ago. Updated about 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2016-04-11

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

9.7, trunk

Fix Committed on Branch:

9.7, trunk

Fixed in Maintenance Release:

9.7.0.6

Platforms:

Description

When compiling a very large stylesheet (of the order of 10Mb), performance in 9.7 is somewhat worse than 9.6 and a lot worse than 8.8.

The XSLT code is produced by a code generator and is very monolithic. Compile time greatly exceeds transformation time.

Related issues

Actions

Copy link

Updated by Michael Kay over 8 years ago

Status changed from New to In Progress

I've taken a Java profile and it's clear that a lot of time is spent chasing around analyzing the usage of local variables within templates. There are a number of paths here that are quadratic in the size of the template (because we perform an operation once per local variable, and the cost of the operation is proportional to the scope of the variable). Examples include:

counting how many references there are to a variable, used when deciding whether the variable can be inlined
determining whether all references to a variable are atomized/stringified, used to avoid constructing a document tree for the commonly-used inefficient construct

<xsl:variable name="x">

<xsl:value-of select="x"/>

</xsl:variable>

deciding whether a variable needs to be eagerly evaluated because there is a reference to the variable within a try/catch clause.

In previous releases we counted the references to a variable during type checking to avoid doing an extra pass. This was found to be unreliable as we weren't recomputing the value after optimization rewrites. This may account for some of the regression.

We now need to investigate which of the various analysis operations is contributing most heavily to the cost. This isn't easy since they use a lot of common code. It may be best to use "subtractive measurement", that is, to see how much the performance improves if particular steps are eliminated.

Actions

Copy link

Updated by Michael Kay over 8 years ago

Baseline cost for compiling the stylesheets, with bytecode generation switched off, is 23.7 seconds. Note: I have disabled binding of extension functions, by making all extension function calls resolve to an empty sequence.

Removing the check on "allReferencesAreFlattened()" at LetExpression#211 makes no measurable difference.

Cutting out the code in Assignation.refineTypeInformation makes only a small difference (->22.2 seconds)

Cutting out LetExpression.recomputeRefCount() makes a slightly bigger difference (-> 20.7 seconds)

Cutting out LetExpression.requiresEagerEvaluation() also makes only a modest difference (->22.0 seconds)

Cutting out Expression.restoreParentPointers() makes no difference.

Cutting out ExpressionTool.inlineVariableReferences() does better (-> 18.9 seconds)

Actions

Copy link

Updated by Michael Kay over 8 years ago

The number of variables being inlined is actually quite modest, so it's surprising that the cost of doing it is (apparently) so high.

One of the variables being inlined a lot is DFCounterMode, which is declared using the classic:

<xsl:variable name="DFCounterMode">

<xsl:value-of select="'ShowOnly'"/>

</xsl:variable>

(rather than the more efficient <xsl:variable name="DFCounterMode" select=""'ShowOnly'"/>)

It might be worth exploring whether inlining an RTF variable is particularly expensive.

Actions

Copy link

Updated by Michael Kay over 8 years ago

As an experiment I tried exporting the stylesheet using -export:all.xsltp. The export file was 85Mb. Reloading the stylesheet from the export file took just 4.3 seconds. So even if we can reduce the compilation time, loading the stylesheet from the export file rather than recompiling it each time sounds like a good strategy.

Actions

Copy link

Updated by Michael Kay over 8 years ago

I ran the compilation with -opt:0 and this reduced the cost to 10.9 seconds. That's probably not an unreasonable strategy when compile time dominates execution time quite so dramatically as in this stylesheet. It also gives us a feel for how much of the time is being spent in optimization.

Actions

Copy link

Updated by Michael Kay over 8 years ago

Generally, when counting the number of references that exist to a variable, we are only interested in the answer 0, 1, or >1. I have therefore changed the code to stop counting once it finds more than some threshold (4). This reduces the compilation time to 16 seconds.

I am now going to calibrate this against previous releases.

Actions

Copy link

Updated by Michael Kay over 8 years ago

To get this to compile on Saxon 9.6 I had to increase the memory available (I set it to 2gb). With this setting I got a compile time of 57 seconds, considerably slower than the 9.7 figure even before this improvement.

For 9.5 I got a time of 117 seconds.

For 9.4 - 11.4 seconds

For 9.1 - 10.7 seconds. This figure is comparable to the 9.7 figure running with optimization switched off.

This suggests a serious regression between 9.4 and 9.5, and gradual improvement since.

Actions

Copy link

Updated by Michael Kay over 8 years ago

Returning to 9.7, I have reinstated the original code that handles references to unresolved extension functions (in 1.0 compatibility mode, such references do not cause a compile-time error, but rather cause code to be generated that forces a run-time error). After reinstating this code, the compile time increases to 52 seconds.

This casts doubt on the previous measurements. It's possible, for example, that by artificially suppressing the extension function calls we were enabling removal of (apparently) dead code and therefore eliminating some of the costs of optimizing this code.

Switching optimization off is now less effective: it only reduces the time to 45 seconds. With -opt:0, the Java profile is still dominated by countReferences, gatherVariableReferences, contains, and requiresEagerEvaluation, all of which are operations that walk the expression tree at compile time. The Java profile shows no evidence that the extension functions have any significant impact.

After reinstating the original code for extension functions, the patch to reduce the cost of counting variable references makes very little difference.

Exporting the stylesheet now produces a 130Mb export file (confirming that the previous attempt was causing "dead" code to be removed), but it still loads in 6seconds, so this remains a good strategy.

Actions

Copy link

Updated by Michael Kay over 8 years ago

I've now tried completely eliminating LetExpression.optimize(), and it hardly affects the compile time at all. The is consistent with the observation that running with -opt:0 has very little impact. But it's very puzzling, because a lot of the methods that appear heavily in the Java profile are called from LetExpression.optimize().

Actions

Copy link

#10

Updated by Michael Kay over 8 years ago

I tried some other monitoring options in JVisualVM and decided to look at the paths for loading extension functions. In this particular case the extension functions aren't present so loading always fails. I decided to experiment with a cache of failed class loading attempts so that subsequent attempts to load the same class fail faster. The result was dramatic: compile time reduced to 8.4 seconds.

The problem here is that this is now a completely unrealistic scenario; in the client environment, the load attempts presumably succeed. (Or perhaps not? We know that a lot of the code in the stylesheet is never executed.) But we don't seem to be caching successful attempts to load a class either, and there is surely room for improvement there?

Why the performance regression, though? I added a counter to the DynamicLoader.getClass() method, and we are seeing 201,000 calls of this method in 9.7, but only 23,300 calls in 9.4. So I think we might be homing in on the problem: but it's hard to be sure because we're now going down a rather artificial path given that the external class loading always fails.

In both 9.4 and 9.7, I moved the counter to JavaExtensionLibrary.bind() and the counts are the same, that is, we are making 10 times as many calls to JEL.bind() in 9.7.

Further investigation shows that the extra calls arise in the error path: when we fail to find an extension function with a particular arity, we attempt to look for a function with the same name but different arity, in order to improve the quality of diagnostics. Removing this search (and with no attempt to cache "unknown classes") reduces the compile time to 15 seconds.

So: is it perhaps the case that in the "live" environment, there are many calls to extension functions that are not actually present on the classpath?

Actions

Copy link

#11

Updated by Michael Kay over 8 years ago

Status changed from In Progress to AwaitingInfo

Actions

Copy link

#12

Updated by Michael Kay over 8 years ago

Status changed from AwaitingInfo to In Progress

We have now received copies of the JAR files containing the extension functions being called. First run with this on the classpath (on 9.7) shows the compilation time slowing to 2m 5s. The Java trace is dominated by code involved with binding the extension functions.

Further investigation shows that the class loading was actually failing because the JAR files were compiled under JDK 1.7 and I was running under JDK 1.6 (in order to use runhprof). Switched to running under JDK 1.8. Class loading now failed (fatally) because the JAR files have unsatisfied dependencies (it seems Saxon, when attempting to load extension functions in backwards compatibility mode, treats a "class not found" error as recoverable, but other class loading errors as fatal).

Downloaded and installed JFreeChart. Now fails:

java.lang.NoClassDefFoundError: com/assentis/docbase/xslt/exception/StyleSheetExtensionException

I've made various attempts to download additional libraries on which these JAR files depend, without success. So this line of attack doesn't seem to be leading anywhere.

What does seem clear is that whether or not class loading succeeds, it is taking a lot of time, and there is potential for reducing that by remembering the result of class loading attempts (whether successful or not). The only slight difficulty here is the technicality that it's in principle possible to make two class loading attempts with different ClassLoaders and get different results.

Actions

Copy link

#13

Updated by Michael Kay over 8 years ago

I tried another angle. I generated a 25Mb stylesheet consisting largely of calls to extension functions - functions that are known to exist.

9.7.0.5:

Stylesheet compilation time: 15.781888s (15781.888479ms)

Execution time: 38.008968ms

Memory used: 1066417208

9.4.0.9:

Stylesheet compilation time: 10536 milliseconds

Execution time: 34ms

Memory used: 820940072

So for this case, there's a bit of regression from 9.4 to 9.7 but nothing terribly serious.

Interestingly, in my first attempt at this I inadvertently generated a stylesheet in which one of the methods called (20,000 times...) was misspelt. With this version, the compile time under 9.4 was 91.4 seconds compared with 17.4 seconds for 9.7. But previously, we concluded that calls to absent extension functions were causing a problem in 9.7. So clearly we don't yet fully understand what is going on.

I tried caching the class loader attempts (in 9.7) and it didn't make much difference: compile time down from 17.4 seconds to 14.8, despite the fact that all but 1 of about 280,000 attempts were hitting the cache (the extension function calls generated were to a dozen different methods in java.lang.Math).

There's a hint in the Java trace that some of the extra cost in 9.7 might be in looking for XSLT 3.0 shadow attributes in the stylesheet. That's only going to show up in a stylesheet like this that is pathologically large.

The conclusion from this exercise is that whatever the problem is that the original user stylesheet is exhibiting, we are not reproducing it. It could be, for example, that the regression is in the cost of type-checking the Java methods against the XPath function call and is dependent on the specifics of the actual type signatures; we're not going to be able to diagnose that without successful loading of the original classes and methods from the user stylesheet.

Actions

Copy link

#14

Updated by Michael Kay about 8 years ago

I have now been supplied with JAR files containing all the extension functions needed to compile this stylesheet. The compile time under 9.7 is now 29.27s. (with bytecode generation switched off).

Under 9.4 it won't compile:

Error on line 1631 of MainStylesheet.xsl:

XPTY0004: Cannot perform comparisons involving external objects

The offending code is attempting to compare the result of a call to the extension function Scratchpad.read() (which returns Object) to a string.

To get around this I wrapped all calls on Scratchpad.read() in a call on string().

The stylesheet now compiles in 29.22s - exactly the same as the 9.7 timing.

So I downloaded Saxon-SA 8.8 and built an IntelliJ project with it and compiled the stylesheet. It compiled in 3.55 seconds: vastly faster than the 9.4 or 9.7 timings.

Getting 9.7 down to that figure is almost certainly an unrealistic aim; over the years we've been adding more and more compile time "cleverness" in order to reduce execution time. But it's salutary quite how big the difference is, and we ought at least to try and develop an understanding of where the extra cost is going.

Actions

Copy link

#15

Updated by Michael Kay about 8 years ago

From now on data is for 9.7 unless otherwise stated.

Java sampling using JVisualVM shows 80% of the time spent in 5 methods: LetExpression.requiresEagerEvaluation() (28%), ExpressionTool.contains() (15%), LetExpression.countReferences() (15%), Expression.restoreParentPointers() (12%), ExpressionTool.inlineVariableReferences() (11%).

All of the above do tree-walking within the code of a function or template, and will increase in cost according to the size of the function or template.

Running with -opt:0 reduces the compile time from 29.2s to 8.3s. So, as we might expect, the cost is dominated by optimization costs, and it's entirely plausible that the optimization cost for a function/template is quadratic in the size of the function/template.

Let's start with requiresEagerEvaluation(). This is executed for a LetExpression (i.e. a local variable) and it essentially determines whether the initializer for the variable contains (at any depth) a try/catch expression, an asynchronous xsl:result-document, or a multi-threading xsl:for-each. (These are all features introduced since 8.8, and they are all rarely used, so the result will almost always be false.) An alternative way to handle this (now that we have parent pointers in the expression tree) is for these three constructs to set a flag in any ancestor LetExpression. I've made this change and it brings the cost down from 29.2s to 24.8s.

The next two targets are ExpressionTool.contains() (32%) and Expression.restoreParentPointers() (24%).

The challenge with contains() is that it is called from a dozen different places and it's not clear from the JVisualVM sampling which call dominates.

Actions

Copy link

#16

Updated by Michael Kay about 8 years ago

To make progress analyzing the contains() calls I had to revert to good old -agentlib:hprof, with a very high stack depth because the recursive calls on contains() run very deep. But even with a stack depth of 100, the entire stack often contained recursive calls of contains() or restoreParentPointers() or inlineVariableReferences().

The problem with all these methods is that they are called once per node in the expression tree for a function/template, and their cost is proportional to the number of nodes in the function/template, therefore the cost ends up being quadratic in the size of functions/templates.

(As an experiment, I tried turning restoreParentPointers() into a no-op method. Total compile time reduced from 24.8s to 23.2s. The method is largely defensive programming, but doing this for real feels unnecessarily risky.)

I tried instrumenting calls on contains() by adding a label to the calls and maintaining a counter for each label. The most common labels were:

P - 402600 - ExpressionTool.containsLocalParam()

D - 150200 - ExpressionTool.dependsOnVariable()

F - 341100 - ExpressionTool.callsFunction()

The calls on containsLocalParam() are particularly surprising since this method is searching for any LocalParamSetter (regardless of the actual parameter) and the places where these can occur are very limited.

Actions

Copy link

#17

Updated by Michael Kay about 8 years ago

OK, let's look at ExpressionTool.containsLocalParam(). There are three routes to this method: via NamedTemplate.getLocalParam(name), via the (identical) TemplateRule.getLocalParam(name), and via code in PromotionOffer.EXTRACT_GLOBAL_VARIABLES. I suspect the last of these is the chief culprit because it's called essentially whenever we're considering an expression as a candidate for promotion to a global variable.

I suspect there's actually a bug in the code: it assumes that all local params within a template will have distinct names. But this isn't true of local params declared within an xsl:iterate instruction.

I think we can probably handle the EXTRACT_GLOBAL_VARIABLES case by giving LocalParamSetter the property HAS_SIDE_EFFECTS, and testing for this property in the PromotionOffer code. This brings the compile time down from 24.8s to 22.8s.

The method TemplateRule.getLocalParam() is unused and can be safely removed.

The method NamedTemplate.getLocalParam() is used when type-checking calls on xsl;call-template. It can be implemented much more efficiently by searching the list of parameters held at the template level, rather than searching the body of the template.

First attempt to do this actually gives no improvement (and a very slight regression) because the list of params is built lazily and uses the same inefficient approach of searching the whole body. Changed this to build the list of params while compiling the template. Current measurement is 23.5 but I don't think there's really a regression here, it's just a measurement glitch. Running with -repeat:10 the last five timings are 19.66, 18.25, 18.64, 19.06, 17.82.

Actions

Copy link

#18

Updated by Michael Kay about 8 years ago

Now looking at calls on ExpressionTool.callsFunction(). Added some instrumentation; it seems these are essentially all tests whether an expression calls the current() function. The vast majority of the calls emanate from ExpressionTool.resolveCallsToCurrentFunction(), which actually calls callsFunction() twice, once to look for same-context calls and once to look for different-context calls.

The search done by these calls is confined to a single XPath expression or pattern and the cost therefore does not increase with the size of the templates (only with the size of the XPath expressions). I've confirmed that removing this step entirely has very little impact on bottom-line performance. So let's leave this one alone.

This leaves the calls on ExpressionTool.dependsOnVariable(). I suspect these calls are incurring a lot of cost.

A fairly simple measure is to prune the search at a node where DEPENDS_ON_LOCAL_VARIABLES is false: if we know there are no dependencies on variables, we don't need to look for a specific one. Unfortunately this causes a crash ("Internal Saxon error: local variable encountered whose binding has been deleted") presumably because there is an expression that depends on local variables but does not have this property set.

I have added instrumentation to dependsOnVariable() which shows that the vast majority of calls are from PromotionOffer.FOCUS_INDEPENDENT. This path is looking for subexpressions that can be pulled out of a focus-dependent loop (typically a for-each, a filter predicate, or the RHS of "/"). I think there's a problem/opportunity on this path in that the check for variable references is being done before the check for focus dependency, despite being much more expensive. Reversing the order of the tests reduces the counter from 144,500 to 49,000. The speed-up, though, is not great: the last five times are now 18.09 18.65 18.79 17.52 18.13

Re-running the JVisualVM sampler, we're now seeing 23% of time spent in LetExpression.countReferences(), 10% in ExpressionTool.inlineVariableReferences(), 9% in Expression.restoreParentPointers(), 7% in ExpressionTool.contains(), 6% in ExpressionTool.gatherVariableReferences(), 3% in ExpressionTool.hasLoopingSubExpression(). It's probably time to focus on countReferences().

Actions

Copy link

#19

Updated by Michael Kay about 8 years ago

The problem with countReferences() is that the design is deliberately defensive. There's a long history of failures caused by data on the expression tree not being refreshed when tree rewrites are done, and we know that relying on the tree rewrite process to maintain all such derived data is error-prone. So before doing something that's only appropriate when there are no references to a variable (removing the variable), or when there is only one (inlining it), we check how many references there are, just to make sure. To prevent over-long searches the count already stops after it has found sufficiently many references that any more don't matter, but it's possible that finding the first three references could involve a very long search.

An alternative strategy might be for LetExpression to maintain a list of references. It should be easy enough to ensure that any newly created variable reference is added to the list. Ensuring that removed references are dropped from the list is more difficult; but in the places where we currently do countReferences() we could instead do checkReferences() to check that the references on the list are still present in the tree (we can search the ancestors of the variable reference to check that we reach the LetExpression, and at each step we can check that there is a corresponding downwards pointer).

Another possibility is to rely on the fact that tree rewrites always (should) call resetStaticProperties() - we could make this call reset the reference count to -1 and only recount references if the value is -1. The consequences of the reference count being too high are not serious; it should only result in failure to perform optimizations that would otherwise be performed.

LetExpression.countReferences() is current called 76m times.

Rather to my surprise, my first attempt to make the above change increases this to 115m. A possible explanation is that by reducing the number of variables we are inlining or eliminating, we are increasing the number of further optimization attempts.

Adding more counters: with the original code we inlined 7000 variables and eliminated 9000. With the revised code we inlined 7000 and eliminated 2000. It looks like we aren't eliminating a variable after inlining it.

After some adjustments we're still inlining 7000 and eliminating 10,000, and we still have 76m calls on countReferences(). So this isn't helping much.

Some more counters: the number of LetExpressions created is around 26,000, and the number of calls on LetExpression.optimize() is about 70,000 (which is not unreasonable; the code is designed to reoptimize if an optimization pass was successful.)

The number of external calls on countReferences is about 68,000 (one per optimization pass). The remainder are internal recursive calls, indicating that the typical LetExpression in the expression tree has 1000 or so descendant nodes. In fact if we eliminate the early exit from countReferences() after sufficient references have been found, the number of internal calls increases to 115m. Reducing the threshold for early exit from 10 to 3 only reduces the number of internal calls to 73m.

I'm experimenting with a design where every expression carries a mask (a kind of Bloom filter) that indicates which variables are referenced within that expression; if the bits of the hashcode of the variable name are not set in the mask, then that variable is not referenced within the subtree, and we therefore don't need to search the subtree. This is crashing out in exactly the same way (on variable TableId at line 2161) as the attempt to use the DEPENDS_ON_LOCAL_VARIABLES bit; the code is clearly deciding that there is no reference to the variable when there actually is, and is eliminating the variable. With a stylesheet of this size, that's very difficult to debug. However, the same crash occurs with a much simpler test case (sf-avg-A). Although many XSLT tests are passing with this change, it's clearly risky and I think I'm going to back off for the moment, at least as far as maintenance releases are concerned.

Actions

Copy link

#20

Updated by Michael Kay about 8 years ago

Trying another approach to the countReferences() problem: when walking the tree to find references, if we've walked far enough (100 node visits) we just return a number like 17 as the nominal reference count; this prevents optimizations such as variable elimination and inlining. The justification is that these optimizations aren't worth paying such a high price for. With this change, the number of internal calls on countReferences comes down to about 5m, the number of variables inlined drops to 4000, and the number eliminated drops to 6000.

With this change, the last 5 compilation times are 14.73 14.80 13.40 15.04 13.44 -- which is a definite improvement.

The Java profile is still dominated by inlineVariableReferences(), ExpressionTool.contains(), restoreParentPointers(), gatherVariableReferences()

Actions

Copy link

#21

Updated by Michael Kay about 8 years ago

Let's look now at the cost of inlining variables. I wouldn't expect this to show up so strongly because I don't think it happens that often; so it must be quite expensive. Adding some counters shows that we are inlining about 2000 variables because their values are constant, and about 4000 because there is only a single reference to the variable (and it isn't in a loop). LATER: also, missed in these counts, about 4000 variables are being inlined because the variable is identical to another variable (xsl:variable name="x" select="$y").

In the case of constant values, there's a much simpler approach: just rely on the fact that optimizing the subtree will replace any variable references by the constant. I'll try this, adding some code to VariableReference to make doubly-sure that this happens (the way it's written now, it might not happen if the initializing expression has become constant at some time after the binding was established). The benefits here are that we avoid calculating the reference count in this case, we do it even in the cases where the call on countReferences would be abandoned, and we avoid the second search for references to the variable that occurs in the free-standing inlineVariableReferences() method. (I've run the XSLT 3 test suite with this change and I'm getting a few failures (119) but I have a feeling they are unrelated. I'll come back to this later.)

In the case where there is a single (non-looping) reference to the variable, we ought to be able to combine the operation of getting the reference count with the operation of finding the reference and inlining it: we're currently searching the subtree twice in this case. (I would have expected the proportion of cases where we do the inlining to be sufficiently small that this doesn't matter, but the high incidence of inlineVariable() in the Java profile suggests otherwise).

Looking at this, I find it surprising that the method gatherVariableReferences() doesn't appear in the Java profile. This is called during LetExpression.typeCheck() and it's another method that does a full walk of the subtree. I would expect it to feature more strongly than inlineVariableReferences. Sure enough, adding a counter shows 13m calls! This is one of the mysteries of Java sampling - the selection of sampling points is not random and some frequently occurring code can be missed. But it's very strange, especially as this method is not cheap - it is appending to a list whenever a variable reference is encountered.

There are 7 uses of gatherVariableReferences of which 5 apply only to FLWOR expressions and are therefore XQuery-only. In nearly all cases, and certainly in the XSLT cases, the caller simply iterates over the returned list, which means it would be more efficient to replace gatherVariableReferences() that returns a list with a processVariableReferences() method that applies a supplied action to each reference.

Using processVariableReferences() in place of gatherVariableReferences() simply to implement LetExpression.refineTypeInformation() gives a further improvement in compile times: the last five are now 14.10 12.65 14.25 14.41 12.68.

This particular stylesheet does not make the common mistake of overusing "RTF-variables", that is variable containing an xsl:value-of instruction rather than a select attribute. I will change this optimization to use the processVariableReferences() method as well, though I don't expect it to make a difference in this case.

I'm surprised to discover that in many cases, on exit from LetExpression.typeCheck(), the value of refCount is -1, which means that it gets recomputed on entry to optimize(). The reason is that Atomizer.typeCheck() is unconditionally calling resetStaticProperties() (on the Atomizer expression and all its ancestors). The reason for this is apparently to induce parent expressions to recompute their static type, but in this case it seems to be destroying useful information. I suspect a call to resetLocalStaticProperties(), which does not affect the ancestors, would be adequate. But this doesn't help much: before optimizing a template or function we call ExpressionTool.resetPropertiesWithinSubtree() which kills the refCount data for all variables declared within the template or function.

So: how about having the resetPropertiesWithinSubtree() method compute an accurate refCount for all local variables in the template/function, in a single pass, for subsequent use by optimize()?

The improvement is not really measurable: 13.79 12.76 14.42 12.63 13.97

Back into JVisualVM: we're now seeing the top 3 methods as contains(), restoreParentPointers(), and inlineVariableReferences(), all at around 14%. The new processVariableReferences() comes in at 5%.

Actions

Copy link

#22

Updated by Michael Kay about 8 years ago

Removed the call to restoreParentPointers() from promote() (which is only there as defensive programming). Last 5 timings now

11.46 9.93 10.80 9.84 10.79

Looking good!

JVisualVM now shows contains() and inlineVariableReferences() at about 15% each.

Actions

Copy link

#23

Updated by Michael Kay about 8 years ago

We have about 9000 external calls of inlineVariableReferences(), leading to some 38m internal calls - so each external call is searching an expression tree of about 4000 nodes.

I'm experimenting with having the method resetPropertiesWithinSubtree() (called at template/function level between typeCheck() and optimize()) maintain a list of references from an Assignation to its VariableReferences, rather than merely a count. Using this information to do the inlining, rather than doing another search.

Surprisingly, there are now 45m internal calls if inlineVariableReference()! It seems these are happening as a secondary effect of a different optimization: we are extracting subexpressions that don't depend on a control variable from outside the scope of the variable, and this is leading to expressions of the form let $x := $generatedVar return XXX; we are then inlining references to $x within XXX. I've changed this to make use of the reference list for $x. New last-5 timings:

7.39 7.93 8.37 6.90 8.47

This is good enough to close the issue (after we have tested the changes more carefully!) but I want to try one more thing. It turns out there are more "flattened result tree fragment" variables than I realised (about 9100 variables whose value is a result-tree fragment, for which about 8700 qualify for rewriting because all references to the variable are in atomizing contexts.

Final results:

8.04 7.68 6.38 7.36 7.95

Or for a single run from the command line, 11.8s compared with the original 29.3s

Note that this is all with bytecode generation disabled. With bytecode on, it is still very slow.

Actions

Copy link

#24

Updated by Michael Kay about 8 years ago

Test case number-1901 is failing. The test case uses current() within a predicate of the count pattern of xsl:number. current() has been replaced by a variable reference, it seems the slot number of the variable reference and the slot number of the binding are different. Only failing when stylesheet is exported and reloaded.

Without -export, I'm seeing 208 test failures.

A good example of a failure is select-1704. What has happened here is that in the course of optimizing a LetExpression, one of the variables on the LetExpression.references list has somehow been replaced in the expression tree with a different variable reference, and the inlining operation is updating the wrong LocalVariableReference. (This is the classic danger of retaining extra data on the expression tree to improve performance: it needs to be maintained and updated through tree rewrites.)

Down to 64 failures after a couple of fixes: (a) making sure that when we create a new local variable as part of loop-lifting optimizations, the new variable is added to its reference list; (b) avoiding duplicates on the reference list; (c) tolerating references on the reference list that have become detached from the tree as a result of previous rewrites.

Actions

Copy link

#25

Updated by Michael Kay about 8 years ago

Down to 28 failures after fixing an unrelated problem caused by the patch for bug 2710 being incomplete. Quite a few of these 28 seem to be new tests added since 9.7 was released, but we need to go through them all.

Test function-1013 is crashing as a result of an incorrect rewrite. The expression subscript($x, $current) has been rewritten as subscript($x, .) by inlining $current, without recognizing that this makes the expression dependent on the focus and therefore ineligible for loop-lifting out of a focus-controlled loop. This turns out to be caused by a failure to reset dependency information all the way up the expression tree when the rewrite is done. Fixing this in Expression.adoptChildExpression() doesn't work because on some paths the parent pointers up the tree have not yet been established when this is called. So tried fixing it in LetExpression.inlineReferences() instead.

Unfortunately we now have 32 failures... The new ones seem to be in xsl:evaluate.

Actions

Copy link

#26

Updated by Michael Kay about 8 years ago

The problem in xsl:evaluate seems to be that in the absence of a with-params attribute, Saxon creates the default attribute with-params="map{}"; it then rewrites this as with-params="$ggg where $ggg is a global variable initialized to "map{}". So far so good, but it then somehow fixes up to the wrong global variable.

The immediate cause of this is that the evaluateInstr.copy() is failing to make a copy of the dynamicParams subexpression, so it is referenced from two different places in the expression tree, which means that the second binding slot allocated overwrites the first one. This appears to be an existing bug unrelated to these changes; except that it's exposed by the fact that we are doing some rewrites incorrectly: the code works correctly, but it contains unnecessary bindings to local variables that are never used.

Fixed that; reran the full test suite. While fixing the above I had added a diagnostic assertion to check for binding two slot numbers to the same variable reference; hit this assertion in test import-1201. The problem here seems again to be an existing bug unrelated to these changes; when a template rule uses a union pattern, it appears more than once in the indexes held at the Mode level, and is processed more than once when allocating binding slots, which is wrong because the template body is shared. In this case the error triggered by the new assertion is probably spurious; no harm would be done by allocating a new slot number, other than wasting slots. But it's wasted work, so I will stop it happening.

Back to 28 failures.

Actions

Copy link

#27

Updated by Michael Kay about 8 years ago

The (new) code to reset static properties after inlining was wrong. Fixed that. Now 20 failures. Of these all but 6 are also failing on a build without these changes. The remaining tests are cvt-015/6, snapshot-0104, stream-210/500, and try-029.

snapshot-0104 is doing an incorrect rewrite. Given

<xsl:variable name="origin" select="."/>
...
<xsl:sequence select="PATH [f:corresponds(., $origin)]"/>

it is rewriting the predicate as

<xsl:sequence select="PATH [f:corresponds(., .)]"/>

The reason is that for the variable reference $origin, the inLoop property is false, meaning that this is not considered to be a repeated reference.

The reason the inLoop property is false is rather interesting. It seems that for templates, we are running a typeCheck() over the whole template body before we run optimize(); but for functions we are only doing local type-checking at the level of individual XPath expressions. During this fragmented typeCheck, a variable reference is checked without access to its Binding, so it cannot update the reference count for its binding. Adding the global type check fixes the problem. A bit scary that we passed all but six tests with such a glaring hole; and it means we ought to check the code for components other than templates and functions (e.g. attribute sets and global variables).

Test try-029 is failing because a local variable is being inlined into a try/catch block, so an error in evaluating the variable is caught when it should not be. The reason is that the code for inlining a variable of the form (let $x := $y return Z) is unconditional; it does not depend on the (nominal) reference count. The answer is to avoid inlining at this point if the flag "needsEagerEvaluation" is set. Also: the needsEagerEvaluation flag needs to be copied when the LetExpression is copied.

Test stream-210 is failing a streamability test because of an incorrect rewrite: again, a variable bound to "." has been inlined into a predicate where the focus is different. The logic here is convoluted. Initially, the plurality of the variable references is recognized and results in a correct refCount. Then the typechecking replaces "." by "data(.)" as the select expression for the variable; this causes a resetLocalProperties(), which sets the refCount to -1. Then a section of the expression tree is typechecked again as a result of changes, and this processes one of the variable references (which is not in a loop), causing the refCount to be set to 1, which is the value it has on entry to optimize(). This is all beginning to look a bit fragile... Solved by putting another call of resetPropertiesWithinSubtree() (for the template root) between the typecheck phase and the optimize phase. Need to follow the same protocol for other component kinds.

Test stream-500 fixed by the same change, as are cvt-015 and cvt-016. So we now have all tests passing, but we need to make things a bit more robust and add a few more tests before I'm confortable.

Actions

Copy link

#28

Updated by Michael Kay about 8 years ago

I've consolidated the changes and copied them across to the 9.8 branch. All (relevant) XSLT tests are passing but some XQuery tests are crashing. Also, the XSLT tests with -export option show 29 failures and these need to be checked.

A couple of the export failures seem to be a bug that has been unwittingly exposed by these changes: see bug 2737.

Test evaluate-027 is failing in export mode. The immediate problem is that the retainedStaticContext of the evaluate instruction has no base URI information. For some reason (a) the code has changed so the export of the evaluate instruction now gives it an "ns" attribute whereas it did not have one previously, and (b) having an "ns" attribute causes the evaluate instruction to get its own retainedStaticContext rather than inheriting that of its parent instruction; and its own RSC does not get a base URI attribute. The reason evaluateInstr.export() is generating an "ns" attribute is that it thinks the in-scope namespaces of the evaluate instruction are different from the in-scope namespaces of its parent instruction; which is because the in-scope namespaces of the parent apparently include namespace bindings for "a", "b", and "c", which have no business being there. It's not clear what exactly has changed to cause this, it seems to be a remote consequence of doing things in a slightly different order. Expression.adoptChildExpression() copies the static context of the child to the parent if the parent doesn't have one; this is a useful fallback strategy but is wrong if the parent static context is materially different. Fixed this by setting the retained static context during LiteralResultElement.compile(), but not sure this is the final answer.

Actions

Copy link

#29

Updated by Michael Kay about 8 years ago

I have committed the changes on the 9.8 branch but not yet on 9.7. Given the nature of the bugs preventing the last few tests passing, I don't feel the changes are yet robust. I'm now doing some experiments on the 9.8 branch to see if I can improve the robustness. In particular I'm trying to "normalize" the data we keep about references in the Assignation object, and to introduce code that verifies the reference list if the expression tree is changed in the course of optimization.

Actions

Copy link

#30

Updated by Michael Kay about 8 years ago

I have done further work on the 9.8 branch and acquired sufficient confidence to copy it across to the 9.7 branch. The effect of the changes is that a LetExpression now maintains a reference list rather than a reference count, and before doing anything critical we verify that the reference list is sound.

During testing of these changes another issue emerged, which is not directly related. The design intent is that streamability analysis should be done after optimization is complete. However, when an xsl:stream instruction occurs within xsl:function, the streamability analysis is done when the optimization pass hits the xsl:stream instruction, and subsequent (re-)optimization of parent instructions within the xsl:function may trigger rewrites which affect the body of the xsl:stream instruction and hence invalidate the streamability analysis. I have largely resolved these issues by processing xsl:stream instructions in a separate phase after optimize() is complete. Nearly all tests are now passing and I am committing the changes in this state. There are one or two streaming tests failing because xsl:stream instructions get extracted into global variables, and streamability analysis is not done for system-generated global variables. I'm treating this as a separate bug (bug #2753) since I think it is already present in the product without these changes.

Actions

Copy link

#31

Updated by Michael Kay about 8 years ago

Status changed from In Progress to Resolved
Applies to branch 9.8 added
Fix Committed on Branch 9.7, 9.8 added

Closing this as resolved. With the changes applied, the last five compile times are

6.85 7.82 6.89 7.80 7.04

Actions

Copy link

#32

Updated by O'Neil Delpratt about 8 years ago

Related to Bug #2737: Failure to reload an exported streaming stylesheet added

Actions

Copy link

#33

Updated by O'Neil Delpratt about 8 years ago

Status changed from Resolved to Closed
% Done changed from 0 to 100
Fixed in Maintenance Release 9.7.0.6 added

Bug fixed in maintenance release 9.7.0.6.

Actions

Copy link

#34

Updated by O'Neil Delpratt about 7 years ago

Applies to branch trunk added
Applies to branch deleted (~~9.8~~)

Actions

Copy link

#35

Updated by O'Neil Delpratt about 7 years ago

Fix Committed on Branch trunk added
Fix Committed on Branch deleted (~~9.8~~)

Please register to edit this issue

Actions

Send by e-mail Copy link

Also available in: Atom PDF

Project

Profile

Help

Saxon

Bug #2707

Performance regression compiling very large stylesheets

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay over 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by Michael Kay about 8 years ago

Updated by O'Neil Delpratt about 8 years ago

Updated by O'Neil Delpratt about 8 years ago

Updated by O'Neil Delpratt about 7 years ago

Updated by O'Neil Delpratt about 7 years ago