Bug #3209: Regression in performance between Saxon 9.6 and 9.7
Status: Closed
Description
I will send you an XML document and an XSLT stylesheet privately. The stylesheet is really huge; it is automatically generated from Schematron with quick fixes.
Applying the XSLT over the XML with Saxon 9.6.0.10 took 6 seconds. With Saxon 9.7.0.15 it takes 14 seconds.
Updated by Radu Coravu over 7 years ago
Also, it did not matter whether I used HE, PE, or EE, or whether optimizations were fully enabled or fully disabled.
Updated by Michael Kay over 7 years ago
- Subject changed from "Difference in performance between Saxon 9.6 and 9.7" to "Regression in compile-time performance between Saxon 9.6 and 9.7"
- Category set to Performance
- Status changed from New to In Progress
A similar problem with compile-time performance was reported recently by Thomas Berger: see https://saxonica.plan.io/boards/3/topics/6769?r=6779. He was seeing a 4s compile time for a 2MB stylesheet, which isn't that dissimilar from the 14s you are seeing for 15MB. I'm afraid we didn't make very much progress on that one (there weren't any obvious hotspots), but perhaps a more focused comparison of 9.6 versus 9.7 will get us somewhere.
Updated by Radu Coravu over 7 years ago
I'm not sure the stylesheet compilation is the main problem. I did not measure precisely how much time the XSLT compilation takes versus how much applying the XSLT to the XML takes, but from what I profiled, applying the XSLT to the XML definitely takes more time than the XSLT compilation stage (maybe an 80%/20% split).
For applying Schematron in Oxygen we usually reuse the entire Transformer for subsequent validations, so we are probably more interested in the time it takes to apply the XSLT to the XML.
Updated by Michael Kay over 7 years ago
With 9.7 under JDK 1.6 running net.sf.saxon.Transform with options
-xsl:glossary_rules-compiled.xsl -nogo -t -repeat:5 --generateByteCode:off
I'm seeing
Stylesheet compilation time: 9.534221s (9534.221ms)
Stylesheet compilation time: 3.687721s (3687.721ms)
Stylesheet compilation time: 2.801489s (2801.489ms)
Stylesheet compilation time: 2.817711s (2817.711ms)
Stylesheet compilation time: 2.715733s (2715.733ms)
That is, the compile time stabilises at about 2.8s after Java warmup.
Switching to Java 8 I see:
Stylesheet compilation time: 4.796798s (4796.798799ms)
Stylesheet compilation time: 3.379123s (3379.123175ms)
Stylesheet compilation time: 2.481732s (2481.732736ms)
Stylesheet compilation time: 2.364391s (2364.391349ms)
Stylesheet compilation time: 1.974992s (1974.99221ms)
These figures are so much better than yours that I'm wondering what the significant difference is.
With Saxon 9.6 and JDK 1.6 I'm seeing:
Stylesheet compilation time: 6.181352s (6181.352ms)
Stylesheet compilation time: 2.683527s (2683.527ms)
Stylesheet compilation time: 2.443341s (2443.341ms)
Stylesheet compilation time: 2.513528s (2513.528ms)
Stylesheet compilation time: 2.289807s (2289.807ms)
and with Saxon 9.6 and Java 8 it becomes:
Stylesheet compilation time: 4.458636s (4458.63601ms)
Stylesheet compilation time: 3.193878s (3193.878162ms)
Stylesheet compilation time: 1.853771s (1853.771537ms)
Stylesheet compilation time: 1.862288s (1862.28826ms)
Stylesheet compilation time: 1.429746s (1429.746265ms)
If I run with 9.7.0.15 as issued, I get:
java.lang.NullPointerException: Internal error: expression () has no retained static context
at net.sf.saxon.expr.Expression.getPackageData(Expression.java:1129)
at net.sf.saxon.expr.instruct.Choose.optimize(Choose.java:485)
which is bug #3177 - I guess you have this fixed in your build.
So: we are seeing some regression in compile-time performance from 9.6, but it's not nearly as great as you are reporting. It would be good to find out why that is.
I also did a run with the current 9.8 build, just out of curiosity:
Stylesheet compilation time: 4.34822s (4348.220607ms)
Stylesheet compilation time: 961.056728ms
Stylesheet compilation time: 666.578541ms
Stylesheet compilation time: 591.028138ms
Stylesheet compilation time: 692.596555ms
(must get rid of this spurious precision...)
It's going to be much more difficult to get meaningful compile time numbers for 9.8 because it introduces JIT compilation where template rules aren't compiled until the first time they fire. If I switch JIT off I get numbers fairly similar to those for 9.7:
Stylesheet compilation time: 5.717462s (5717.462016ms)
Stylesheet compilation time: 4.336657s (4336.657088ms)
Stylesheet compilation time: 3.069848s (3069.848049ms)
Stylesheet compilation time: 2.434434s (2434.434849ms)
Stylesheet compilation time: 2.588826s (2588.826526ms)
Updated by Michael Kay over 7 years ago
- Subject changed from "Regression in compile-time performance between Saxon 9.6 and 9.7" to "Regression in performance between Saxon 9.6 and 9.7"
OK, thanks. I focused on compile-time performance because with a huge stylesheet and a small source document that's usually where the problems lie. But I'll now take a look at run-time performance too.
Updated by Michael Kay over 7 years ago
Looking now at execution time, again with -repeat:5, with 9.7 and Java 8 I'm seeing:
Average execution time over last 4 runs: 12.23548s (12235.480724ms)
(individual times 12.58, 12.02, 12.32, 12.24, 12.34)
while 9.6 gives
Average execution time over last 3 runs: 7.187672s (7187.67284ms)
But the 9.6 average is wrong, because the individual execution times are 5.54, 5.28, 5.38, 5.45, 5.43
So there's definitely a regression in run-time performance here.
For 9.8 with -jit:off I get compile time = 2.7s, execution time 11.36s
For 9.8 with -jit:on I get compile time = 615ms, first execution time 15.7s, subsequent execution 11.28s so it's essentially just shifting the compile cost to the first execution in this case.
Back in 9.7, running with -TP, the profile shows a hotspot in the template rule at line 15119, but although it's higher than any other rule, it's only 762ms out of a total of 14570ms.
There's no such hotspot for 9.6.
For both releases, the profile shows a large number of templates matching text nodes, each of which is being evaluated exactly 1057 times. In 9.7 most of these have an average net execution time of 25-50ms; in 9.6 the typical range is 10-20ms.
Looking at the match patterns, they take the typical DITA form
match="text()[not(ancestor::*[contains(@class,' glossentry/glossentry ') or contains(@class,' topic/keyword ') or contains(@class,' topic/xref ') or contains(@class,' topic/ph pr-d/codeph ') or contains(@class,' topic/pre pr-d/codeblock ') or contains(@class,' topic/ph sw-d/filepath ') or contains(@class, ' topic/ph sw-d/userinput ') or contains(@class, ' topic/ph sw-d/systemoutput ')])]"
Given a large number of template rules of this form, I would expect most of the cost to be in pattern matching. Since 9.8 is trying to optimize DITA patterns, I'm disappointed it isn't doing better - perhaps it's the not() that throws it.
Looking now at the Java profile (for 9.7), calls on contains(), evaluation of Or expressions, and following the attribute axis certainly feature strongly as one would expect from these patterns, but the surprise is that regular expression evaluation accounts for a lot of the cost. Indeed, there seems to be a lot of time spent compiling regular expressions.
In the 9.6 Java profile, there's also a lot of regular expression activity, but it's nearly all evaluation of regexes, rather than compilation. So perhaps 9.6 is managing to reuse compiled regular expressions where 9.7 isn't?
Updated by Michael Kay over 7 years ago
Adding instrumentation internally, in Saxon 9.7 we are compiling 1049000 regular expressions: 5000 during stylesheet compilation, the rest during execution. In 9.6 we are only compiling 6000 regexes, all at stylesheet compile time.
So this is definitely pointing us to where the difference lies.
Looking at the first run-time call on RECompiler.compile(), it derives from execution of the expression at line 2232:
<xsl:if test="matches(.,concat('(^|\W)','(1080p)','(\W|$)'),'i')">
So there's really no reason we shouldn't be able to compile this regular expression statically, even though it isn't actually written as a string literal - but it's probably the fact that it's written as a concat() that's throwing us.
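For illustration only: the sketch below uses java.util.regex rather than Saxon's own RECompiler, and the exact timings are machine-dependent, but it shows the cost difference between compiling the regex afresh on every evaluation and compiling it once up front, which is possible here because every argument to concat() is a string literal.

import java.util.regex.Pattern;

public class RegexCompileCost {

    // What an unoptimized dynamic evaluation of matches() effectively does:
    // compile the regular expression on every call.
    static boolean matchesDynamic(String input, String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    // What static precompilation allows: the regex is assembled from constant
    // parts, so it can be concatenated and compiled once, before any evaluation.
    static final Pattern PRECOMPILED =
            Pattern.compile("(^|\\W)" + "(1080p)" + "(\\W|$)", Pattern.CASE_INSENSITIVE);

    static boolean matchesStatic(String input) {
        return PRECOMPILED.matcher(input).find();
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            matchesDynamic("plays 1080p video", "(^|\\W)(1080p)(\\W|$)");
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            matchesStatic("plays 1080p video");
        }
        long t2 = System.nanoTime();
        System.out.println("per-call compilation: " + (t1 - t0) / 1_000_000 + "ms");
        System.out.println("precompiled:          " + (t2 - t1) / 1_000_000 + "ms");
    }
}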
Updated by Radu Coravu over 7 years ago
Keeping my fingers crossed; hopefully you'll be able to find a fix that brings the speed back up. Maybe it would be worth having an automated test run on these samples (or a similar set of samples) to see, from release to release, whether the speed has degraded and by how much.
Updated by Michael Kay over 7 years ago
I've fixed matches() so it now precompiles this regular expression (by adding the method RegexFunction.makeOptimizedFunctionCall()). This eliminates all run-time compilation of regular expressions.
Average execution time is now 8.85s. So we've eliminated about half the regression, but we still haven't got it down to the 9.6 figure of 5.4s.
Updated by Michael Kay over 7 years ago
Radu, we do have performance regression tests (the Speedo benchmark we described at XML London a couple of years ago). What we are finding though is that a significant number of performance problems slip through the net - most performance issues turn out to be dependent on a very particular set of circumstances, and the more special cases we optimize, the more cases we need to test. So we certainly need to expand the set of performance tests we run, but I'm not confident we can prevent all such incidents that way.
Updated by Michael Kay over 7 years ago
The template rule at line 15119 now stands out even more strongly, though it's still only 1000ms out of 15000ms. Unfortunately this rule is quite long and complex, so it's not clear where the cost is going more specifically. The number of executions of this template is the same as in 9.6 (namely 1057), but the net execution time is 1004ms compared with 10.7ms in 9.6, so there's definitely something wrong in this area: the difficulty is seeing what. The template appears to be dominated by calls on xsl:analyze-string.
Further tracing shows that there is some run-time compilation of the regular expressions used by xsl:analyze-string, but only about a dozen calls, and not in the relevant area. They are calls that could be avoided, for example line 203639 which does:
<xsl:variable name="regex"
select="concat('(^|\W)','(Waveform\s+Audio\s+Format|WAV)','(\W|$)')"
as="xs:string"/>
<xsl:analyze-string select="." regex="{$regex}" flags="i">
but it doesn't look to me as if they are relevant.
Updated by Radu Coravu over 7 years ago
The regexp fix you made does not seem to help much.
I'm testing with the following code:
import java.io.StringWriter;
import java.net.MalformedURLException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
public static void main(String[] args) throws MalformedURLException, TransformerException {
    // Time stylesheet compilation (creation of the Transformer).
    long before = System.currentTimeMillis();
    Transformer tr = new net.sf.saxon.TransformerFactoryImpl().newTransformer(new SAXSource(new InputSource("file:/C:/Users/radu_coravu/Desktop/saxonProfile/glossary_rules-compiled.xsl")));
    long after = System.currentTimeMillis();
    System.err.println("Create transformer " + (after - before));
    // Time the transformation itself, discarding the output.
    before = System.currentTimeMillis();
    tr.transform(new SAXSource(new InputSource("C:/Users/radu_coravu/Desktop/saxonProfile/t_setting_loudness_peak_value_test.xml")), new StreamResult(new StringWriter()));
    after = System.currentTimeMillis();
    System.err.println("RUN transformer " + (after - before));
}
I removed the DOCTYPE declaration from the XML (I also tested with XML catalog support and the DOCTYPE left in the XML; the timings were similar).
On my computer, which is quite fast, using the Saxon 9.7 JARs bundled with Oxygen, the code above gives these timings:
Create transformer 8149
RUN transformer 10201
So 8 seconds to create, 10 to run.
Updated by Radu Coravu over 7 years ago
The above results were with Java 1.6. With Java 1.8 I get lower times:
Create transformer 7452
RUN transformer 8976
Without the RegexFunction patch you recently made, the timings are:
Create transformer 7364
RUN transformer 13896
so yes, despite what I previously said, it seems that the RegexFunction patch is worth it.
Updated by Radu Coravu over 7 years ago
Timings for the code above, run with Saxon 9.6:
- With Java 1.6 (but who cares about Java 1.6 anymore...):
Create transformer 6003
RUN transformer 7270
- With Java 1.8:
Create transformer 6570
RUN transformer 5954
So with Java 1.8 Saxon 9.7 takes about 3 seconds more than Saxon 9.6 when running the XSLT on the XML.
But your RegexFunction fix seems to have shaved about 4 seconds off the original larger difference in timing.
Updated by Michael Kay over 7 years ago
In tracing, I noticed that towards the end of the template, the instruction
<xsl:apply-templates select="@*|*|comment()|processing-instruction()" mode="M31"/> (line 15508)
is being executed even though the select expression is clearly void (the context item is a text node). Digging deeper, I noticed that when a SimpleStepExpression ./X is type-checked and X is found to be void, it returns the SimpleStepExpression ./() rather than (), which inhibits further optimisation.
I also noticed that an expression that reduces to (if (()) then X else Y) is not being simplified to Y: this is because Choose.removeRedundantBranches() is only checking for constants true and false, not for other constants whose effective boolean value is statically known. When dealing with machine-generated code, such opportunities arise a lot.
I have made these improvements, but as one might expect, they have little effect on the bottom line since they are eliminating expressions that are very quick to evaluate. The main reason for doing such tweaks is that they sometimes enable a more powerful optimization later.
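As a rough sketch of the stronger check (using invented, highly simplified types, not Saxon's actual Expression/Choose classes): a branch is removed whenever its condition is a constant whose effective boolean value is statically known, for example an empty-sequence literal, not only when it is the literal true() or false().

import java.util.ArrayList;
import java.util.List;

// Invented, simplified stand-ins for the expression tree; Saxon's real classes differ.
interface Expr {
    // Boolean.TRUE/FALSE if the effective boolean value is statically known
    // (a boolean literal, an empty sequence, a non-empty string literal, ...), else null.
    Boolean staticEffectiveBooleanValue();
}

class Branch {
    final Expr condition;
    final Expr action;
    Branch(Expr condition, Expr action) { this.condition = condition; this.action = action; }
}

class ChooseSimplifier {
    // Drops branches whose condition can never be true, and truncates the list after the
    // first condition that is always true; a choose reduced to a single always-true branch
    // can then be replaced by that branch's action.
    static List<Branch> removeRedundantBranches(List<Branch> branches) {
        List<Branch> result = new ArrayList<>();
        for (Branch b : branches) {
            Boolean ebv = b.condition.staticEffectiveBooleanValue();
            if (Boolean.FALSE.equals(ebv)) {
                continue;   // e.g. the condition () in (if (()) then X else Y): drop the X branch
            }
            result.add(b);
            if (Boolean.TRUE.equals(ebv)) {
                break;      // remaining branches are unreachable
            }
        }
        return result;
    }
}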
I decided to focus on the calls on fn:matches(). I added instrumentation to both 9.6 and 9.7 to count the number of calls to matches(), and the elapsed time spent in them, by line number. I confirmed that both 9.6 and 9.7 are performing the same number of calls on matches() (1044316 calls), and the total cost of these calls is actually higher in 9.6 (3.28s versus 2.99s). So this doesn't appear to be the cause of the trouble. It does show, however, that calls to matches() are a significant part of the overall execution cost.
Updated by Michael Kay over 7 years ago
I decided to look at the time spent in template rule matching for apply-templates. I put timing calls around the call to getRule() in Mode.applyTemplates.
Total time spent rule matching in 9.6 - 0.79s
Total time spent rule matching in 9.7 - 4.05s
This is roughly the same as the difference in total execution time between 9.6 and 9.7, so I think we're getting closer.
The number of rules being matched is essentially the same in both cases (815901 / 815902), suggesting that the difference is in the cost of pattern matching rather than the algorithm for deciding which patterns to test.
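(The instrumentation itself was nothing more elaborate than the following kind of wrapper; RuleMatchTimer is an invented helper shown only to indicate how the figures above were gathered, not actual Saxon code.)

// Hypothetical instrumentation sketch: accumulate the elapsed time and call count
// around the rule-matching call (in the experiment, around the call to Mode.getRule()
// inside Mode.applyTemplates).
final class RuleMatchTimer {
    static long totalNanos = 0;
    static long calls = 0;

    static <T> T time(java.util.function.Supplier<T> getRule) {
        long start = System.nanoTime();
        try {
            return getRule.get();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    static void report() {
        System.err.printf("rule matching: %d calls, %.2fs%n", calls, totalNanos / 1e9);
    }
}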
Updated by Michael Kay over 7 years ago
Rather surprisingly, it emerges that only two patterns are being matched:
(a) The pattern child::node(), which is matched 291954 times, taking a total of 14.5ms on both releases
(b) The complex pattern in mode M32 on line 15528:
match="text()[not(ancestor::*[contains(@class,' glossentry/glossentry ') or contains(@class,' topic/keyword ') or contains(@class,' topic/xref ') or contains(@class,' topic/ph pr-d/codeph ') or contains(@class,' topic/pre pr-d/codeblock ') or contains(@class,' topic/ph sw-d/filepath ') or contains(@class, ' topic/ph sw-d/userinput ') or contains(@class, ' topic/ph sw-d/systemoutput ')])]"
In 9.6 we see count='522158' time='773074356' (times in nanoseconds, so about 0.77s)
In 9.7 we see count='522158' time='3885149906' (about 3.89s)
So the extra cost is accounted for by the cost of evaluating this pattern.
But both releases are making 24632816 calls on fn:contains(), which one would think is the main ingredient.
There is a difference in the way the predicate is compiled in the two releases: in 9.6 the not() call is turned into an empty() call. This suggests that 9.7 is failing to call optimize() on the pattern, which could have other adverse effects.
I have reinstated the optimization that converts not() calls to empty() calls (and that removes any requirement for reordering of nodes in the operand expression). The compiled pattern now appears to take the same form as in 9.6. However, this change has no effect on the bottom line execution time: the patterns are still taking a lot longer to evaluate, and it's not at all clear why.
At this point it may well be worth creating a cut-down stylesheet to use for further investigation of the problem.
Updated by Michael Kay over 7 years ago
Current state:
With -repeat:5 --generateByteCode:on under Java 8 I'm seeing
9.7: Stylesheet compilation time: 2.952422s
Average execution time over last 4 runs: 19.937185s
9.6: Stylesheet compilation time: 3.364971s
Average execution time over last 3 runs: 6.644918s
With --generateByteCode:off it is:
9.7: Stylesheet compilation time: 1.929447s
Average execution time over last 4 runs: 19.008467s
9.6: Stylesheet compilation time: 1.645436s
Average execution time over last 3 runs: 6.109873s
So we seem to have done something to make things worse...! (I've confirmed, however, that the rewrite of not() to empty() is a small improvement.) It's possible that the extra costs are due to instrumentation left lying around in the code.
I was hoping bytecode would improve matters because, tracing through the pattern evaluation in the debugger, the main difference between 9.6 and 9.7 seems to be extra work navigating around the expression tree, following from operands to their contained expressions, and this work isn't needed with bytecode enabled. But I'm not sure we are generating bytecode for pattern predicates anyway.
The Java profile for the 9.7 interpreted code shows nothing unexpected: it's spending its time in the places we would expect when evaluating these patterns.
Updated by Michael Kay over 7 years ago
I have got the 9.7 performance down to 6.01s (better than the 9.6 figure) by introducing a "fast path" call of fn:contains().
This is basically bypassing the general call() mechanism of SystemFunctionCall/SystemFunction whereby the arguments are evaluated as generic sequences. Prior to 9.6, every system function had a specialist class for static invocation that evaluated the arguments in a way appropriate to that function; with the introduction of dynamic calls this was becoming unwieldy and unmaintainable because we needed separate implementations of each function for static and dynamic calls. For the most common and critical functions, like string() and not() and exists(), we retained specialist call classes: but not for contains().
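Schematically, the difference is something like the following (Arg, EvalContext and the method names are invented stand-ins for Saxon's argument and sequence machinery, shown purely to illustrate the shape of the two code paths):

// Illustrative sketch only: the generic call path packages every argument as a
// (possibly lazy) sequence and unwraps it again, while the fast path evaluates the
// two operands of contains() directly as strings.
interface EvalContext {}

interface Arg {
    Iterable<Object> iterate(EvalContext c);       // generic: argument as a sequence of items
    String evaluateAsString(EvalContext c);        // fast path: argument directly as a string
}

class ContainsCall {

    // Generic mechanism: wrap, iterate, unwrap - a lot of machinery for two strings.
    static boolean callGeneric(Arg[] args, EvalContext c) {
        Object haystackItem = first(args[0].iterate(c));
        Object needleItem = first(args[1].iterate(c));
        String haystack = haystackItem == null ? "" : haystackItem.toString();
        String needle = needleItem == null ? "" : needleItem.toString();
        return haystack.contains(needle);
    }

    // Fast path: contains() only ever needs two strings, so skip the sequence packaging.
    static boolean callFastPath(Arg[] args, EvalContext c) {
        return args[0].evaluateAsString(c).contains(args[1].evaluateAsString(c));
    }

    private static Object first(Iterable<Object> seq) {
        for (Object item : seq) {
            return item;
        }
        return null;
    }
}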
Although this change solves the specific use case, and I am therefore committing it on the 9.7 and 9.8 branches, I shall leave the bug open for the time being because I think we need to review the general principles.
This use case would also greatly benefit from extraction of common subexpressions (the @class references), which we have started doing for some particular scenarios where it looks promising, and it's worth looking at that opportunity for 9.8.
Updated by Radu Coravu over 7 years ago
I added the patches on my side and I confirm your timings.
What's funny is that if I uncomment the DOCTYPE in the XML and run the XSLT processing using the DITA XML catalogs, I get an even better timing for applying the XSLT, about 4.5 seconds, probably because the @class attributes start showing up on the elements and they start matching more diverse templates.
For your convenience I will also attach to an email the same XML file with the @class attributes expanded in place (passed through a copy stylesheet).
We'll also run some automated tests on our side to see if our set of XSLT tests catches any side effects of the optimizations. We also have quite a lot of DITA Open Toolkit tests, publishing DITA to HTML and asserting on the outputs; in the past these have sometimes caught, as a side effect, various exceptions coming from the Saxon code when processing the rather complex DITA XSLTs.
Updated by Michael Kay over 7 years ago
The fact that the "fast path call" on contains() makes such a big difference comes as a bit of a surprise. It suggests that we ought to look at the generic call method on system functions and see if it can be improved. Perhaps we should be doing something like we do for calls on user functions, with a range of different evaluation methods for the arguments: at the moment we special-case literals and variable references, but everything else uses
new LazySequence(arg[i].iterate(context))
which is a bit heavy-handed, especially when the expected type is a simple atomic value.
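A very rough sketch of that idea, with all type and method names invented (ArgExpression, Sequence, LazySequence and EvalContext here are stand-ins, not Saxon's actual API):

import java.util.Iterator;

// Illustrative only: choose an evaluation strategy per argument based on what is
// statically known about it, instead of always wrapping the iterator in a lazy sequence.
interface Sequence {}
interface EvalContext {}

interface ArgExpression {
    boolean isLiteral();
    boolean isVariableReference();
    boolean expectedTypeIsSingleAtomic();
    Sequence constantValue();                    // literal: value known at compile time
    Sequence boundValue(EvalContext context);    // variable reference: value already materialised
    Sequence evaluateSingleItem(EvalContext context);
    Iterator<Object> iterate(EvalContext context);
}

class LazySequence implements Sequence {
    final Iterator<Object> items;                // consumed on demand by the function body
    LazySequence(Iterator<Object> items) { this.items = items; }
}

class ArgumentEvaluator {
    static Sequence evaluateArgument(ArgExpression arg, EvalContext context) {
        if (arg.isLiteral()) {
            return arg.constantValue();               // nothing to do at run time
        }
        if (arg.isVariableReference()) {
            return arg.boundValue(context);           // already evaluated elsewhere
        }
        if (arg.expectedTypeIsSingleAtomic()) {
            return arg.evaluateSingleItem(context);   // eager, single item, no wrapper needed
        }
        return new LazySequence(arg.iterate(context)); // general fallback, as today
    }
}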
Updated by Michael Kay over 7 years ago
On the 9.8 branch, with the improvements described above, I'm seeing an execution time of 5.15s.
I've done some experiments to introduce extraction of common subexpressions (currently just within the predicate of a filter expression, to try things out). This is correctly identifying convertUntyped(data(@class)) as a common subexpression, and binding it to a local variable. This change brings a further improvement in the timing to 4.57s.
Applying this optimization in the right places without adverse impact on compile time requires further work, but the initial results certainly look promising.
Updated by Radu Coravu over 7 years ago
I added the fixes as patches on my side for Saxon 9.6; we have now hit the following stack overflow in our DITA publishing, caused by the patches:
BUILD FAILED
D:\projects\trunk\eXml\frameworks\dita\DITA-OT2.x\build.xml:45: The following error occurred while executing this line:
D:\projects\trunk\eXml\frameworks\dita\DITA-OT2.x\plugins\com.oxygenxml.webhelp.classic\build_dita.xml:85: The following error occurred while executing this line:
D:\projects\trunk\eXml\frameworks\dita\DITA-OT2.x\plugins\org.dita.base\build_preprocess.xml:207: java.lang.StackOverflowError
at java.util.ArrayList.iterator(Unknown Source)
at net.sf.saxon.expr.Assignation.addReference(Assignation.java:426)
at net.sf.saxon.expr.LocalVariableReference.copy(LocalVariableReference.java:70)
at net.sf.saxon.expr.SystemFunctionCall.copy(SystemFunctionCall.java:220)
at net.sf.saxon.expr.SlashExpression.copy(SlashExpression.java:629)
at com.saxonica.ee.optim.OptimizerEE.convertPathExpressionToKey(OptimizerEE.java:493)
at net.sf.saxon.expr.SlashExpression.optimize(SlashExpression.java:352)
at net.sf.saxon.expr.Operand.optimize(Operand.java:200)
at net.sf.saxon.expr.FilterExpression.optimize(FilterExpression.java:301)
at net.sf.saxon.expr.Operand.optimize(Operand.java:200)
at net.sf.saxon.expr.SlashExpression.optimize(SlashExpression.java:327)
at net.sf.saxon.expr.Operand.optimize(Operand.java:200)
at net.sf.saxon.expr.FilterExpression.optimize(FilterExpression.java:301)
at net.sf.saxon.expr.Operand.optimize(Operand.java:200)
Updated by Radu Coravu over 7 years ago
Do you need a sample XSLT stylesheet for this or does the stack trace provide enough information about why this may happen?
Updated by Radu Coravu over 7 years ago
The latest changes in the net.sf.saxon.expr.instruct.Choose seem to be responsible for the StackOverflowError.
Updated by Radu Coravu over 7 years ago
Also, the optimization fixes in "net.sf.saxon.functions.Contains" generate warnings at publish time like:
[xslt] *** Bad parent pointer found in " topic/pre " at 908 ***
Updated by Michael Kay over 7 years ago
- Status changed from In Progress to Resolved
- Assignee set to Michael Kay
I think it's time to close this thread. I've read through it carefully: in the course of investigation we made quite a few improvements, some on the 9.7 branch and some on 9.8, and most of the ideas that we came up with were implemented in 9.8. If there are further things that need to be done, it would be best to start a new thread.
Updated by Radu Coravu over 7 years ago
So about my last comments:
The latest changes in the net.sf.saxon.expr.instruct.Choose seem to be responsible for the StackOverflowError.
Also, the optimization fixes in "net.sf.saxon.functions.Contains" generate warnings at publish time like:
[xslt] *** Bad parent pointer found in " topic/pre " at 908 ***
If those changes made to "net.sf.saxon.expr.instruct.Choose" and "net.sf.saxon.functions.Contains" for the purpose of speed improvements were included in the Saxon 9.7 branch, then there is a problem.