Bug #3885
closedPerformance regression from Saxon 9.1 to 9.8 (caused by space-stripping)
100%
Description
Reported by direct email.
SUMMARY: A user workload showed performance regression of about 25% from Saxon 8.9 to Saxon 9.8. The result of investigation showed that although whitespace was being stripped from the source tree during document building, an unnecessary space-stripping wrapper was being added over the source tree, slowing down all navigation within the tree.
Files
Updated by Michael Kay about 6 years ago
Running the supplied workload successfully.
It's running the same transformation repeatedly in 10 different threads, using the same source document, but rebuilding the tree for each transformation (presumably to simulate processing different source documents).
Initial observations:
No obvious hotspots (or other surprises) in JProfiler; no sign of any excessive namepool access.
Each thread starts by compiling the stylesheet and then putting it in the cache, so it's compiled 10 times, which had me confused at first, but I don't think it's doing any real harm or affecting the measurements significantly.
I'm a little surprised by the "burstiness". I added a message immediately before each transformation starts, and it looks like this:
10 transformations starting in 10 different threads in the space of 150ms
Start transformation 10 in thread 13 at 1535619841709
Start transformation 11 in thread 19 at 1535619841711
Start transformation 12 in thread 18 at 1535619841734
Start transformation 13 in thread 20 at 1535619841759
Start transformation 14 in thread 14 at 1535619841816
Start transformation 15 in thread 11 at 1535619841818
Start transformation 16 in thread 15 at 1535619841826
Start transformation 17 in thread 16 at 1535619841827
Start transformation 18 in thread 17 at 1535619841852
Start transformation 19 in thread 12 at 1535619841863
Now a gap of 3400ms, followed by another 10 transformations starting in 10 threads in the space of 150ms
Start transformation 20 in thread 20 at 1535619845222
Start transformation 21 in thread 13 at 1535619845322
Start transformation 22 in thread 14 at 1535619845326
Start transformation 23 in thread 18 at 1535619845327
Start transformation 24 in thread 12 at 1535619845344
Start transformation 25 in thread 16 at 1535619845352
Start transformation 26 in thread 11 at 1535619845374
Start transformation 27 in thread 17 at 1535619845375
Start transformation 28 in thread 19 at 1535619845376
Start transformation 29 in thread 15 at 1535619845393
And then the same again (though the gap is only 2000ms this time):
Start transformation 30 in thread 20 at 1535619847305
Start transformation 31 in thread 16 at 1535619847351
Start transformation 32 in thread 13 at 1535619847360
Start transformation 33 in thread 18 at 1535619847360
Start transformation 34 in thread 19 at 1535619847361
Start transformation 35 in thread 17 at 1535619847365
Start transformation 36 in thread 12 at 1535619847373
Start transformation 37 in thread 11 at 1535619847374
Start transformation 38 in thread 14 at 1535619847375
Start transformation 39 in thread 15 at 1535619847376
And so it continues, quite systematically (the gap is 1700 on this occasion).
Start transformation 40 in thread 20 at 1535619849072
I can't see any obvious reason why the threads should be "marching in step" like this.
I'm not comfortable with the way the application is creating a new TransformerFactory for every transformation, sharing the namepool but not the configuration. The JProfiler doesn't show up any obvious problems with this, but it's an unusual way of operating and I think I need to look at whether it causes any synchronization effects.
Updated by Michael Kay about 6 years ago
Further tracing shows that there is no delay between a thread finishing one transformation and starting the next. During the gaps between bursts of messages, all 10 threads are busy doing transformations. The odd thing is that they all finish at pretty well exactly the same time, without the randomness that might be expected. Perhaps I should experiment by inserting a random wait time for each thread before starting the next transformation.
Updated by Michael Kay about 6 years ago
I cut out the unnecessary creation of a new TransformerFactory for each transformation (it's not even used, unless the stylesheet needs to be recompiled) and this reduced the average processing time in each thread from 1800ms to 1600ms. The threads are still in step.
Updated by Michael Kay about 6 years ago
I'm not at all convinced there is a multithreading problem here.
Measuring throughput (transformations per second) against number of threads I'm seeing
THREADS THROUGHPUT
1 1.29
2 2.32
4 3.37
6 3.93
8 4.06
12 3.57
16 2.62
32 2.52
which seems an entirely reasonable profile for a CPU-dominated workload on a 4-core machine. Adding more threads beyond 8 sees a gradual decline in performance because the CPU is already running at 100% so adding more threads just creates memory pressure and synchronization overheads. There's no evidence here of any resource contention at the application level.
I'm now going to take a look at the single-threaded performance of the transformation comparing 9.8 with earlier releases.
Updated by Stefan Fritz about 6 years ago
The tests I ran were done on a cpu with 6 physical cores i7 8700K. If I reduce threads to 6 I get 530ms avg processing time. With 3 threads 435ms and 350ms with a single thread.
The other aspect is that the same transformation only took on avg. 265ms with Saxon 8.9.04.
So we are faced with 350ms vs 265ms. Have to admit we saw the same performance degradation with 8.9.0.4 when we add threads. I cannot tell either whether that is expected/normal with multiple threads was just not expecting double processing time with 10 vs 1 thread.
Updated by Michael Kay about 6 years ago
I think throughput is a much more useful measure than elapsed time. If we assume for simplicity that the CPU is 100% occupied when the number of threads equals the number of cores, then beyond that point, assuming no scheduling overhead or memory pressure, doubling the number of threads will give no improvement in throughput, and will therefore double the elapsed time of each transformation. In the real world there is scheduling overhead and memory pressure, so doubling the number of threads will increase the elapsed time by more than a factor of 2.
Updated by Michael Kay about 6 years ago
- Status changed from New to In Progress
Now measuring single-thread performance of the transformation under 9.8.
Running from the command line with -repeat:20 I get average execution time of 372.15ms.
Switching off bytecode generation has no noticeable effect.
Switching JIT off increases the execution time slightly (to 397.31ms).
Switching all optimizations off increases the execution time to 431ms.
Profiling with -TP:profile.html shows no very marked hotspots in the stylesheet. The most visited templates/functions are (with counts and total net time):
GlobalTypes.xsl#575 - 755/301ms
Shared.xsl#78 - 8213/179ms
Namespace.xsl#40 - 12174/167ms
Shared.xsl#93 - 3430/113ms
The only thing that's notable here is that there are a lot of templates matching with match="*" (presumably in different modes).
Unfortunately the profile truncates the stylesheet module URI to .../Core/Shared.xsl
and there are a number of modules that match this abbreviated URI.
Updated by Michael Kay about 6 years ago
9.7 under the same conditions gives me an average execution time of 381ms.
9.6 crashes ("Cannot copy a variable reference whose binding is unknown"). May be a problem with my build status: not worth getting diverted into debugging this.
9.5 runs in 681ms (!). Ran it again to be quite sure: 671ms. Switched to using Java 8: 556ms.
9.4 - not run, my IntelliJ project needs cleaning up.
9.3 - crashed with a NullPointerException in Expression.getHostLanguage()
9.2 runs in 438ms
9.1 runs in 533ms
I don't have any IntelliJ projects on my machine for earlier releases than 9.1, but I'll try getting some execution times from the command line. So far, however, I'm not seeing any evidence of regression.
Updated by Michael Kay about 6 years ago
I'm now measuring the same transformation, again with -repeat:20, but this time directly from the terminal rather than in IntelliJ. Ran each one twice:
EE 9.8.0.14: 427.93ms, 382.61ms
EE 9.7.0.20: 485.37ms, 427.66ms
EE 9.6.0.9: crashed (as in IntelliJ)
EE 9.5.1.10: 463ms, 516ms
EE 9.4.0.6: 754ms, 759ms
EE 9.3.0.9: crashed (as in IntelliJ)
EE 9.2.1.5: 414ms, 412ms
SA 9.1.0.8: 519ms, 508ms
SA 9.0.9.3: 497ms, 493ms
Earlier release do not have the -repeat option, so it's not possible to get comparable results.
There's no evidence of regression here, though the variation in the numbers for the same product version, obtained a few minutes apart from each other on the same machine, illustrates how difficult it is to get useful numbers.
The 9.8 figures are particularly inconsistent so I reran them, with results 376ms, 376ms, 374ms, 376ms (looks like the earlier variation was probably caused by some background task on the machine)
Rerunning 9.1.0.8 gave 508ms, 523ms, 512ms, 510ms.
Updated by Stefan Fritz about 6 years ago
- File SaxonOldTests.java SaxonOldTests.java added
The initial sample code I sent you SaxonOldTests.java runs against 8.9.04 and 9.8 on my machine. (attached it here) Even there I see the difference. With 8.9.0.4 and a single thread it takes 370ms on average (did a warmup). With 9.8.0.14 the same code takes 555ms on average.
I understand that 8.9 does not have the repeat option but please run this sample to see a difference in performance.
Updated by Michael Kay about 6 years ago
Yes, my next step will be to run the supplied sample application against different Saxon releases to see what results I get from that.
Updated by Michael Kay about 6 years ago
Running the SaxonTestLikeSonic test with the issued Saxon 9.8 EE jar, with threads=8, modified to report throughput data, gives
throughput 4.13 transformations per second average processing time 1151ms
Modified to work with Saxon 8.9, I get
throughput 5.12 transformations per second average processing time 1015 seconds
So it is indeed a little faster under 8.9 at this level of concurrency.
If we reduce the number of threads to 1, 8.9 gives us 2.29 / 331
If I switch back to 9.8, with threads=1, I get 1.76 / 404
9.1 gives 2.03 / 365
So we're seeing a regression for this application that we didn't see from the command line, even with a single thread.
Note that the numbers here aren't directly comparable with the command line numbers because the -repeat option on the command line doesn't include source document parsing time. However, the document is small so this shouldn't be significant.
Updated by Michael Kay about 6 years ago
If I remove the code that messes around with transformer factories, I get (with threads=1)
8.9.0.4 - throughput 2.29, elapsed 322ms
9.1.0.8 - throughput 2.02, elapsed 378ms
9.8.0.1 - throughput 1.79, elapsed 403ms
which removes one area of investigation...
Updated by Stefan Fritz about 6 years ago
- File saxon8.zip saxon8.zip added
Seems the difference is not as much as I saw in our product and with the SaxonTestLikeSonic.java. I guess it is because SaxonTestLikeSonic does some caching which we do also in our product.
So what I did is I now modified SaxonTestLikeSonic (same caching, template usage etc) to work with Saxon 8.9. (attached the new project).
There I think the difference is more obvious to see and measure.
This gives me with 8.9 283ms with one thread and the SaxonTestLikeSonic gives 365ms for one thread.
Updated by Stefan Fritz about 6 years ago
My comment was written before you posted your latest update :-)
Updated by Michael Kay about 6 years ago
The execution profiles for 8.9 and 9.8, viewed in JProfiler, are very different. The most noticeable difference is that 9.8 is spending 7% of its time in DocumentSorter, sorting nodes into document order, and this does not seem to appear on the 8.9 profile at all. (By contrast, 8.9 is spending much more time comparing character strings, but it's 9.8 that we want to investigate!)
The calls on DocumentSorter seem to come from many different places. The dominant one is a call on ApplyTemplates.
Looking at the expression tree, there seem to be too many docOrder() calls. I'm wondering if the unusual way of compiling the stylesheet using a TemplatesHandler has affected the optimisations...
I replaced the getTemplates()
method with a simple
getTransformerFactory().newTemplates(new StreamSource(m_URI));
but it made no difference to the performance numbers (throughput 1.75, elapsed 412ms)
Updated by Michael Kay about 6 years ago
But we're not seeing all this document-order sorting when we run the same transformation from the command line. Something is different.
Updated by Michael Kay about 6 years ago
Although the JProfiler results look different, inserting an actual counter into the 9.8 code shows that both the command line and the application do around 5000 sorts-into-document-order while performing this transformation.
Updated by Stefan Fritz about 6 years ago
but it made no difference to the performance numbers (throughput 1.75, elapsed 412ms)
I tried this too with the samples I shared. In my tests it made a difference:
//m_templates = getTemplates();
m_templates = m_xformFactory.newTemplates(new StreamSource(m_URI));
SaxonTestLikeSonic.java 9.8 292ms (Java 8)
Saxon8TestLikeSonic.java 8.9 264ms (Java 8)
m_templates = getTemplates();
//m_templates = m_xformFactory.newTemplates(new StreamSource(m_URI));
SaxonTestLikeSonic.java 9.8 336ms --> seems this has impact (Java 8)
SaxonTestLikeSonic.java 9.8 365ms --> seems this has impact (Java 7)
Saxon8TestLikeSonic.java 8.9 275ms (Java 7)
Saxon8TestLikeSonic.java 8.9 269ms (Java 8)
Updated by Stefan Fritz about 6 years ago
Any new findings on this ? ( our customer is pushing us for an update :-) )
Updated by Michael Kay about 6 years ago
It's on the list, but having put everything else aside to spend a full day on it as soon as it came in, further work is having to be scheduled alongside other priorities.
In the background I've been thinking about how best to tackle it. Investigating a 25% performance regression is much more difficult than investigating a 500% regression, and it's complicated by the fact that we're measuring against a 10-year old release where an awful lot has changed, which means that the usual approach of comparing execution paths may not be effective.
The fact that we're getting different results from the command line and from the API ought to be a significant clue, but I'm not sure how to extract anything from it.
In the end it's one where we may have to admit defeat anyway. Over the course of time some things are going to get faster and others slower, and while it's nice to understand what the actual causative factors are, it's not actually a bug.
Updated by Stefan Fritz about 6 years ago
I totally understand that there are additional features but if thousands of XSLs are processed per hour 25% is a lot. This is why it is a pain for our customers. They use XSL for business transactions and this is increasing the latency for their processes.
Updated by Michael Kay about 6 years ago
I have now retrieved the Saxon 8.9 source code from archives, and have built an IntelliJ project for it. I'm running SaxonTestLikeSonic with threads=1, iterations=40, warmup=5, and getting average processing time of 337 to 341 ms on 8.9 compared with 413 to 417ms on 9.8. I'm now in a position to try and investigate these numbers.
Updated by Michael Kay about 6 years ago
I have retrofitted the TimingTraceListener (-TP option) to work with the 8.9 code so I can compare the 8.9 and 9.8 results. The execution counts for each template appear to be the same, though the ordering by net cost is not. But there's nothing that stands out as an obvious anomaly.
Updated by Michael Kay about 6 years ago
I added code to count the number of Closures and MemoClosures being created - a measure of how much lazy evaluation is being used. The counters were very different:
8.9: Closures 243798, MemoClosures 243618
9.8: Closures 127272, MemoClosures 26286
(Note that the Closure count includes the MemoClosure count.)
This means that 8.9 is doing a lot more lazy evaluation than 9.8. Unfortunately this is not always a good thing - lazy evaluation sometimes costs more than it is worth, and it depends on the data volumetrics which we aren't able to predict, so decisions that benefit one workload will often make things worse for another. But we should at least understand why this is happening.
If we subtract the figures, 9.8 is creating a lot more non-memo-closures than 8.9, which could itself be a possible factor.
I also counted the number of apply-templates calls: 490506 in 8.9, 490188 in 9.8. That's a tiny difference, but it could be a significant symptom (it's probably a consequence of different lazy evaluation strategy).
Updated by Michael Kay about 6 years ago
Many of these MemoClosures
are being created because of change in evaluation strategy for many-to-one general comparisons. An example that is frequently evaluated occurs in xsltest/ThirdPartyTransformations/Agri/UDXML2CXML/XSLT/Core/GlobalTypes.xsl#508 with the predicate *[local-name()=$AltSourceDescriptionName]
. Here $AltSourceDescriptionName
is a template parameter with no declared type information, so we can't generate a simple string comparison, we have to allow for the parameter to be any type at run-time including a sequence. In 8.9 this is done by generating the code let $vv0 := local-name() return some $vv in $AltSourceDescriptionName satisfies $vv0
which introduces two new local variables; these are evaluated lazily leading to the creation of many MemoClosures. Recent Saxon releases have changed this to use a native "compareManyToOne" expression implementation which ought definitely to be an improvement in nearly all cases. Perhaps the only case where it isn't an improvement is where $AltSourceDescriptionName
is an empty sequence, in which case the current 9.8 code is evaluating the call on local-name() whereas the lazy evaluation code bypasses that evaluation.
Updated by Michael Kay about 6 years ago
One thing I notice from the JProfiler output is that we are not doing whitespace stripping while building the source tree, rather we are constructing a space-stripped view of the tree. This has been known to cause significant run-time overhead in the past, and it's not clear why it's happening when the transformation is run in this way. This could also explain why the timings running from the Java API are different from the command line measurements. It's probably particularly significant for this workload because it seems to be doing quite a lot of navigation within a fairly small source document, so the benefit of getting rid of whitespace up-front should be quite high.
We're building the source document using Controller.makeSourceTree() which puts a SpaceStripper on the builder pipeline, so it looks to me as if we correctly build a tree with whitespace removed. But then Controller.prepareInputTree() is adding an unnecessary space-stripping wrapper. This is going to impose a run-time overhead because it will unnecessarily examine every text node to see if its a whitespace node that needs to be stripped.
I experimentally removed the code that adds this wrapper, and it brings the 9.8 execution time down to 351ms which is now very close to the 8.9 time (though still a little bit higher). Of course this is just an experiment, we need to find out what the actual logic for removing this wrapper should be.
Updated by Michael Kay about 6 years ago
More experiments, and I'm convinced this is the answer. After modifying the code to remove the addition of the space-stripping wrapper, I'm seeing execution times in three runs on 9.8 of 350 / 360 / 367 ms, compared with 371 / 370 / 367 ms under 8.9. That's essentially no observable difference.
Note that this is consistent with the fact that the performance is different between the command-line invocation and the JAXP invocation. The command line takes control of space stripping itself and is careful to ensure that it doesn't happen a second time within the Controller.
Now the only question is how best to fix the code, given that there have been previous bugs in this area and it's all a bit fragile. That's partly because the spec is inherently messy, especially in 3.0 where streaming and other new features have complicated the stylesheet invocation model, and partly because the JAXP API has to be retrofitted to this new model, but we ought to be able to do something more robust.
I'm going to experiment with a design where we mark the TreeInfo of an XDM tree to indicate what space-stripping has been done during the building of the tree, and use this information to avoid adding a redundant space-stripping layer to the tree later.
Updated by Michael Kay about 6 years ago
- Subject changed from Performance regression from Saxon 9.1 to 9.8 to Performance regression from Saxon 9.1 to 9.8 (caused by space-stripping)
- Description updated (diff)
- Status changed from In Progress to Resolved
- Priority changed from Low to Normal
- Applies to branch 9.8, trunk added
- Fix Committed on Branch 9.8, trunk added
Fix committed as described.
For 9.8 we store the details of space stripping applied during document construction within the userData of the TreeInfo object.
For 9.9 I have extended the TreeInfo interface to contain the space stripping details as a new property of the TreeInfo.
Updated by O'Neil Delpratt about 6 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 9.9.0.1 added
Bug fix applied in the Saxon 9.9.0.1 major release.
Leave open until fix applied in the next Saxon 9.8 maintenance release.
Updated by O'Neil Delpratt about 6 years ago
- Status changed from Resolved to Closed
- Fixed in Maintenance Release 9.8.0.15 added
Bug fix applied in the Saxon 9.8.0.15 maintenance release.
Please register to edit this issue