Bug #2747

Performance regression in 9.7.0.4 vs 9.1.0.8

Added by Anonymous almost 8 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2016-05-13
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
9.7
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

I wanted to try out the new DITA Open Toolkit optimization in Saxon-EE. I was surprised to find that, even with the optimization enabled for some parts of the DITA-OT publishing pipeline, publishing the DITA-OT user guide with Saxon-EE 9.7.0.4 is consistently around 10 seconds slower than with Saxon-B 9.1.0.8 (which is what DITA-OT uses by default, because some of its XSLT code relies on Java extensions). The total DITA-OT execution time for that document with Saxon-B 9.1.0.8 is around 20–25 seconds on my machine.

So I tried boiling down the issue to its absolute minimum, and I was surprised to find that Saxon 9.7.0.4 is much slower at performing a simple identity transformation than Saxon-B 9.1.0.8. For example:

$ time java -cp Saxon-HE-9.7.0-4.jar net.sf.saxon.Transform -repeat:12500 -xsl:identity.xsl -s:identity.xsl -o:/dev/null
*** Average execution time over last 6251 runs: 0.206689ms
32.67s user 0.57s system 142% cpu 23.396 total
$ time java -cp saxon-9.1.0.8.jar net.sf.saxon.Transform -repeat:12500 -xsl:identity.xsl -s:identity.xsl -o:/dev/null
*** Average execution time over 12500 runs: 1ms
14.59s user 0.13s system 103% cpu 14.173 total
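
For reference, identity.xsl is nothing special; it is essentially the textbook identity template, along these lines (a sketch, the actual file may differ slightly, e.g. in its version attribute):

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- the classic identity template: copy every attribute and node through unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>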

Over 12,500 repeats of this simple transformation, Saxon-B 9.1.0.8 is more than twice as fast.

Is there something I'm missing? Maybe a configuration option I should toggle? With performance like this, using Saxon 9.7 with DITA-OT is pretty much a non-starter.

The performance is similar on my MacBook and on a Linux server I tested this on.


Files

Saxon-HE-9.7.0-4.txt (3.94 MB) java -cp Saxon-HE-9.7.0-4.jar net.sf.saxon.Transform -t -repeat:12500 -xsl:identity.xsl -s:identity.xsl -o:/dev/null --generateByteCode:off Anonymous, 2016-05-13 15:11
saxon-9.1.0.8.txt (4.8 MB) java -cp saxon-9.1.0.8.jar net.sf.saxon.Transform -t -repeat:12500 -xsl:identity.xsl -s:identity.xsl -o:/dev/null Anonymous, 2016-05-13 15:12
userguide_MERGED.xml (1.78 MB) Anonymous, 2016-05-13 16:19
Saxon-HE-9.7.0-4_userguide.txt (353 KB) java -cp Saxon-HE-9.7.0-4.jar net.sf.saxon.Transform -t -repeat:1000 -xsl:identity.xsl -s:userguide_MERGED.xml -o:/dev/null --generateByteCode:off Anonymous, 2016-05-13 17:32
saxon-9.1.0.8_userguide.txt (428 KB) java -cp saxon-9.1.0.8.jar net.sf.saxon.Transform -t -repeat:1000 -xsl:identity.xsl -s:userguide_MERGED.xml -o:/dev/null Anonymous, 2016-05-13 17:32
Actions #1

Updated by O'Neil Delpratt almost 8 years ago

Hi,

Instead of using the time command, you can use the -t option available in Saxon. The -t option shows timings for compilation, execution, and the other stages of processing.

It might also be worth switching off the bytecode generation feature, which may be adding to the compile time, i.e. --generateByteCode:off

Example of command:

java -cp Saxon-HE-9.7.0-4.jar net.sf.saxon.Transform -repeat:12500 -xsl:identity.xsl -s:identity.xsl -o:/dev/null -t --generateByteCode:off

Please let us know how this goes. It would be good to narrow down where the time is being spent.

Actions #2

Updated by Anonymous almost 8 years ago

Hi O'Neil,

Thanks for the quick reply!

I attached the timing results for 9.7.0.4 and 9.1.0.8. It looks like stylesheet compilation takes longer with 9.7.0.4 even with bytecode generation disabled, but that's not enough to explain the difference. I can't immediately glean any other useful information from the timings; perhaps a larger stylesheet would highlight the differences better?

Actions #3

Updated by John Lumley almost 8 years ago

  • Category set to Performance

The sample XML file you're apparently using is so small that the figures don't mean much (i.e. 0ms for 9.1, which is below its timer resolution). I'd suggest re-running the identity transform with a reasonably sized input (at least 10 kB) and a lower repeat count.

Actions #4

Updated by O'Neil Delpratt almost 8 years ago

Looking at the timing results, Saxon 9.1 does not report compilation time for each repeated run. The first run would be expensive due to Java VM warm-up. Compilation in Saxon 9.7 on the first run is slower than in Saxon 9.1, but we cannot simply conclude that 9.7 is slower than 9.1, because we need to factor in warm-up time and other machine activity. What is clear is that Saxon 9.7's compilation time drops considerably after the first run.

If we examine the execution times, Saxon 9.7 clearly outperforms Saxon 9.1. But, as John mentioned, it would be good to use a larger XML document for a better comparison.

Actions #5

Updated by Anonymous almost 8 years ago

As suggested, I re-ran the transformation with a larger input document (1.8 MB). That does seem to reduce the gap quite a bit.

I attached the timing results and the input document.

For what it's worth, here's what the time command reports:

$ time java -cp saxon-9.1.0.8.jar net.sf.saxon.Transform -repeat:1000 -xsl:identity.xsl -s:userguide_MERGED.xml -o:/dev/null
*** Average execution time over 1000 runs: 57ms
       57.86 real        61.61 user         2.12 sys

$ time java -cp Saxon-HE-9.7.0-4.jar net.sf.saxon.Transform -repeat:1000 -xsl:identity.xsl -s:userguide_MERGED.xml -o:/dev/null --generateByteCode:off
*** Average execution time over last 501 runs: 72.675179ms
       77.06 real        86.60 user         1.72 sys

I ran those transformations a couple of times and sometimes the difference is smaller, sometimes larger (my system isn't under load when running those transformations, though), but so far 9.1.0.8 has been faster every time. However, as you suspected, the difference is less significant with the larger input document.

Here's a quick and dirty (and hopefully correct) grep that sums up the execution times:

$ grep "Execution time: " saxon-9.1.0.8_userguide.txt | cut -d' ' -f 3 | paste -sd+ - | bc
48510

$ grep "Execution time: " Saxon-HE-9.7.0-4_userguide.txt  | cut -d' ' -f 3 | cut -d'm' -f 1 | paste -sd+ - | bc
57940.927567

With regard to DITA-OT specifically, it's starting to look like the performance issue there concerns Saxon-EE in particular, which leads me to think that it might be caused by some configuration option that's enabled by default in Saxon-EE but not in Saxon-HE. However, it's difficult to pinpoint: there are many steps in the DITA-OT publishing process, and because of its architecture it isn't always easy to profile a single transformation step in isolation. I'll need to do some more investigating.

Actions #6

Updated by Michael Kay almost 8 years ago

Just to reinforce O'Neil's comments on your original measurements: the figures are not comparable, because when you use -repeat:12500

  • in 9.1, the stylesheet is compiled once and executed 12,500 times

  • in 9.7, the stylesheet is compiled 12,500 times and then executed 12,500 times

The difference reflects the fact that over recent releases we have started to recognize that compile time performance is something that we need to pay more attention to, and the first step in doing that is to measure it.

We are aware that compile time performance has got worse over recent releases, reflecting the fact that we spend an increasing amount of time optimizing the query to maximise run-time speed. There are definitely cases where this is an inappropriate strategy, in particular when compiling a very large stylesheet which is then used to transform a very small source document.

Actions #7

Updated by O'Neil Delpratt almost 8 years ago

  • Assignee set to John Lumley
Actions #9

Updated by Eliot Kimber almost 8 years ago

I'd like to better understand the performance characteristics of Saxon 9.7+ as used within the Open Toolkit. The apparent performance issues are a barrier to upgrading the OT from Saxon 9.1, where it is currently stuck.

Currently the OT uses Saxon through normal Ant XSLT tasks as part of a multi-stage Ant pipeline.

My questions:

  • Is there anything the OT should be doing to maximize Saxon performance in that case?

  • How best to measure transform performance so that we can meaningfully compare the 9.1 and 9.7 performance?

Actions #10

Updated by Michael Kay almost 8 years ago

  • Status changed from New to In Progress
Actions #11

Updated by Jorge Williams over 7 years ago

I'm seeing similar performance degradation with the latest Saxon compared with 9.1.0.8. In my case we are dealing with source documents in the tens of megabytes, and the performance difference is significant.

9.7.0.10:

time java -jar ~/SaxonEE9-7-0-10J/saxon9ee.jar  -t -o:normalized/allfeeds.wadl -xsl:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl.xsl  -strip:all -s:allfeeds.wadl format=tree-format xsdVersion=1.1 resource_types=keep flattenXsds=false
Saxon-EE 9.7.0.10J from Saxonica
Java version 1.8.0_74
Using license serial number S003962
URIResolver.resolve href="normalizeWadl1.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl.xsl"
URIResolver.resolve href="normalizeWadl2.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl3.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl4.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl5.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl1.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl.xsl"
URIResolver.resolve href="normalizeWadl2.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl3.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl4.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
URIResolver.resolve href="normalizeWadl5.xsl" base="file:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl1.xsl"
Generating byte code...
Stylesheet compilation time: 1.229742s (1229.742135ms)
Processing file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 4.090026ms
Tree size: 112 nodes, 0 characters, 269 attributes
URIResolver.resolve href="file:/Users/jorgew/projects/standard-usage-schemas/wadl/feed.wadl" base="file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl"
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/feed.wadl using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 19.063416ms
Tree size: 345 nodes, 7071 characters, 348 attributes
URIResolver.resolve href="file:/Users/jorgew/projects/standard-usage-schemas/wadl/product.wadl" base="file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl"
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/product.wadl using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 214.053814ms
Tree size: 50344 nodes, 1290140 characters, 14877 attributes
URIResolver.resolve href="file:/Users/jorgew/projects/standard-usage-schemas/wadl/cadf.wadl" base="file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl"
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/cadf.wadl using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 5.21093ms
Tree size: 552 nodes, 19804 characters, 160 attributes
Execution time: 6m 49.954626s (409954.626514ms)
Memory used: 114583016

real	6m51.856s
user	7m2.252s
sys	0m4.364s

9.1.0.8

time java -jar ~/projects/wadl-tools/lib/saxon-9.1.0.8.jar  -t -o:normalized/allfeeds.wadl -xsl:/Users/jorgew/projects/wadl-tools/xsl/normalizeWadl.xsl  -strip:all -s:allfeeds.wadl format=tree-format xsdVersion=1.1 resource_types=keep flattenXsds=false
Saxon 9.1.0.8J from Saxonica
Java version 1.8.0_74
Stylesheet compilation time: 463 milliseconds
Processing file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/allfeeds.wadl using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 5 milliseconds
Tree size: 112 nodes, 0 characters, 269 attributes
Loading net.sf.saxon.event.MessageEmitter
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/feed.wadl using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 15 milliseconds
Tree size: 345 nodes, 7071 characters, 348 attributes
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/product.wadl using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 192 milliseconds
Tree size: 50344 nodes, 1290140 characters, 14877 attributes
Building tree for file:/Users/jorgew/projects/standard-usage-schemas/wadl/cadf.wadl using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 6 milliseconds
Tree size: 552 nodes, 19804 characters, 160 attributes
Execution time: 84633 milliseconds
Memory used: 105085312
NamePool contents: 202 entries in 165 chains. 111 prefixes, 101 URIs

real	1m25.354s
user	1m33.456s
sys	0m1.845s

This is the XSL that I'm using to test:

https://github.com/rackerlabs/wadl-tools/blob/master/xsl/normalizeWadl.xsl

The document I run this on is private, but I can probably give you access through a back channel if it'll help you diagnose the issue.

Thanks

Actions #12

Updated by Michael Kay over 7 years ago

Thanks for reporting this; we'll be happy to look at it, but we do need a source document to test against.

The timing difference here is much greater than with the previous workload reported on this thread. Differences of 10-20% can be very hard to analyze, but a difference of 500% usually has a simple cause which is relatively easy to discover. So I'm looking forward to seeing the test document.

Actions #13

Updated by Jorge Williams over 7 years ago

Great, sent you an email with the test document.

Actions #14

Updated by Michael Kay over 7 years ago

  • Assignee changed from John Lumley to Michael Kay

Thanks, I have reproduced this, getting very similar numbers.

Actions #15

Updated by Michael Kay over 7 years ago

9.6 time is also 6m 49s

9.4 time is 6m 0s

9.3 time is 1m 12s

So it seems the regression was between 9.3 and 9.4

Actions #16

Updated by Michael Kay over 7 years ago

Java profiling shows 90% of the time is in ComplexContentOutputter.startContent() and NamespaceIterator.sendNamespaces().

Very similar profile with bytecode disabled.

Thread dump shows:

at net.sf.saxon.tree.util.NamespaceIterator.sendNamespaces(NamespaceIterator.java:66)
at net.sf.saxon.expr.instruct.Copy.outputNamespaceNodes(Copy.java:391)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:363)
at net.sf.saxon.expr.instruct.Copy.processLeavingTail(Copy.java:435)
at net.sf.saxon.expr.instruct.Choose.processLeavingTail(Choose.java:838)
at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:656)
at net.sf.saxon.expr.instruct.TemplateRule.applyLeavingTail(TemplateRule.java:356)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:456)
at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:301)
at net.sf.saxon.expr.instruct.ApplyTemplates.process(ApplyTemplates.java:254)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:366)
at net.sf.saxon.expr.instruct.Copy.processLeavingTail(Copy.java:435)
at net.sf.saxon.expr.instruct.Choose.processLeavingTail(Choose.java:838)
at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:656)
at net.sf.saxon.expr.instruct.TemplateRule.applyLeavingTail(TemplateRule.java:356)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:456)

which all looks perfectly reasonable; the problem must be that a large number of namespace nodes are being written.

Actions #17

Updated by Michael Kay over 7 years ago

Sure enough, stopping it in the debugger, I see one case where ComplexContentOutputter.startContent() has a list of 97 pending namespaces to deal with. These all look like real namespaces, most of them are in the www.rackspace.com domain, and they have prefixes like w_ns46 that give the impression that the use of such a large number of namespaces was a deliberate design choice.

And when we look in the source document allFeeds.wadl we do indeed find vast numbers of namespace declarations.
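
To illustrate the pattern (a hypothetical fragment for illustration only, not the actual customer document; the prefixes and URIs below are invented), the document effectively contains elements carrying dozens of declarations like this:

<resources xmlns:w_ns45="http://www.rackspace.com/ns/example-45"
           xmlns:w_ns46="http://www.rackspace.com/ns/example-46"
           xmlns:w_ns47="http://www.rackspace.com/ns/example-47">
  <!-- ...and many more declarations of the same kind... -->
</resources>

With the default xsl:copy behaviour, every copied descendant element carries all of these in-scope namespace bindings with it, which is presumably how the count of namespace events reaches the hundreds of millions.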

So I guess something happened between 9.3 and 9.4 to make the handling of such large numbers of namespace declarations perform poorly.

It turns out that it was in 9.4 that we replaced the use of integer namespace codes by NamespaceBinding objects. This was done in order to reduce contention on the NamePool: it was a change that gave significant benefits for the workloads we measured where contention was a significant problem. There is always a risk that a change made to benefit the performance of one workload will have an adverse effect on others, but that doesn't necessarily mean there is nothing we can do about it.

Actions #18

Updated by Michael Kay over 7 years ago

I can think of two possible theories as to why the namespace handling has become more expensive. One possibility is that we are not taking as much care to eliminate unnecessary namespaces early on in the pipeline (for example where an element has the same namespaces as its parent). The other is that comparison operations needed to eliminate unnecessary namespaces have become more expensive because of the change in representation.

I instrumented both 9.3 and 9.7 to count the number of namespace bindings passing through ComplexContentOutputter.startContent(): the figures are roughly the same: 680m in each case. A vast number, no doubt accounting for a very high proportion of the transformation cost in both cases, but this doesn't get us closer to finding where the extra costs come from in 9.7.

To discover whether the costs are upstream or downstream of ComplexContentOutputter, I commented out the code that copies the namespace nodes. Execution time was then 48s for 9.3, versus 183s for 9.7. That seems to suggest that the costs aren't all in one place: the differences occur both upstream and downstream of CCO.

After reinstating this code, I experimented (in 9.7) with allocating NamespaceBinding objects from a static pool, for faster equality comparison. A slight improvement, but nothing dramatic: time 5m 44s.

Next experiment (again after reverting the above change) was to cut out the code in NamespaceIterator that ensures we don't output a namespace declaration on a parent element that is shadowed by a different declaration of the same prefix on a child element. (Although this isn't likely to happen in practice, the code is of course necessary; I just wanted to see what it was costing.) This again took about a minute off the time, taking it down to 5m 43s.

It's beginning to look as if the extra costs of using NamespaceBinding objects aren't all in one place, but are very distributed. In the normal course of events the extra costs are negligible, but with 680 million of them, they add up.

Actions #19

Updated by Jorge Williams over 7 years ago

Interesting. We use namespaces extensively because we're dealing with validating messages for a number of products, and each product (essentially) manages its own XML schema etc.; the namespaces are there to prevent clashes.

You'll notice in wadl/product.wadl that a good portion of the document is dedicated to XPath assertions -- we've had all sorts of bugs where we generated an XPath and lost the namespace declaration that went with it along the way. Very annoying. I wouldn't be surprised if we could eliminate all of those declarations, and the kinds of errors that go along with them, by switching over to XPath 3.0 and using braced URI literals!
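
For anyone following along: a braced URI literal lets an XPath 3.0 expression spell out the namespace URI inline, so the expression no longer depends on a prefix declaration elsewhere in the document. An illustration (the namespace here is made up):

With a prefix (requires an in-scope declaration such as xmlns:ev="http://example.com/events"):

  /ev:event/ev:product

With an XPath 3.0 braced URI literal (no prefix declaration needed):

  /Q{http://example.com/events}event/Q{http://example.com/events}product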

That said, it'd be nice not to have to revisit that code :-)

Actions #20

Updated by Michael Kay over 7 years ago

My next experiment was to modify the stylesheet to replace all occurrences of xsl:copy by <xsl:copy copy-namespaces="no">.

The execution time came down to 9 seconds.

I haven't compared whether the result is equivalent, but this demonstrates dramatically where the stylesheet's costs are going.
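
For illustration, the change amounts to this kind of edit to the copying templates (a sketch of the technique, not the actual code in normalizeWadl.xsl):

<!-- original pattern: each copied element also carries all of its in-scope namespaces -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

<!-- modified pattern: namespaces are emitted only where element and attribute names need them -->
<xsl:template match="@* | node()">
  <xsl:copy copy-namespaces="no">
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

The caveat is that with copy-namespaces="no" any binding referenced only from text content (for example a prefix used inside an XPath string) is no longer copied, which is exactly the kind of breakage Jorge described above, and why the equivalence of the result is in question.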

I'm coming to the conclusion that although the performance regression between 9.3 and 9.4 is unfortunate, it's not a bug: it's a consequence of a deliberate design decision that optimizes some workloads at the expense of others, and a transformation using hundreds of namespaces in this way is untypical enough that the decision wasn't unreasonable. Even if the addition of copy-namespaces="no" isn't sound here, I suspect there are ways of improving the performance of this stylesheet dramatically by changing the XSLT code, and that no Saxon changes are warranted.

Actions #21

Updated by Michael Kay over 7 years ago

  • Status changed from In Progress to Resolved
  • Priority changed from Low to Normal

Because this latest problem (concerning shallow copying of namespaces) has no connection at all with the original subject of this bug entry, I am belatedly transferring it to a new issue at

https://saxonica.plan.io/issues/3011

Actions #22

Updated by Anonymous over 7 years ago

Just checking: did you mean to mark this issue as Resolved? I don't think the regression with DITA-OT is related to namespaces, since the DITA-OT stylesheets make very limited use of them, AFAIK.

Actions #23

Updated by Michael Kay over 7 years ago

  • Status changed from Resolved to In Progress

Sorry, no, I should have left the original issue open. It gets very confusing when someone adds a new issue to an existing thread believing it to be the same issue....

Actions #24

Updated by Michael Kay almost 7 years ago

  • Status changed from In Progress to Resolved

I think it's time to close this issue (with no further action); it actually contains two threads relating to different performance problems. I think we made useful progress on both, and although there will always be scope for further improvements, it would be best for any further investigation to start with new measurements and use cases.
