How to improve stylesheet loading

Added by Thomas Berger over 7 years ago

Hello. A company I'm working with is in the process of switching a document processing toolchain from Xalan-J to Saxon EE in order to eventually benefit from XSLT 3.x processing. Currently, however, everything is strictly XSLT 1.0 with some home-grown extensions implemented in Java. [The document templates make heavy use of includable fragments, and creating a document instance implies not only putting data into the right slots but also selecting and arranging several documents and/or templates according to the input data. Every template exists in a preprocessed form, namely as an XSLT stylesheet.]

Profiling a sample of several thousand cases yields an average increase of 120% in processing time, i.e. processing with Saxon (currently 9.7.0.12) is more than twice as slow as it used to be with Xalan. Closer inspection shows that those steps in the chain which rely on fixed XSLT stylesheets enjoy a 20%-75% reduction; however, the one step applying a stylesheet that has just been assembled by the preceding steps is about 6 times slower than with Xalan (and now accounts for about 90% of total time).

This setup certainly falls under the most unfortunate case: applying a comparatively huge stylesheet to a small input document only once and then throwing it away.

BTW, the obvious option @--generateByteCode:off@ had already been applied, partly because of some "too large" warnings and partly because this makes processing noticeably faster in our case.
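(For reference, the same switch can also be applied programmatically, roughly along these lines if the s9api entry point is used - a sketch, not our actual toolchain code:)

    import net.sf.saxon.lib.FeatureKeys;
    import net.sf.saxon.s9api.Processor;

    public class DisableByteCode {
        public static void main(String[] args) {
            Processor processor = new Processor(true);  // Saxon-EE
            // programmatic equivalent of the command-line switch --generateByteCode:off
            processor.setConfigurationProperty(FeatureKeys.GENERATE_BYTE_CODE, Boolean.FALSE);
            // ... hand this Processor to the usual compile/transform code ...
        }
    }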

I've started to investigate one particular example in more detail (hoping to find, by experiment, the magic configuration option that saves the day), confining it to the problematic step in the chain:

-repeat:                                             1         10       100
Total time as reported by Java (sec)               4.4         13        52
Same, per instance (ms)                           4400       1300       520
Total execution time per Saxon -TP (ms)            250          -         -
Average execution time reported by Saxon (ms)        -         70        20

Considerable warm-up effects seem to show up here, but cross-checking the same setup with @-nogo@ confirms the finding that execution time plays a very minor role in the whole scenario, as do different values for @-opt@. [BTW, the .sef file exported by -nogo contains about 80,000 element nodes, the input document fewer than 200. Trying to execute the .sef file instead of the .xsl file did not come to a happy end, but it exposed similar timing; thus I tend to conclude that the file operations for dealing with the 30+ xsl:include directives in my example do no major harm.]
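(The numbers above come from repeated command-line runs; the same split between stylesheet compilation and execution can be reproduced from Java roughly as follows - a sketch with placeholder file names, not our actual driver code:)

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class CompileVsRunTiming {
        public static void main(String[] args) throws SaxonApiException {
            Processor processor = new Processor(true);   // Saxon-EE
            XsltCompiler compiler = processor.newXsltCompiler();

            long t0 = System.nanoTime();
            // "assembled.xsl" stands for the stylesheet generated by the preceding pipeline steps
            XsltExecutable executable = compiler.compile(new StreamSource(new File("assembled.xsl")));
            long t1 = System.nanoTime();

            XdmNode input = processor.newDocumentBuilder().build(new StreamSource(new File("input.xml")));
            XsltTransformer transformer = executable.load();
            transformer.setInitialContextNode(input);
            transformer.setDestination(processor.newSerializer(new File("output.xml")));
            transformer.transform();
            long t2 = System.nanoTime();

            System.out.printf("compile: %d ms, execute: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }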

I suspect that a more ingenious choice of SAX parser will also not have much impact, and it seems to me (not being a Java person) that there is no substantial interference from the garbage collector. And (with @--generateByteCode:off@) no single Java method really sticks out in the profiling report.

Of course(?) the loading and compilation process does make use of threads, and throwing more CPUs into the picture might mitigate the real-time effect of the performance loss. Could I be missing subtle signs that memory is an issue here?

So, what is to be expected when loading a rather large stylesheet? Are there simple factors which determine the setup behavior, like outrageously long variable names? Can overly convoluted XPath expressions have a disproportionate influence on compile time? Would parsing benefit or suffer from newlines in the input? Are there simple metrics, such as the setup cost being roughly proportional to stylesheet size (measured in bytes, nodes, or XPath complexity)?

Any help is welcome.

Thomas Berger


Replies (6)


RE: How to improve stylesheet loading - Added by Michael Kay over 7 years ago

We're very conscious of this problem, which is particularly acute when applying large stylesheets to small documents, and addressing this is a major focus for the next major Saxon release. Some of my early thinking on this is in a blog article here:

http://dev.saxonica.com/blog/mike/2016/06/improving-compile-time-performance.html

Your posting this also reminded me that I started work a while ago on a follow-up article explaining what we are doing in more detail, an article which remains unfinished and unpublished. It's basically three things:

(a) more efficient implementation of some of the optimization actions, notably loop-lifting

(b) hotspot bytecode generation, where we only generate bytecode for parts of the stylesheet that are executed enough to warrant it

(c) lazy compilation of template rules, where we only process a template rule the first time its pattern is matched: on the theory that massive stylesheets like Docbook and DITA have zillions of template rules processing constructs that don't appear in your average source document.

Exporting the compiled stylesheet to a SEF file and loading from that should definitely make a difference; in my experience it improves compilation time by a factor of 3 or so.
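In s9api terms the round trip looks roughly like this (a sketch with placeholder file names; method names as I recall them from recent 9.7/9.8 releases, so check against your version):

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.*;

    public class SefRoundTrip {
        public static void main(String[] args) throws SaxonApiException, IOException {
            Processor processor = new Processor(true);   // export requires Saxon-EE
            XsltCompiler compiler = processor.newXsltCompiler();

            // one-off: compile the stylesheet and export the compiled form (SEF)
            XsltExecutable compiled = compiler.compile(new StreamSource(new File("library.xsl")));
            try (OutputStream out = new FileOutputStream("library.sef")) {
                compiled.export(out);
            }

            // subsequent runs: reload the exported form instead of recompiling from source
            XsltExecutable reloaded = compiler.loadExecutablePackage(
                    new StreamSource(new File("library.sef")));
            XsltTransformer transformer = reloaded.load();
            // ... set source and destination, then transform() as usual ...
        }
    }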

I'm sure we could do more. But it's certainly true that until fairly recently we have focused far too heavily on run-time performance, and compile-time has been neglected.

RE: How to improve stylesheet loading - Added by Michael Kay over 7 years ago

I should add that XSLT 3.0 packaging is also trying to address this problem, by ensuring that reusable stylesheet libraries only need to be compiled once, and allowing a large fixed library to be compiled independently of its small customisation layers.
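From s9api the intended usage is roughly along these lines (a sketch only; the stylesheet names are placeholders, and the packaging API is still settling down, so treat the method names as indicative):

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class PackageReuse {
        public static void main(String[] args) throws SaxonApiException {
            Processor processor = new Processor(true);
            XsltCompiler compiler = processor.newXsltCompiler();

            // compile the large, stable library (an xsl:package) once per JVM ...
            XsltPackage library = compiler.compilePackage(
                    new StreamSource(new File("fixed-library.xsl")));
            compiler.importPackage(library);

            // ... then compile only the small customisation layer per document;
            // it pulls in the library via xsl:use-package rather than xsl:include
            XsltExecutable perDocument = compiler.compile(
                    new StreamSource(new File("customisation.xsl")));
            // perDocument.load() ... run as usual
        }
    }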

RE: How to improve stylesheet loading - Added by Michael Kay over 7 years ago

As a matter of interest, how is the toolchain control logic implemented: XProc, Ant, Shell, custom Java code, ...?

RE: How to improve stylesheet loading - Added by Thomas Berger over 7 years ago

Hi Michael,

thank you for your quick response.

The toolchain used in production could, to my knowledge, be described as custom Java code running persistently, so my guess is that the other steps in the pipeline already profit from Saxon's caching mechanisms I've been reading about (I presume the stylesheets are still loaded every time, since the toolchain serves several distinct sets of templates, each with its own set of identically named but independently maintained stylesheets).

Each step is in principle reading from and writing to disk, so a minimal optimization would be to intercept the documents produced in the previous two steps and feed them directly into the stylesheet compilation and the source-document parsing of the critical step (roughly as in the sketch below), while serializing them to disk in parallel (perhaps we could even postpone this until processing finishes successfully and we are sure we don't need the documents for debugging). But the actual parsing does not seem to be the issue at hand, and parallelizing the processing chain hasn't been a topic yet (AFAIK several of these toolchains run in parallel; minimizing the real time for processing one document is one goal, but CPU time has to be considered too: I have no actual knowledge, but I would be surprised if there were spare processors just idling all the time).
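The kind of interception I have in mind would look roughly like the following (a sketch only, with invented file names; the real toolchain code is of course more involved, and the same idea applies to the assembled stylesheet as well as to the source document):

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class PipelineInMemory {
        public static void main(String[] args) throws SaxonApiException {
            Processor processor = new Processor(true);
            XsltCompiler compiler = processor.newXsltCompiler();

            // previous step: capture its result as an in-memory tree
            // (serializing to disk could still happen in parallel, or be postponed)
            XsltTransformer previousStep = compiler
                    .compile(new StreamSource(new File("previous-step.xsl"))).load();
            previousStep.setSource(new StreamSource(new File("raw-input.xml")));
            XdmDestination intermediate = new XdmDestination();
            previousStep.setDestination(intermediate);
            previousStep.transform();

            // critical step: feed the captured tree directly as the source document,
            // skipping one serialize/re-parse round trip through the file system
            XsltTransformer criticalStep = compiler
                    .compile(new StreamSource(new File("assembled.xsl"))).load();
            criticalStep.setInitialContextNode(intermediate.getXdmNode());
            criticalStep.setDestination(processor.newSerializer(new File("final-output.xml")));
            criticalStep.transform();
        }
    }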

My experiments with a SEF file had only been sketchy (producing it implies a full parse, so it will never gain you a benefit on the first run), but they gave me the impression that loading it was no improvement over loading the original stylesheet (from which I concluded that the code being scattered over several files doesn't make any difference). As mentioned, it also could not be processed without error (an issue with custom extensions; I haven't yet tested that with 9.7.0.18).

The inclusion mechanisms are currently xsl:include only, although most of them could alternatively be realized by xsl:import. But since we are talking about tens of thousands of fragments, of which a considerable portion is "active" (I mean they may be changed at any time, and on testing stages changes do occur frequently and have to be usable immediately), combining all of them into one library would be unfeasible, and organizing them into a couple of hundred smaller ones would be a huge organisational challenge - so I fear any precompiling has to be done along the lines of the current granularity of file organisation. We currently convert a template from the authoring system into an XSLT stylesheet for the processing chain once and consider this to be a kind of precompilation; storing the templates in some other format friendlier to compilation (once upon a time there was the promise of XSLTC) would be o.k., as long as it can be used via xsl:include or at least xsl:import and doesn't have to be kept in memory all the time.

Loop-lifting and custom extension functions don't harmonize well, at least on first thought, and actually leaving XSLT and XPath 1.0 will give us the chance to get rid of some of our custom functions; but our current approach is more evolutionary, i.e. change the horse first and then gradually teach some new tricks to the dog riding it.

Hotspot bytecode generation: you mean after evaluating a select expression, relating call frequency to code size and weighing it against bytecode generation costs? Sounds cool (but I'm not sure whether this could help in my situation).

Lazy compilation sounds interesting for our case (variants upon variants in one template, and never are all of them selected). Unfortunately our templates do not have many rules (well, perhaps similar optimizations are possible for xsl:for-each ;-). However, each of the xsl:included fragments contains mainly one named template which is called from the main stylesheet. There are fragments we actually know will be used (they are dynamically mentioned in the input data, and we have to build the corresponding xsl:includes into the main stylesheet actually run - one of the reasons why the main processing is done by a disposable stylesheet assembled on the fly), but there are also fragments statically referenced from other stylesheets, and we don't have extra knowledge of whether the current input document will take an execution path which actually calls "the" template of an included stylesheet.

Kind regards, Thomas Berger

RE: How to improve stylesheet loading - Added by Michael Kay over 7 years ago

It might be useful for us to take a look at the generated stylesheet.

Generated code often has characteristics that make it very expensive to compile, simply because it's different from the normal hand-written code that we usually have to deal with. For example we've seen (and solved) cases in the past where performance was pathological for a template containing hundreds of local variable declarations.

Unless this particular stylesheet can take advantage of the 3.0 capability to compile modules independently, it does look like a case where none of the techniques for amortizing compile cost over multiple executions is going to make much difference, and neither is the strategy of eliminating compile cost for code that isn't executed at all. So it would be good to see if the compile cost is simply a consequence of the size of the stylesheet, or if there's some pathological characteristic of the code that causes compilation to be unusually slow. We have some internal mechanisms to instrument compile-time behaviour that could yield insights.

RE: How to improve stylesheet loading - Added by Michael Kay over 7 years ago

This thread continued by email after sample code was supplied offline. Here's a summary of the follow-up conversation:

MK: We're doing a lot of work to improve compile-time performance in 9.8, see recent blog postings.

Cutting out bytecode generation made a massive difference (from 50 seconds to 4 seconds) but the user had already done this.

MK: There's 2 megabytes of XSL code to be compiled here (about the same size as DocBook or DITA), so I don't feel 4 seconds is unreasonable - though of course we would like to get it down.

MK: proposed various ways of reducing the size of the generated code in order to speed up compilation, and/or of structuring the code into packages so that the compile-time costs can be amortized.

TB: suggested doing some analysis of why Xalan compiled the code more quickly. MK: suggested that an XSLT 2.0 processor has to do more type analysis.

Today Radu Coravu has reported a similar problem: see bug #3209. We'll see if we can make any more progress with that one. It might be worth comparing the compile-time paths in 9.6 and 9.7, which I don't think we did for the Berger use case.
