Iter Versus Call Template; Byte-Code performance & Memory Issues
Added by David Rudel almost 11 years ago
I recently compared performance of an xsl:iterate based implementation with an xsl:call-template implementation of the same program. I found two observations that may be significant for Michael and am posting them here. I'm using Saxon-EE 9.5.1.2 in oXygen.
The two implementations (attached) give identical output, and the structure of the key block is identical in both scripts.
I'll describe the implementations below, but here are the observations:
- Byte-code generation has no performance effect on the Call-Template implementation. However, it reduces performance in the iterate-based implementation: enabling byte-code generation increases run-time by ~40% in the earliest iterations and ~80% in the later iterations.
- When max RAM is set to about 1450 MB, the Call-Template version ran into problems around iteration 100 or so. This appears to be a peculiar issue with memory management: oXygen did not crash and I did not get any errors. Instead, the program just started running more and more slowly, with a marked increase in CPU work, and performance rapidly degraded.
For whatever it is worth, the situation in the second observation did not cause an actual error even when the Windows paging file was removed entirely; removing the paging file had no noticeable effect on performance. The same effect is seen regardless of byte-code generation.
The xsl:iterate version was able to work through thousands of iterations without performance degradation.
I originally assumed that observation 2 was caused by the tail-call structure of the called template not being recognized, but I would have thought that would have led to an actual heap-overflow error rather than the observed slow-down.
The program itself takes an XML source file that provides various edge weights to a directed graph. It then modifies those edge weights and removes certain edges to achieve a smaller digraph that efficiently allows flow among the nodes using about 40% of the original edges.
In each iteration the current graph [$input] is inspected, the necessary modifications are found and listed in the $mid.iteration variable. These elements are then collected together to allow a new graph [$next] to be created, which serves as the input to the next iteration.
The attached versions are not the best implementation. I've chosen this version because it is relatively simple and illustrates the problems described above. I hope it is helpful to the Saxon developers. The scripts and source file have been scrubbed to be safe for public viewing.
In both scripts $max.iterations controls the number of iterations before the $next variable is returned as the final value (after scaling and some other information is added).
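For readers without the attachments, here is a minimal sketch of the xsl:iterate pattern described above. The parameter and variable names follow the post, but $initial.graph, the graph/edge element names and the adjust mode are invented for illustration; the real stylesheet's logic is more involved.

```xml
<!-- Illustrative skeleton only: the real stylesheet computes $mid.iteration
     from the current graph; here that step is reduced to a placeholder. -->
<xsl:iterate select="1 to $max.iterations">
  <xsl:param name="input" select="$initial.graph" as="element(graph)"/>
  <xsl:on-completion select="$input"/>
  <xsl:variable name="mid.iteration" as="element(edge)*">
    <!-- inspect $input and list the edge modifications for this pass -->
    <xsl:apply-templates select="$input/edge" mode="adjust"/>
  </xsl:variable>
  <xsl:variable name="next" as="element(graph)">
    <graph>
      <xsl:sequence select="$mid.iteration"/>
    </graph>
  </xsl:variable>
  <xsl:next-iteration>
    <xsl:with-param name="input" select="$next"/>
  </xsl:next-iteration>
</xsl:iterate>
```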
Replies (5)
RE: Iter Versus Call Template; Byte-Code performance & Memory Issues - Added by Michael Kay almost 11 years ago
Thanks for letting us know about it - we'll try and investigate the causes.
First though, to reproduce your measurements we need the missing functions.xsl file.
RE: Iter Versus Call Template; Byte-Code performance & Memory Issues - Added by David Rudel almost 11 years ago
I actually don't think any of the functions/templates in functions.xsl are used by these templates, but just in case, here you go.
functions.xsl (9.8 KB)
RE: Iter Versus Call Template; Byte-Code performance & Memory Issues - Added by Michael Kay almost 11 years ago
I think I can explain, without first reproducing it, why the call-template version should use more heap space.
Note that tail-call optimization works differently for templates and for functions. What I am saying here applies to xsl:call-template and not necessarily to tail function calls or to tail calls on xsl:apply-templates.
There are in effect two stacks around at run-time: the Java (JVM) stack, and the XSLT stack. Tail-call optimization for templates prevents a tail call consuming space on the Java stack, but it does not prevent it consuming space on the XSLT stack (which is held in the Java heap). Since local XSLT variables are held in the XSLT stack, this may consume a significant amount of memory.
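To make that concrete, here is a hedged sketch (not the poster's actual code) of a tail-recursive named template. Saxon recognizes the xsl:call-template in final position as a tail call, so the Java stack does not grow, but the local variable $next in each pending invocation still occupies space on the XSLT stack in the heap:

```xml
<xsl:template name="process">
  <xsl:param name="input" as="element(graph)"/>
  <xsl:param name="count" select="1" as="xs:integer"/>
  <!-- local variable: held on the XSLT stack for every pending call -->
  <xsl:variable name="next" as="element(graph)">
    <graph>
      <xsl:apply-templates select="$input/edge" mode="adjust"/>
    </graph>
  </xsl:variable>
  <xsl:choose>
    <xsl:when test="$count ge $max.iterations">
      <xsl:sequence select="$next"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- tail call: no Java stack growth, but XSLT stack usage accumulates -->
      <xsl:call-template name="process">
        <xsl:with-param name="input" select="$next"/>
        <xsl:with-param name="count" select="$count + 1"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```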
It would be interesting to see how an implementation using tail function calls compares (or for that matter, one using fold-left()). Tail call optimization for functions works differently - it compiles the recursive function into a loop. As a result, it only works for self-recursion (which means it is not applicable to the xsl:apply-templates case, though the same technique could be used for a self-recursive named template).
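For comparison, a sketch of the same loop written as a self-recursive function, which Saxon can compile into a loop as described. The f prefix and the f:step() helper (one graph-rewriting pass) are invented for illustration:

```xml
<xsl:function name="f:process" as="element(graph)">
  <xsl:param name="input" as="element(graph)"/>
  <xsl:param name="count" as="xs:integer"/>
  <!-- self-recursive tail call: compiled into a loop, so neither stack grows -->
  <xsl:sequence select="if ($count gt $max.iterations)
                        then $input
                        else f:process(f:step($input), $count + 1)"/>
</xsl:function>
```

A fold-left() formulation of the same loop would look something like fold-left(1 to $max.iterations, $initial.graph, function($g, $i) { f:step($g) }).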
I'm more concerned about the bytecode effects. Running with bytecode enabled should never make performance worse.
RE: Iter Versus Call Template; Byte-Code performance & Memory Issues - Added by Michael Kay almost 11 years ago
I'm not getting very far in determining why the bytecode version is slower (in my case, about 50% slower).
Java CPU tracing shows that in the bytecode case, a very large number of the hits are in generated code. That's a sign that bytecode generation is making a difference, but it can be a difference either way. Unfortunately it's very difficult to determine where in the bytecode the hotspot is; the trace indicates line 197 of the XSLT code, but there's no sign of anything unusual happening there. I've traced the bytecode and it all seems to be running as expected.
I've traced how many MemoClosures get created, which is sometimes an indicator of differences in execution strategy, and it's identical in both cases.
RE: Iter Versus Call Template; Byte-Code performance & Memory Issues - Added by David Rudel almost 11 years ago
I decided to take a whack at this this morning. I simplified as much as I could (without any regard to keeping the same output) to whittle the script down to the simplest thing that would still show a difference in performance.
I cut down the script significantly and was still seeing roughly a 40-50% difference in performance, so I tried changing the keys to maps, and that made no difference.
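For illustration of the kind of change meant here (this is not the actual script; @id, the element names, and the xmlns:map binding are assumed, and I'm showing the final XPath 3.1 map functions, so the exact map API available in Saxon 9.5 may differ slightly), a key lookup versus a map lookup:

```xml
<!-- key-based lookup -->
<xsl:key name="edge-by-id" match="edge" use="@id"/>
<xsl:sequence select="key('edge-by-id', $edge-id, $input)"/>

<!-- map-based lookup: build a lookup table once, assuming @id is unique;
     requires xmlns:map="http://www.w3.org/2005/xpath-functions/map" -->
<xsl:variable name="edge-map" as="map(*)"
    select="map:merge(for $e in $input//edge
                      return map:entry(string($e/@id), $e))"/>
<xsl:sequence select="$edge-map($edge-id)"/>
```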
Finally, I tried replacing the innermost xsl:iterate with a single-line sum(for ... return ...) expression. This led to interesting results that I hope are useful.
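As an illustration of that rewrite (the edge/@weight names are invented), an inner xsl:iterate that accumulates a total can be collapsed into one XPath expression:

```xml
<!-- accumulating with an inner xsl:iterate -->
<xsl:iterate select="$input/edge">
  <xsl:param name="total" select="0e0" as="xs:double"/>
  <xsl:on-completion select="$total"/>
  <xsl:next-iteration>
    <xsl:with-param name="total" select="$total + @weight"/>
  </xsl:next-iteration>
</xsl:iterate>

<!-- equivalent single-expression form -->
<xsl:sequence select="sum(for $e in $input/edge return $e/@weight)"/>
```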
The two attached scripts give identical output and they both show a performance degradation when byte-code is switched on. However, the "iterate" version shows a much more significant degradation. The difference in run-time on my machine is about 40-45%. The difference in run-time for the non-iterate version is only about 10-15%.
I'm hopeful that these attachments are useful for either of two reasons:
- They are both much simpler than the original, so analysis of them in their own right might be easier.
- They show a contrast in performance pointing to a specific line/block of code.