Bug #5910
closed
Double reporting of template execution in timing profile
Applies to branch:
11, 12, trunk
Fix Committed on Branch:
11, 12, trunk
Fixed in Maintenance Release:
Description
In performance testing on a particular XML + XSLT combination, I noticed it was taking roughly twice as long on the command line as it was in Oxygen. I did a profile and notice that the initial template mode matching the document was being visited twice in the command line version, but only once in Oxygen (which is running Saxon 10.6).
I replicated the anomaly with the simplest of XSLT files, and, with profiling on, found that Saxon EE 9.9.1.8 and 10.9 on the command line visited the document template mode only once. But for Saxon EE 11.5 and 12.0 the document template mode was visited twice.
I've confirmed the above on two different XML + XSLT combinations on the command line across the four products.
Did you profile in oXygen with 10 EE as well? Or was that HE perhaps?
I have now profiled the stylesheet
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="/">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
with 11.5 HE and EE and 10.9 HE and EE and indeed both EE versions show a count of 2 for matching /
e.g.
module line instruction count average time (gross/ms) total time (gross/ms) average time (net/ms) total time (net/ms)
sheet1.xsl 9 xsl:template / 2 4.218 8.436 2.136 4.272
while the both HE versions show a count of 1 for matching /
e.g.
module line instruction count average time (gross/ms) total time (gross/ms) average time (net/ms) total time (net/ms)
sheet1.xsl 9 xsl:template / 1 4.008 4.008 4.008 4.008
No idea why EE vs. HE means count 2 vs. count 1 for /
, I guess Michael can explain that.
In all cases I'm working with EE, because that's the product our shop has licensed and uses.
I tried to put together a simple example that I could paste right in the ticket, but the optimizations made that challenging to do. (I want to demonstrate that even with all optimizations on, something is not quite right.) I will send the original files under separate cover.
There is something wrong with the injection of calls to the timing profiler, such that the body of the template rule, instead of being wrapped in a single trace call
trace(bodyExpresion)
is being doubly-wrapped
trace(trace(bodyExpression))
Which means there are two entries recorded to the template, and two exits, resulting in double counting of the execution time.
I'm pretty sure the problem is specific to the -TP timiing profiler, and there is no issue when running with profiling off. But I haven't completely worked out what's going wrong. In particular I can't see any obvious reason for it being EE-specific.
OK, so I guess two issues, one being overcounting in the profiling, and the other is something yet TBD.
Because I can't release the background materials under the stipulated terms of the support board, I've sent them separately.
The first call to inject tracing code happens in XSLTemplate.compileTemplateRule()
. The second happens in TemplateRuleInitializer.init()
. At first sight the second call seems unnecessary - but it's hard to be 100% sure. A safe fix would be to ensure that (for the TimingTraceListener at least) injecting a trace expression a second time has no effect.
This eliminates the double counting.
Note that the absolute timings when running from the command line are largely meaningless unless it's a task that takes 30 seconds or more, because it takes that long for JVP hotspot optimisation to settle down. It's mainly the counts that are useful, as well as relative timings.
Moving on from the profiling problem to the actual workload performance, I see the following results under SaxonJ. This is running with Java 11 from the command line, in each case with -t -repeat:5
9.9.1.8 - 20.107s
10.9 - 20.150s
11.5 - 21.518s
11.5 with bytecode disabled - 18.098s
12.0 - 34.231s
So there is indeed a performance regression here, which we need to investigate. Moreover, it has nothing to do with the fact that bytecode generation was dropped in 12.0 (on the contrary, the 11.5 results confirm that, for this workload, this was the right thing to do.)
Now some figures for SaxonCS. This is running from the command line, but within Rider on MacOS, and on the current development branch rather than released product. But the same options (-t -repeat:5)
SaxonCS 11 - 55.389s
SaxonCS 12 - 48.734s
These figures are not a great surprise: we know that the work we did on expression elaboration in 12.x gave much more benefit in C# than it did in Java. The ratio of SaxonCS to SaxonJ for the same workload is also not untypical. (A caveat is that we are measuring timings using different instrumentation APIs so the figures might not be directly comparable).
I'm moving discussion of the performance regression to a new issue #5914 so that we can track closure of the timing profile problem separately in this issue.
- Subject changed from Double evaluation of document node on command line, Saxon 11+ to Double reporting of template execution in timing profile
- Category set to Diagnostics
- Status changed from New to In Progress
- Assignee set to Michael Kay
- Applies to branch 11, 12, trunk added
- Platforms Java added
- Status changed from In Progress to Resolved
- Fix Committed on Branch 11, 12, trunk added
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.1 added
Bug fix applied in the Saxon 12.1 maintenance release.
Please register to edit this issue
Also available in: Atom
PDF