Execution time from SEF exceeds execution time from XSLT source
A user has submitted a test case where execution from XSLT source takes 1.64s, while execution from SEF takes 1.96s.
#1 Updated by Michael Kay 21 days ago
-TP output for the XSLT case shows the dominant code is a tail-recursive named template find_inherited_classifier at line 655 of cim16_pre_process.xslt.
First attempt to step through both cases in the debugger shows no obvious differences in execution path. In both cases tail recursion is being used; in both cases the key name has been resolved statically to a key definition.
The Java hprof output for the two cases shows no glaring differences.
#2 Updated by Michael Kay 21 days ago
Timings in development environment, 11 branch: Source XSLT 1.773s, SEF 2.014s.
Tried a few counters, chosen fairly randomly:
SEF counters TinyElementImpl.copy = 116350 MemoClosure = 201716 FilterIterator.getNextMatchingItem() = 32219 CallTemplate.process = 48299 XSLT counters TinyElementImpl.copy = 116248 MemoClosure = 217474 FilterIterator.getNextMatchingItem() = 32211 CallTemplate.process = 96367
Clearly we need to explore why
CallTemplate.process is being called twice as often in the source XSLT case.
I added some more counters and the others are all much the same for the two cases:
NamedTemplate.expand() = 217480 CallTemplatePackage.processLeavingTail = 121217 new CallTemplatePackage() = 121215 CallTemplate.process = 96377 CallTemplate.processLeavingTail() = 121241
An unexpected observation though is that in the SEF case, the figures are the same every time, whereas in the source XSLT case, they differ slightly from one run to the next. I'm wondering if
xsl:result-document multi-threading plays a role?
#3 Updated by Michael Kay 21 days ago
If I disable multithreading, the XSLT and SEF test cases now show essentially the same performance, though the discrepancy in the
CallTemplate.process counter remains.
Investigating with the debugger, it seems that multithreading is not being used in the SEF case.
This would seem to explain the performance difference. It would be nice also to know why the counter is different...
#4 Updated by Michael Kay 21 days ago
Counter for variable on line 367
XSLT: let $inherited_attribute_classifier = 96307 SEF: let $inherited_attribute_classifier = 48272
Counter for xsl:apply-templates mode="pre_process_attribute" (line 512)
XSLT: Mode pre_process_attribute = 16012 SEF: Mode pre_process_attribute = 99
but in both cases the count for
apply-templates mode="pre_process_class" = 8 apply-templates mode="pre_process" = 4
#5 Updated by Michael Kay 21 days ago
I have now fixed the SEF export/import so that xsl:result-document is asynchronous in both cases, and there is no longer any performance discrepancy.
But there are still substantial differences in the counters, and I would like to know why. I suspect some difference in the strategy for lazy evaluation of variables. Alternatively, it could be bytecode generation?
I'm looking at the apply-templates instruction on line 512, which appears (now) to be executed 16000 times in the XSLT case and not at all the SEF case. Yes, it does seem to be an effect of bytecode generation; the count is high in the SEF case when bytecode is disabled.
Since the default behaviour of bytecode generation is always going to be a bit unpredictable, especially when combined with multi-threading, and since this is having no adverse effect on performance, I think I won't pursue this any further.
#6 Updated by Michael Kay 21 days ago
- Status changed from New to In Progress
I've now turned my attention to the second stylesheet, and (in the development branch) this is still showing faster execution from the .xslt file than the .sef - 585ms vs 661ms. With multithreading disabled, the timings are 664ms vs 632ms.
On the 10 branch, with multithreading disabled, the timings are 589 vs 624; with multithreading enabled, 551 vs 581.
The difference looks significant, but at about 7% it's not easy to investigate the causes.
There aren't any glaring differences in the Java hprof profile. One small difference is that the SEF case appears to call KeyIndex.processNode more frequently. Adding a counter, we get 136458 calls in both cases, so this appears to be spurious. Unfortunately it isn't possible to get a -TP profile in the SEF case, because this relies on trace instructions being compiled into the SEF file.
The -TP output for the XSLT case has a bit of an oddity: it shows a total execution time of 1649.077ms, but the only entries in the breakdown show 299ms in add_model_ref on line 254, and less than 1ms in merge_on_ids on line 469. I wonder if this is the result of tail call optimisation?
Yes, sure enough, if I disable tail call optimisation, I get a profile with (dramatically) more information.
With tail call optimisation disabled, I get execution time of 585ms from XSLT source, 613ms from SEF, so this doesn't explain the difference.
Adding counters for all local variable and call-template evaluation gives no differences.
Please register to edit this issue