Bug #4867

Execution time from SEF exceeds execution time from XSLT source

Added by Michael Kay 21 days ago. Updated 21 days ago.

In Progress

A user has submitted a test case where execution from XSLT source takes 1.64s, while execution from SEF takes 1.96s.


#1 Updated by Michael Kay 21 days ago

-TP output for the XSLT case shows the dominant code is a tail-recursive named template find_inherited_classifier at line 655 of cim16_pre_process.xslt.

A first attempt at stepping through both cases in the debugger shows no obvious differences in execution path. In both cases tail recursion is being used; in both cases the key name has been resolved statically to a key definition.

The Java hprof output for the two cases shows no glaring differences.

#2 Updated by Michael Kay 21 days ago

Timings in development environment, 11 branch: Source XSLT 1.773s, SEF 2.014s.

Tried a few counters, chosen fairly randomly:

SEF counters
TinyElementImpl.copy = 116350
MemoClosure = 201716
FilterIterator.getNextMatchingItem() = 32219
CallTemplate.process = 48299

XSLT counters
TinyElementImpl.copy = 116248
MemoClosure = 217474
FilterIterator.getNextMatchingItem() = 32211
CallTemplate.process = 96367

Clearly we need to explore why CallTemplate.process is being called twice as often in the source XSLT case.
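Counters of the kind listed above can be collected with a small static helper. The following is only a minimal sketch: the class name `Counters` and its methods are hypothetical and are not part of Saxon, and it assumes you can add a one-line call at the top of each method of interest. It uses `LongAdder` so that increments are not lost when several threads are active.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical ad hoc instrumentation helper; not a Saxon class. */
final class Counters {
    // One adder per counter name; safe to update from multiple threads.
    private static final ConcurrentHashMap<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

    private Counters() {}

    /** Increment the named counter, creating it on first use. */
    static void bump(String name) {
        COUNTS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    /** Current value of the named counter, or 0 if it was never bumped. */
    static long get(String name) {
        LongAdder adder = COUNTS.get(name);
        return adder == null ? 0L : adder.sum();
    }

    /** Sorted copy of all counters, suitable for printing at the end of a run. */
    static Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        COUNTS.forEach((k, v) -> out.put(k, v.sum()));
        return out;
    }

    /** Clear all counters between runs. */
    static void reset() {
        COUNTS.clear();
    }

    public static void main(String[] args) {
        // Stand-in for instrumented code paths.
        for (int i = 0; i < 3; i++) {
            bump("CallTemplate.process");
        }
        bump("MemoClosure");
        snapshot().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In use, a call such as `Counters.bump("CallTemplate.process")` would go at the top of the instrumented method, with `snapshot()` printed once the transformation finishes.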

I added some more counters and the others are all much the same for the two cases:

NamedTemplate.expand() = 217480
CallTemplatePackage.processLeavingTail = 121217
new CallTemplatePackage() = 121215
CallTemplate.process = 96377
CallTemplate.processLeavingTail() = 121241

An unexpected observation though is that in the SEF case, the figures are the same every time, whereas in the source XSLT case, they differ slightly from one run to the next. I'm wondering if xsl:result-document multi-threading plays a role?

#3 Updated by Michael Kay 21 days ago

If I disable multithreading, the XSLT and SEF test cases now show essentially the same performance, though the discrepancy in the CallTemplate.process counter remains.

Investigating with the debugger, it seems that multithreading is not being used in the SEF case.

This would seem to explain the performance difference. It would be nice also to know why the counter is different...

#4 Updated by Michael Kay 21 days ago

Counter for variable on line 367

XSLT: let $inherited_attribute_classifier = 96307
SEF:  let $inherited_attribute_classifier = 48272

Counter for xsl:apply-templates mode="pre_process_attribute" (line 512)

XSLT: Mode pre_process_attribute = 16012
SEF: Mode pre_process_attribute = 99

In both cases, however, the counts for the following are identical:

  apply-templates mode="pre_process_class" = 8
  apply-templates mode="pre_process" = 4
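When many counters are in play, picking out the ones that diverge can be automated. The sketch below is illustrative only: the class `CounterDiff`, the method `divergent`, and the threshold of 1.5 are assumptions made for this example, and the input values are simply the figures quoted in this note.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for comparing counter sets from two runs. */
final class CounterDiff {
    private CounterDiff() {}

    /**
     * Returns name -> ratio (first / second) for every counter whose ratio
     * is at least `threshold` or at most 1 / `threshold`.
     * Counters missing from, or zero in, the second map are skipped in this sketch.
     */
    static Map<String, Double> divergent(Map<String, Long> first,
                                         Map<String, Long> second,
                                         double threshold) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : first.entrySet()) {
            Long other = second.get(e.getKey());
            if (other == null || other == 0L) {
                continue;
            }
            double ratio = (double) e.getValue() / other;
            if (ratio >= threshold || ratio <= 1.0 / threshold) {
                out.put(e.getKey(), ratio);
            }
        }
        return out;
    }

    /** Counter values for the source XSLT run, as quoted in this note. */
    static Map<String, Long> xsltCounts() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("let $inherited_attribute_classifier", 96307L);
        m.put("Mode pre_process_attribute", 16012L);
        m.put("apply-templates pre_process_class", 8L);
        m.put("apply-templates pre_process", 4L);
        return m;
    }

    /** Counter values for the SEF run, as quoted in this note. */
    static Map<String, Long> sefCounts() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("let $inherited_attribute_classifier", 48272L);
        m.put("Mode pre_process_attribute", 99L);
        m.put("apply-templates pre_process_class", 8L);
        m.put("apply-templates pre_process", 4L);
        return m;
    }

    public static void main(String[] args) {
        divergent(xsltCounts(), sefCounts(), 1.5)
            .forEach((k, v) -> System.out.printf("%s: XSLT/SEF ratio %.2f%n", k, v));
    }
}
```

With these inputs only the variable on line 367 and the pre_process_attribute mode are reported; the two counters that match exactly are filtered out.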

#5 Updated by Michael Kay 21 days ago

I have now fixed the SEF export/import so that xsl:result-document is asynchronous in both cases, and there is no longer any performance discrepancy.

But there are still substantial differences in the counters, and I would like to know why. I suspect some difference in the strategy for lazy evaluation of variables. Alternatively, it could be bytecode generation?

I'm looking at the apply-templates instruction on line 512, which appears (now) to be executed 16000 times in the XSLT case and not at all in the SEF case. Yes, it does seem to be an effect of bytecode generation; the count is high in the SEF case when bytecode generation is disabled.

Since the default behaviour of bytecode generation is always going to be a bit unpredictable, especially when combined with multi-threading, and since this is having no adverse effect on performance, I think I won't pursue this any further.

#6 Updated by Michael Kay 21 days ago

  • Status changed from New to In Progress

I've now turned my attention to the second stylesheet, and (in the development branch) this is still showing faster execution from the .xslt file than the .sef - 585ms vs 661ms. With multithreading disabled, the timings are 664ms vs 632ms.

On the 10 branch, with multithreading disabled, the timings are 589ms vs 624ms; with multithreading enabled, 551ms vs 581ms.

The difference looks significant, but at about 7% it's not easy to investigate the causes.
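Differences of this size are easily swamped by JIT warm-up and run-to-run noise, so one way to make them more trustworthy is to repeat each measurement and compare medians. The harness below is a rough sketch under stated assumptions: `TimingHarness` and its methods are hypothetical names, and the `Runnable` passed in stands for whatever code runs the transformation from source or from SEF. It is not how the timings in this note were obtained.

```java
import java.util.Arrays;

/** Hypothetical micro-harness for comparing two ways of running the same job. */
final class TimingHarness {
    private TimingHarness() {}

    /** Runs the task `warmups` times untimed, then returns `runs` timings in nanoseconds. */
    static long[] sample(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] out = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            out[i] = System.nanoTime() - start;
        }
        return out;
    }

    /** Median of the samples; the input array is not modified. */
    static double median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Difference of `other` relative to `baseline`, e.g. 0.07 means 7% slower. */
    static double relativeDifference(double baseline, double other) {
        return (other - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Placeholder workloads; replace with the two transformations being compared.
        Runnable fromSource = () -> Math.sqrt(12345.0);
        Runnable fromSef = () -> Math.cbrt(12345.0);
        double a = median(sample(fromSource, 5, 21));
        double b = median(sample(fromSef, 5, 21));
        System.out.printf("relative difference: %.1f%%%n", 100 * relativeDifference(a, b));
    }
}
```

The median is used rather than the mean because a single slow run (for example one hit by garbage collection) would otherwise dominate a small sample.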

There aren't any glaring differences in the Java hprof profile. One small difference is that the SEF case appears to call KeyIndex.processNode more frequently. Adding a counter, we get 136458 calls in both cases, so this appears to be spurious. Unfortunately it isn't possible to get a -TP profile in the SEF case, because this relies on trace instructions being compiled into the SEF file.

The -TP output for the XSLT case has a bit of an oddity: it shows a total execution time of 1649.077ms, but the only entries in the breakdown show 299ms in add_model_ref on line 254, and less than 1ms in merge_on_ids on line 469. I wonder if this is the result of tail call optimisation?

Yes, sure enough, if I disable tail call optimisation, I get a profile with (dramatically) more information.

With tail call optimisation disabled, I get execution time of 585ms from XSLT source, 613ms from SEF, so this doesn't explain the difference.

Adding counters for all local variable and call-template evaluations shows no differences between the two cases.
