Bug #6090
closedPerformance regression SaxonCS 12
Added by Joel Kalvesmaki over 1 year ago. Updated about 1 year ago.
100%
Description
The attached input and XSLT stylesheets show evidence for significant performance regression in SaxonCS, especially 12.
Method | Environment | Saxon prod | Saxon Version | Time (ms) |
---|---|---|---|---|
Oxygen | java | PE | 9.9 | 50000 |
CLI | java | HE | 10.6 | 46564 |
CLI | .net | EE | 10.9 | 158244 |
CLI | .net | EE | 11.5 | 264665 |
CLI | .net | EE | 12.2 | 596690 |
I have determined that something significant is happening in the process from $pass-1b to $pass-2a. When outputting only $pass-2a, and commenting out the templates in mode pass-2a-bills with priorities 99 and 100 (lines 108, 122 in the pass2 file), results are faster, but the penalties under C# 12.2 are more pronounced:
Method | Environment | Saxon prod | Saxon Version | Time (ms) |
---|---|---|---|---|
Oxygen | java | PE | 9.9 | 14900 |
CLI | java | HE | 10.6 | 17206 |
CLI | .net | EE | 10.9 | 51643 |
CLI | .net | EE | 11.5 | 96078 |
CLI | .net | EE | 12.2 | 206967 |
I did what I could to ensure that the tests were identical in all ways except the processor and environment, as noted above.
The regression is noticeable only in files megabytes in size. Smaller files are processed quite rapidly in SaxonCS.
Not sure if this situation is comparable to #5914 but I cite it for the sake of quick cross-reference.
Files
performance.zip (631 KB) performance.zip | Joel Kalvesmaki, 2023-06-23 02:12 | ||
6201.PCF.xml (43.6 KB) 6201.PCF.xml | Joel Kalvesmaki, 2023-07-05 22:07 | ||
6211.PCF.xml (74.2 KB) 6211.PCF.xml | Joel Kalvesmaki, 2023-07-05 22:07 | ||
6651.PCF.xml (24.9 KB) 6651.PCF.xml | Joel Kalvesmaki, 2023-07-05 22:07 | ||
6652.PCF.xml (48.9 KB) 6652.PCF.xml | Joel Kalvesmaki, 2023-07-05 22:07 | ||
slax.rng (30.5 KB) slax.rng | Joel Kalvesmaki, 2023-07-06 00:05 | ||
input-reduced.xml (467 KB) input-reduced.xml | Matt Patterson, 2023-07-06 12:21 |
Updated by Vladimir Nesterovsky over 1 year ago
For the reference, per my older request Performance of SaxonHE 10.6 vs 11.2 the regression has started after 10.6 even in java.
Updated by Michael Kay over 1 year ago
Thanks for reporting it. You're clearly doing something that our performance tests don't cover!
Updated by Michael Kay over 1 year ago
I'm sorry, I've only just got around to investigating this (we decided to get 12.3 out first, and it took longer than it should have). And now I find that I can't actually run the stylesheet:
Error in expression in xsl:variable/@select on line 46 column 106 of slax2typesetting-pass1.xsl:
FODC0002 I/O error reported by XML parser processing
file:/Users/mike/bugs/2023/6090-Kalvesmaki/SlaxBuilders/formats/source%20as%20xml/6201.PCF.xml: /Users/mike/bugs/2023/6090-Kalvesmaki/SlaxBuilders/formats/source as xml/6201.PCF.xml (No such file or directory): /Users/mike/bugs/2023/6090-Kalvesmaki/SlaxBuilders/formats/source as xml/6201.PCF.xml (No such file or directory)
Could you supply the missing files please?
I'm probably going to start by investigating the performance on Java because we have a lot more tooling, instrumentation, and experience on that platform. I'll then move over to .NET to see if it's any different.
The fact that it's running 3 times slower on .NET than on Java is not something that greatly surprises us, we've seen the same effect ourselves, and it's true both of the old Saxon/.NET product built using IKVMC and the new SaxonCS product built by transpiling. We have no idea why it should be the case, despite much investigation.
In our own measurements the "elaboration" mechanism introduced in Saxon 12 gave a very small (0-5%) speed-up in Java, and a much larger (20%) speed-up in SaxonCS. Clearly you're not seeing that with this workload.
Updated by Joel Kalvesmaki over 1 year ago
- File 6201.PCF.xml 6201.PCF.xml added
- File 6211.PCF.xml 6211.PCF.xml added
- File 6651.PCF.xml 6651.PCF.xml added
- File 6652.PCF.xml 6652.PCF.xml added
I'm not terribly surprised by the Java - C# difference, and I'm sure with some time the differential will start to diminish.
I think you'll find that for this input file, four of those PCF.xml files are requested, so I've attached them. I trust you'll reconfigure the directories as needed. Let me know if I can supply anything else helpful.
Updated by Michael Kay over 1 year ago
Thanks. It's now looking for file:/Users/mike/bugs/2023/6090-Kalvesmaki/schemas/slax.rng
Updated by Joel Kalvesmaki over 1 year ago
attached
Updated by Michael Kay over 1 year ago
OK, thanks, now running the test successfully. First results:
SaxonJ-EE 12.3 - 156380ms
SaxonJ-EE 11.5 - 120141ms
Not as big a difference as you're reporting, but enough to be worth investigating.
Updated by Michael Kay over 1 year ago
I've run both versions with -TP:profile.html and there's one difference in the profiles that immediately stands out: function slax:batch-replace
is called 99 times under SaxonJ-11, but 101,858 times under SaxonJ-12. The function doesn't itself account for a significant amount of the processing time, but the difference in execution counts indicates an adverse change in the execution strategy adopted by the optimizer.
Updated by Matt Patterson over 1 year ago
- File input-reduced.xml input-reduced.xml added
Reduced input file
Updated by Michael Kay over 1 year ago
I added internal instrumentation and it appears that the anomaly in the counter for slax:batch-replace
is spurious: there are actually 101K calls in both 11.x and 12.x, but the tracing for 11.x isn't counting "push-mode" function calls for some reason.
On a whim, we also tried running with the call on normalize-unicode()
at line 834 of the pass-1 stylesheet removed; it made no significant difference.
Updated by Michael Kay over 1 year ago
I have been comparing the SEF output for 11.x against 12.x
(To do this, I wrote a transformation that sorted the SEF into a canonical order and deleted attributes such as component ID that are allocated arbitrarily; then compared using the Oxygen XML diff tool).
I found very few significant differences. The only ones with apparent performance implications are:
(a) 11.x uses an indexed filter for the first predicate on line 634 of pass2.xsl (in template match="para"), 12.x doesn't.
(b) 12.x uses an indexed filter for the predicate on line 585 of pass2.xsl (in function slax:bill-leaderwork-key-item, which gets inlined into the calculation of the grouping key at line 592).
Judging from the -TP output, neither of these differences are on performance-critical paths.
The other structural differences in the SEF execution plans are mainly rearrangement of multiple predicates in a filter expression (or rewriting to a single predicate using an "and" operator) which seem unlikely to make a major difference.
Updated by Michael Kay over 1 year ago
Running under Profiler, in 12.x I'm seeing a strong hotspot (33%) at ValueComparisonElaborator.elaborateForBoolean()
and another (25%) at PullElaborator.elaborateForItem$1
After that we get NamedChildIterator.next()
(11%), StringTool.getStringLength()
(8%), and StringValue.equals()
(7%)
The calls to ValueComparison are coming from FilterIterator.nonNumeric.matches()
which it seems is failing to elaborate the predicate expression before evaluating it, causing repeated work (such as cardinality checking) each time the predicate is evaluated.
The calls to PullElaborator.elaborateForItem$1 happen in the course of ValueComparisonElaborator.elaborateForBoolean()
.
Updated by Michael Kay over 1 year ago
So, why is ValueComparison not using the elaboration mechanism? Because someone is calling FilterExpression.iterate() without elaborating it first. This seems to originate with GeneralNodePattern.isSelected()
.
It's worth noting here that a GeneralNodePattern
is bad news - it means that we've got a complex match pattern without a fast evaluation strategy. The pattern in question is para[@format eq '6651'][@loc eq '46'][string-length(.) gt 0][1]
on line 306 of pass2.xsl. It's the final '[1]' that causes the problems. A better way of handling special cases like this would be to bind a global variable:
<xsl:variable name="para-6651-46-first" select="//para[@format eq '6651'][@loc eq '46'][string-length(.) gt 0][1]"/>
and then write the pattern as match="$para-6651-46-first"
. Saxon can't do this optimization because the pattern matches nodes in any document, not just the principal input document.
But that's a side note, our concern is with Saxon inefficiencies not with stylesheet inefficiencies.
The method GeneralNodePattern.isSelected()
takes an expression as argument; the expression should be elaborated first. Unfortunately, though, the elaboration mechanism has never been fully implemented for patterns (as distinct from expressions). But we can do it on a case-by-case basis.
Updated by Michael Kay over 1 year ago
I've changed FilterExpression.iterate()
to elaborate the expression before evaluating it, and this is showing a substantial improvement: execution time down from 156380ms to 103073ms which is significantly better than the 11.x performance.
I think we can do even better by taking a look at GeneralNodePattern
.
Updated by Joel Kalvesmaki over 1 year ago
Thank you, these notes are very informative and helpful.
Updated by Michael Kay over 1 year ago
Changing GeneralNodePattern to retain the pre-elaborated evaluator for the equivalent expression gives a further improvement to 89655ms.
Updated by Michael Kay over 1 year ago
On SaxonCS 12 before applying the changes the timing (for me) is 1115.669s. With the changes, this comes down to 513.575s.
On SaxonCS 11 my timing for the same run is 613.894s
Updated by Michael Kay over 1 year ago
I'm now looking further into the possibility of improving the match performance of a GeneralNodePattern.
This currently follows the algorithm in the spec: there's an equivalent expression EE, and to match a node N we search for an ancestor A such that evaluating EE with A as the context item returns a node-set that includes the target node N.
But if the pattern has the form para[@format='X'][last()]
then we ought to be able to do better. We should be able to extract the predicate [@format='X']
as a precondition, and avoid the ancestor search if the precondition is false.
Moreover/Alternatively, if the equivalent expression is only looking at the child axis (as here), then the ancestor search can be truncated to only look at the immediate parent.
Updated by Michael Kay over 1 year ago
In attempting to implement improvements, I found a bug: XSLTempate.register
makes a copy M1 of the match pattern M (needed because it may be changed) and stores the copy M1 in the RuleManager. But it doesn't (at least not on all paths) save the copied version in the XSLTemplate.match field. Subsequently XSLTemplate.optimize() calls match.optimize() on the original pattern M to produce M2, but the RuleManager still holds M1 rather than M2. So any changes made during the optimize() call have no effect.
At any rate, this is what happens when deferred template rule compilation is set, which is the default in Saxon-EE.
I've changed XSLTemplate.optimize()
to put the optimized pattern in the TemplateRuleEE
object, but that's not enough: it also needs to go in the Rule
object. Fortunately we can get from the TemplateRuleEE to the relevant Rule objects.
Updated by Michael Kay over 1 year ago
I now get a weird error "Cannot validate a NOTATION without a namespace resolver". If I run it in the debugger, it changes to "Cannot validate a QName without a namespace resolver".
We seem to be evaluating the troublesome GeneralNodePattern here; I suspect the problem is that it hasn't been type-checked. No, that doesn't help.
It's failing in an internal function map:untypedContains() which attempts to see if an untypedAtomic value is present in a map, by converting to all the types that might be present in the map. This call results from optimization of a general comparison, specifically
(6211, 6212, 6213, 6214, 6411, 6412, 6413, 6414, 6311, 6313) = data(@format)
We tackle this by building a map and testing for contains(), but because data(@format)
is untyped, we have to be prepared to convert to the target type. In fact all these values are strings, except the final one (6313) which is an integer. We are converting to every type that might be present in the map, in turn, and for some reason we think that this includes QNames and Notations.
When we first build the map, the type data (keyUType property) is correct (string|decimal) but when we come to do the lookup, it has been changed to a universal atomic type.
This is happening as a side-effect of a call on HashTrieMap.getItemType()
, called by TypeChecker.staticTypeCheck()
.
There are a number of issues here. Firstly, we shouldn't really be calling getItemType() - it's intended for diagnostic use, and the result isn't particularly well defined because an item conforms to many different types. Secondly, we're computing it badly in the case of a map: if there's an integer and a string, we conclude that it's anyAtomicType, which is why we end up thinking the map might contain a QName.
I've changed HashTrieMap.getItemType() to form a UType union, which captures the type more precisely.
Updated by Michael Kay over 1 year ago
With these changes, the test run is taking 9437ms, down from 156380ms.
Yes, I checked: the output is identical to the 11.x output.
Updated by Michael Kay over 1 year ago
- Category set to Performance
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch 12, trunk added
- Fix Committed on Branch 12, trunk added
- Platforms Java added
Updated by Joel Kalvesmaki over 1 year ago
Amazing, to be able to reduce performance time by more than an order of magnitude. Many thanks!
Updated by O'Neil Delpratt about 1 year ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.4 added
Bug fix applied in the Saxon 12.4 maintenance release
Please register to edit this issue