Bug #6446 (Closed): Xquery with "Group By" Is Slower in Saxon 11
Description
Hi Saxonica, We are currently using Saxon 9.9 and are trying to migrate to Saxon 11.6; however, when using a "group by", we have noticed a significant performance regression. The slowdown for a simple XQuery is around 10-15%; with more complex XQueries, it is around 20-30%.
Please find attached a sample XQuery and a sample XML file. On average, it took 474ms with Saxon 9 and 540ms with Saxon 11, a ~14% increase.
Thank you for your time and help.
Files
testgroupby.xquery (3.71 KB) | Kay L, 2024-06-05 17:10
testgroupby.xml (4.14 KB) | Kay L, 2024-06-05 17:10
testgroupby_saxon11.html (3.76 KB) | Kay L, 2024-06-06 11:42
testgroupby_saxon99.html (3.74 KB) | Kay L, 2024-06-06 11:42
testgroupby500.txt (35 KB) | Kay L, 2024-06-10 11:21
Updated by Martin Honnen 6 months ago
Hi, consider adding details as to which edition of Saxon you use, i.e. whether your performance problems are with HE or PE or EE.
Updated by Michael Kay 6 months ago
Thanks for reporting it. I'll take a look at it but it won't be until next week when we get back from the XML Prague conference.
A 15% performance regression in one path isn't necessarily a bug; optimizers have to choose between different execution plans based on incomplete information and they won't always get it right. We also made some significant changes to Unicode string handling which we know make some things faster and other things slower. But I'll see, first, if I can reproduce your results, and secondly, if there's any obvious explanation for the difference.
As well as the information Martin identified (are you using HE, PE, or EE), it would be useful to know more about how you made the performance measurements, for example whether your figures include XML parsing time and query compilation time.
Updated by Kay L 6 months ago
- File testgroupby_saxon11.html added
- File testgroupby_saxon99.html added
Hi, thank you for your replies. We are using Saxon PE 9.9 and trying to migrate to Saxon PE 11.6. The time measurement includes the XQuery compilation and execution. Using the tracer, for the loop with the group by, Saxon 11 took 28.393ms and Saxon 9 took 24.398ms. Attached are the trace reports.
Updated by Michael Kay 6 months ago
On my first attempt to run this under the development branch, I'm getting a query compile time of 199ms and an execution time of 0.75ms; under 9.9.1.8, I'm getting a compile time of 122ms and an execution time of 1.57ms.
So it could be that what we're seeing here is primarily slower query compilation. There has been a tendency for query compilation time to increase as we find more optimisations.
Is it possible in this workload to reuse the compiled query? It's not uncommon for query compile time to dominate in this way, but in many workloads it doesn't matter because the query is only compiled once and executed many times.
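The compile-once, run-many pattern Michael describes applies to most compiled artifacts on the JVM. As a stand-in illustration using only the JDK's regex engine (not Saxon's API, so the class and method names here are illustrative), the shape of the pattern is:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompileOnce {
    // The expensive compilation step is paid once; Pattern is immutable
    // and thread-safe, so it can be shared across all subsequent runs.
    private static final Pattern GROUP_KEY = Pattern.compile("<c1>([^<]*)</c1>");

    public static String extractKey(String fragment) {
        // Creating a Matcher per use is cheap; recompiling the Pattern is not.
        Matcher m = GROUP_KEY.matcher(fragment);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractKey("<c><c1>CC13</c1><c2>CC22</c2></c>"));
    }
}
```

The same structure applies to an XQuery workload: compile the query once up front, then reuse the compiled form for each execution, paying only the per-run cost inside the loop.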
Updated by Michael Kay 6 months ago
Measuring a bit more carefully, on the development branch I'm seeing query compile time 4.26ms, source document build time 5.05ms, query execution time 0.77ms. Unfortunately Saxon 9.9 isn't set up to deliver that measurement breakdown so it's not possible to do a direct comparison without doing some coding against 9.9.
The figures you're reporting strongly suggest to me that most of what you are measuring is the warm-up cost of the Java VM, which dwarfs the figures for an individual small query. If your production workload is creating a new Java VM for each query, then you really need to fix that; if it isn't, then you need to change the way you are measuring performance.
Java warm-up time tends to increase with each release as the JAR files get bigger, although there was no significant growth between 9.9.1.8 and 11.4. But the time is also highly variable as it depends on the operating system clearing other processes out to free up memory for Java.
At this stage I think we need to establish exactly how you are taking your measurements.
Updated by Kay L 5 months ago
Hi @Michael Kay, thank you for your reply. The test was done on Windows 11 with JDK 11. Excluding the measurement of compilation by enabling the net.sf.saxon.trace.TimingTraceListener, the attached trace reports (testgroupby_saxon11.html and testgroupby_saxon99.html) show an increase of about 16% for the looping function with group by.
Updated by Michael Kay 5 months ago
The TimingTraceListener is useful for finding out which functions in your query are accounting for most of the cost, but the data it gives you for absolute timings is very distorted, because it injects many tracing calls into the code which cost at least as much as the actual query execution. Differences between releases could well arise because we inject more measurement points into the code. This is not reliable data for this purpose.
Updated by Kay L 5 months ago
Hi, Executing the XQuery with the input attached and measuring only the execution time (no compile time included), the average time for Saxon 9 is 64ms and for Saxon 11 is 73.67ms, about a 15% increase. The XQuery is run using SaxonProcessor.run(XMLReader reader, InputSource source, ContentHandler contentHandler), where the InputSource is from a FileInputStream and the ContentHandler is a StringWriter. The test was done on Windows 11 with JDK 11. Thanks.
Updated by Kay L 5 months ago
This is how it is measured in the code:
SaxonProcessor proc = new SaxonProcessor(); proc.compile(xquery); long start = System.currentTimeMillis(); proc.run(reader, source, destination); long end = System.currentTimeMillis(); long duration = end - start; System.out.println(name + ": " + duration);
Thanks.
Updated by Kay L 5 months ago
Apologies for the bad formatting above. The code we are using is:
SaxonProcessor proc = new SaxonProcessor();
proc.compile(xquery);
long start = System.currentTimeMillis();
proc.run(reader, source, destination);
long end = System.currentTimeMillis();
long duration = end - start;
System.out.println(name + ": " + duration);
Updated by Michael Kay 5 months ago
Thanks.
Do you run the code repeatedly to eliminate Java warm-up time? Your figure for execution time is about 100 times larger than mine, which strongly suggests to me that warm-up time is playing a significant role here.
Updated by Kay L 5 months ago
- File testgroupby500.txt added
Hi, running the test code 10x (to eliminate Java warm-up time) and increasing the size of the input to make the difference measurable, the average run is 5.67 for Saxon 9 and 6.667 for Saxon 11, around a 17.65% increase. Please find the sample input file attached.
Updated by Michael Kay 5 months ago
I've now dusted off my Saxon 9.9 development project, and I've constructed a performance unit test using my standard measurement framework, where I've tried to reproduce your code as closely as I can. In particular, it does the source document building and query execution within the measurement loop, and the query compilation outside it.
You need to do a lot more than 10 runs to get the VM to settle down. I'm seeing an average time of 2.257ms for the first 500 runs, and an average of 0.663ms for the next 500.
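The discard-the-warm-up approach Michael describes can be sketched as a minimal harness. This is illustrative only: runQuery here is a synthetic stand-in for the real Saxon call, and the warm-up/measured counts are the ones mentioned above.

```java
public class Bench {
    // Synthetic stand-in for the real work (e.g. building the document and
    // running the query); replace the body with the actual call under test.
    static void runQuery() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) sb.append(i);
        if (sb.length() == 0) throw new IllegalStateException();
    }

    // Run warmup iterations untimed (letting the JIT compile the hot paths),
    // then average only the measured iterations.
    static double averageMillis(int warmup, int measured) {
        for (int i = 0; i < warmup; i++) runQuery();
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) runQuery();
        return (System.nanoTime() - start) / 1e6 / measured;
    }

    public static void main(String[] args) {
        System.out.printf("avg per run: %.3f ms%n", averageMillis(500, 500));
    }
}
```

For serious work a framework such as JMH handles forking, dead-code elimination, and statistical reporting, but even this simple shape avoids attributing JIT warm-up cost to the query.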
Updated by Michael Kay 5 months ago
With the new data file I'm getting 1.780ms in 12.4, and 1.266ms in 9.9, averaged over the last 500 of 1000 runs.
If I take document building out of the loop, the figures are 0.564ms for 12.4 and 0.518ms for 9.9.
If I just measure the document building, I get 1.084ms for 12.4 and 0.889ms for 9.9.
There is indeed a small regression here. I suspect this is primarily caused by the changes to Unicode string handling. We made changes that do a little bit more work during document building in order to reduce the memory occupancy of the constructed tree. Some queries will run faster as a result of these changes (notably those that make heavy use of regular expressions), but others get no benefit. I'm afraid a lot of our performance work is driven by individual cases that we study, and it's quite common for improvements to one use case to have adverse effects for others.
Updated by Michael Kay 5 months ago
Looking more closely at the group-by evaluation, I notice that the grouping key is implemented as a ZenoString, which is a new data structure designed to improve efficiency when long strings are incrementally extended. When such a string is used as a key, it probably makes sense to consolidate it into a simpler structure with a more efficient equals() method. I'll experiment with this.
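The idea of condensing a segmented string before using it as a key can be sketched as follows. This is not Saxon's actual ZenoString code, just a toy illustration of why a flat representation pays off for keys:

```java
import java.util.List;

public class SegmentedKey {
    // A string held as a list of segments would need to walk segment
    // boundaries on every equals()/hashCode() call. Flattening it once,
    // before it is used as a grouping key, makes every subsequent
    // comparison use String's cheap, well-optimised equality.
    static String consolidate(List<String> segments) {
        StringBuilder sb = new StringBuilder();
        for (String s : segments) sb.append(s);
        return sb.toString();
    }

    public static void main(String[] args) {
        String key = consolidate(List.of("\n  ", "CC13", "\n  ", "CC22", "\n"));
        System.out.println(key.equals("\n  CC13\n  CC22\n"));
    }
}
```

The one-time cost of flattening is amortised across the many comparisons a grouping key undergoes, which is the trade-off being exploited here.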
Updated by Michael Kay 5 months ago
Working now on the 12.x branch, the performance for the query only is coming in at 555µs. Consolidating the ZenoStrings brings it down to 448µs, which is certainly significant (and better than the 9.9 figure we started with).
Updated by Michael Kay 5 months ago
I thought it would be useful to examine why we are creating the ZenoString in the first place, and whether this is the right strategy.
The grouping key is the typed value of a c element, which takes the form
<c>
<c1>CC13</c1>
<c2>CC22</c2>
</c>
so the typed value is actually a concatenation of 5 text nodes, three of which are whitespace.
However, the ZenoString that we construct isn't actually formed directly by creating the virtual concatenation of these nodes. Instead (code in TinyParentNodeImpl.getStringValue()) we do the concatenation using a UnicodeBuilder and then wrap the result in a ZenoString. This seems to be a conflation of two separate strategies for reducing the overhead of string concatenation, and we don't need both: we should either build the result using a UnicodeBuilder or a ZenoString, but not both. I'm inclined to use a ZenoString, because this involves less character copying in the case where we're building a large string; that's unlikely to have much impact on this use case, where we're concatenating five very short strings.
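The fact that the typed value of the c element includes the whitespace-only text nodes can be reproduced with the JDK's own DOM API (plain javax.xml, not Saxon), since Node.getTextContent() performs the same descendant-text concatenation:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class TypedValueDemo {
    static String stringValue(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getTextContent() concatenates all descendant text nodes,
            // including the whitespace-only ones between <c1> and <c2>.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<c>\n  <c1>CC13</c1>\n  <c2>CC22</c2>\n</c>";
        // Five text nodes are concatenated: three whitespace, two data.
        System.out.println(stringValue(xml).equals("\n  CC13\n  CC22\n"));
    }
}
```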
(Results edited to correct a measurement error)
Using a ZenoString without first using a UnicodeBuilder here gives an execution time of 578µs.
Using a UnicodeBuilder without then using a ZenoString gives 460µs.
Using both gives 678µs.
I'll change it to use the UnicodeBuilder without wrapping in a ZenoString. Note that when the string becomes large, the UnicodeBuilder uses a ZenoString under the covers, so this should not lose scalability.
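The trade-off being weighed here, one pass of character copying into a contiguous buffer versus zero-copy segment chaining, can be sketched in miniature. This is purely illustrative and not Saxon's internals:

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatStrategies {
    // Strategy A (builder): copy every character once into one buffer.
    // Cheap for a handful of tiny strings, as in this grouping-key case.
    static String viaBuilder(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    // Strategy B (segment list): defer all copying by keeping references
    // to the pieces. Pays off when the pieces are large or the string is
    // extended incrementally; costs more when the value is compared often.
    static List<String> viaSegments(List<String> parts) {
        return new ArrayList<>(parts);
    }

    public static void main(String[] args) {
        List<String> parts = List.of("\n  ", "CC13", "\n  ", "CC22", "\n");
        System.out.println(viaBuilder(parts).length());
    }
}
```

Using both strategies at once, copying into a builder and then wrapping the result in a segmented structure, pays both costs and gains neither benefit, which matches the measurements above.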
Updated by Michael Kay 5 months ago
- Tracker changed from Support to Bug
- Category set to Performance
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch 12, trunk added
- Applies to branch 11 deleted
- Fix Committed on Branch 12, trunk added
- Platforms .NET added
In summary, I am making two changes (on the 12.x and development branches):
(a) In TinyParentNodeImpl.getStringValue(), when forming the string value of a parent node with multiple descendant text nodes, form the concatenated string value using a UnicodeBuilder without wrapping the result in a ZenoString.
(b) In TupleExpression.evaluateItem(), if the value of the grouping variable is a ZenoString, condense it to a simpler structure for more efficient equality comparison. (Having done (a), this is unnecessary for this use case, but it seems a good idea anyway.)
Updated by Michael Kay 5 months ago
We are in the final stages of preparing Saxon 12.5 for release (the SaxonJ build has been ready to ship for a while, but we've been fixing some final issues found while testing SaxonC). When it goes out, we will be very glad to receive your feedback on whether these changes have affected the bottom line performance.
Although the 11.x branch is still supported, I think it's unlikely we will be issuing any further patches unless the problems are extremely serious. A performance tweak wouldn't fall into that category.
Updated by O'Neil Delpratt 5 months ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.5 added
Bug fix applied in the Saxon 12.5 Maintenance release.