Bug #3052 (Closed)
Performance issue - differences between HE/PE/EE especially on .NET

Added by Michael Kay over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Category: Performance
Sprint/Milestone: -
Start date: 2016-11-30
Due date: -
% Done: 100%
Estimated time: -
Legacy ID: -
Applies to branch: 9.6, 9.7
Fix Committed on Branch: 9.6, 9.7
Fixed in Maintenance Release: 9.6.0.10, 9.7.0.14
Platforms: -

Description

Performance issue reported by Jirka Kosek on the Sourceforge Saxon help list.

Bottom-line execution times for a test stylesheet (with the -TP option enabled), all with Saxon 9.7.0.13, on the same Windows machine:

on Java (1.7.0_60):

HE: 15955ms

PE: 16267ms

EE: 18000ms (bytecode off)

on .NET (4.0):

HE: 29525ms

PE: 29351ms

EE: 61379ms (bytecode off)

So EE is imposing an overhead of around 13% on Java, which magnifies to over 100% on .NET.

The absolute ratio between HE on Java and HE on .NET is probably nothing to worry about here, although it would be nice to reduce it. What we need to understand is (a) why EE is taking longer on the Java platform, and (b) why this difference should be so much larger on .NET.

Actions #1

Updated by Michael Kay over 7 years ago

  • Description updated (diff)
Actions #2

Updated by Michael Kay over 7 years ago

Running with -TP shows no obvious anomalies. The extra costs on the slower platforms seem to be widely spread across different template rules and functions; the execution counts for different templates and functions appear to be the same.

Java profiling shows no obvious anomalies; the profiles for EE and HE are quite similar.

Running with -T produces a 2.6GB trace file; the files produced by EE and HE are almost exactly the same size (differing by a few KB), and a superficial look at the start of each file suggests that both products are following the same execution path.

Actions #3

Updated by O'Neil Delpratt over 7 years ago

I re-ran the tests on Java (with the options -t -repeat:10), all with Saxon 9.7.0.13, on the same Windows machine:

on Java (1.7.0_60):

HE: 9534ms

EE: 11284ms (bytecode off)

Similar ratio, so it does not seem to be a warm-up issue.
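
For reference, a minimal measurement harness along these lines can be written against the s9api interface. This is only a sketch of the repeat-run approach, not the actual test setup used here; the file names are placeholders.

    import net.sf.saxon.s9api.*;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class RepeatTimer {
        public static void main(String[] args) throws SaxonApiException {
            // true requests a licensed (PE/EE) configuration; false gives HE
            Processor processor = new Processor(true);
            XsltCompiler compiler = processor.newXsltCompiler();
            // placeholder file names -- substitute the real stylesheet and source document
            XsltExecutable executable = compiler.compile(new StreamSource(new File("test.xsl")));
            for (int run = 1; run <= 10; run++) {
                XsltTransformer transformer = executable.load();
                transformer.setSource(new StreamSource(new File("input.xml")));
                transformer.setDestination(processor.newSerializer(new File("out.xml")));
                long start = System.nanoTime();
                transformer.transform();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("Run " + run + ": " + elapsedMs + "ms");
            }
        }
    }

The later iterations of the loop exclude stylesheet compilation and most JIT warm-up, which is roughly the effect the -repeat option is intended to isolate.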

Actions #4

Updated by Michael Kay over 7 years ago

The differences between the two -T traces related almost entirely to the fact that calls to the function lnuk-fn:is-toc-style have been inlined in the EE case.

We note that there are calls to this function in the predicate of a match pattern.

Hypothesis: inlining a function used within a match pattern could lead to patterns containing local variables and therefore requiring a stack frame to be allocated (see SimpleMode.java, line 437):

    if (stackFrameSlotsNeeded > 0) context = makeNewContext(context);
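
To make the hypothesis concrete, here is an illustrative sketch (with invented names, not Saxon's actual classes) of why a pattern that needs stack frame slots pays an extra allocation for every candidate node it is matched against:

    // Illustrative sketch only; PatternSketch and Frame are invented names, not Saxon classes.
    // The point is the extra per-node allocation once a pattern acquires local variables.
    final class PatternSketch {
        int stackFrameSlotsNeeded;   // becomes > 0 once an inlined function body adds local variables

        boolean matches(Object node) {
            Frame frame = null;
            if (stackFrameSlotsNeeded > 0) {
                // allocated afresh for every node tested against the pattern
                frame = new Frame(new Object[stackFrameSlotsNeeded]);
            }
            return evaluatePredicate(node, frame);
        }

        private boolean evaluatePredicate(Object node, Frame frame) {
            // stand-in for evaluating the (inlined) predicate against the node
            return node != null;
        }

        private static final class Frame {
            final Object[] slots;
            Frame(Object[] slots) { this.slots = slots; }
        }
    }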

Attempted a new build that suppresses inlining of function calls within a pattern. No significant difference in measurements: EE time changes from 11284ms to 11539ms.

Now running with -opt:0. Time is now 12349ms.

We've also compared the -export trees produced on Java and .NET. As far as we can tell, they are identical (the components are ordered differently so it's hard to be sure, but the size of the SEF files is identical).

Actions #5

Updated by Michael Kay over 7 years ago

We've been comparing the -explain output on HE and EE. We augmented this output with the static type of every expression, to see whether there are any differences caused by type inference. (EE has been known in the past to be slower because HE knows the input data is untyped and therefore has more type information available than EE does, particularly where the stylesheet author provides very little type information.) We couldn't find any problems in this area: the type information, and a sample of the arithmetic and comparison operations that typically benefit from type information, looked the same in both cases.

One difference we noted was that EE is doing general comparisons differently in some cases: it's using a gcEE operation where HE uses a gc operation. It's possible that the gcEE operation is more efficient for comparing long sequences, but less efficient for comparing short sequences - especially in cases where the inferred cardinality is "many" but the actual cardinality is one, which often happens. Next step is to explore whether eliminating this optimization in EE affects the numbers.

Actions #6

Updated by Michael Kay over 7 years ago

Switching off the gcEE optimization made performance slightly worse, so this doesn't appear to be the problem.

However, I'm still unhappy about gcEE. We're using it for 1:many comparisons as well as many:many, which doesn't feel right because I don't think the approach used gives any benefit for the 1:many case, and is likely to impose an overhead. In fact for a stylesheet like this one, where "as" attributes are notable by their absence, we are using gcEE even for things like test="$debug = 'true'" (where the cardinality of $debug is statically unknown, but is very likely to be a singleton).

An algorithm that is dynamically adaptive to the size of the two sequences would seem in many ways a better bet than a compile-time decision. In a sense gcEE is actually trying to be dynamically adaptive, but I still think it is imposing overhead in the common 1:1 case.
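
As a sketch of what such a size-adaptive comparison might look like (hypothetical code, not Saxon's gc/gcEE implementation): take a cheap pairwise path when either operand turns out to be empty or a singleton, and only build a hash index when both sides are genuinely "many".

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of a size-adaptive general comparison over string values;
    // not Saxon's gc or gcEE implementation.
    final class AdaptiveGeneralComparison {
        static boolean equalsAny(List<String> left, List<String> right) {
            // Cheap path: nothing to compare, or one side is a singleton, so a direct
            // pairwise comparison avoids the cost of building a hash index.
            if (left.isEmpty() || right.isEmpty()) {
                return false;
            }
            if (left.size() == 1 || right.size() == 1) {
                for (String l : left) {
                    for (String r : right) {
                        if (l.equals(r)) return true;
                    }
                }
                return false;
            }
            // Many-to-many: index the smaller sequence and probe with the larger one.
            List<String> smaller = left.size() <= right.size() ? left : right;
            List<String> larger = smaller == left ? right : left;
            Set<String> index = new HashSet<>(smaller);
            for (String item : larger) {
                if (index.contains(item)) return true;
            }
            return false;
        }
    }

On a test like $debug = 'true', both operands are singletons at run time, so a scheme like this takes the cheap path regardless of the statically inferred cardinality.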

But this is a distraction from the problem we are trying to solve.

Our next step is to attempt to determine whether the difference between HE and EE has anything to do with optimization. We're going to run both HE and EE with optimization switched off, and then examine the expression trees. In principle they should be the same; if they aren't, it's either because EE is doing the type analysis differently, or because some optimization is still happening despite the configuration switch. If we find optimizations happening despite -opt:0, we should fix that. Once we're satisfied that no optimizations are happening, then (a) if EE and HE performance are still different, we are looking for a cause other than optimization, and (b) if they are now the same, we have a case where EE is doing a (so-called) optimization that is actually counter-productive with this data.

Actions #7

Updated by O'Neil Delpratt over 7 years ago

I ran the experiments given in comment #3 again: Execution on Java (with the options -t -repeat:10), all with Saxon 9.7.0.13, on the same Windows machine:

on Java (1.7.0_60):

HE: 9736ms

EE: 9811ms (bytecode off)

This is different from yesterday's results, but the figures are too close to draw any real conclusions on Java. Attention now switches to .NET.

Actions #8

Updated by O'Neil Delpratt over 7 years ago

  • Status changed from In Progress to Resolved
  • Applies to branch 9.6, 9.7 added
  • Fix Committed on Branch 9.6, 9.7 added

The Saxon-EE build on .NET incorrectly had debug enabled, and fixing this resolves the performance anomaly. Fixed on the 9.6 and 9.7 branches.

Actions #9

Updated by Michael Kay over 7 years ago

A note about the numbers on the Java platform:

In our first experiments, Saxon-EE was taking 10-15% longer than Saxon-HE.

The following day, running the same tests showed no difference.

I suspect the anomaly is something to do with memory pressure: the Saxon-EE JAR file is bigger and contains a lot more classes, so it is going to take longer to load, and depending on the state of the machine at the time this may or may not show up in the bottom line numbers.

The stylesheet we were testing is of the kind where it's difficult for Saxon-EE to find any optimizations. The code is already written to use keys where appropriate. It's not schema-aware, and there are no "as" attributes to give the kind of type information that the optimizer can take advantage of. It consists largely of very simple template rules with simple match patterns, so bytecode generation doesn't achieve very much. So there's no real expectation here that Saxon-EE should perform better -- only that it should perform at least as well.

Actions #10

Updated by Jirka Kosek over 7 years ago

I can confirm that before "manual" optimization (adding keys; removing //, preceding and following axes where possible), Saxon-EE (on Java) outperformed both HE and PE. I will try to add more type information to see if we can gain some benefit.

I have also tried using memoization on functions that are called many times and contribute significantly to the overall running time, but there was no difference -- could it be that Saxon-EE can decide to inline such functions even if there is a memoization hint?

Actions #11

Updated by Michael Kay over 7 years ago

"could it be that Saxon-EE can decide to inline such functions even if there is a memoization hint?"

Yes, I think you could be right. I'll add a new issue.

Actions #12

Updated by O'Neil Delpratt over 7 years ago

  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.6.0.10, 9.7.0.14 added

Bug fix applied in the Saxon 9.6.0.10 maintenance release.

Bug fix applied in the Saxon 9.7.0.14 maintenance release.

Actions #13

Updated by O'Neil Delpratt over 7 years ago

  • Status changed from Resolved to Closed
