Project

Profile

Help

Bug #6570

open

Concurrent Performance Issues SaxonEE10-9N (dotnet)

Added by Arif Yayalar 2 days ago. Updated about 20 hours ago.

Status:
New
Priority:
High
Assignee:
Category:
.NET API
Sprint/Milestone:
-
Start date:
2024-10-24
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:
.NET

Description

Hello there,

We're using SaxonEE10-9N in our dotnet application to transform for XSLT transformations. We've noticed under a light load (e.g. 10 concurrent requests) the performance degrades drastically. For example, in my local machine, I run 10 concurrent requests, it takes about 30 sec or more to complete each transform. If I increase the number to say 25, I see transform times go up to 100 seconds per transform.

Please see attach dotnet example I've attached to repro this issue.


Files

Screenshot 2024-10-24 132551.png (65.9 KB) Screenshot 2024-10-24 132551.png Arif Yayalar, 2024-10-24 19:27
Actions #1

Updated by Michael Kay 2 days ago

  • File deleted (saxon-test.zip)
Actions #2

Updated by Michael Kay 2 days ago

Thanks for reporting it.

I've deleted the attached ZIP file because it contained a Saxon licence key.

What level of CPU and memory utilisation are you seeing? If the CPU is fully utilised with 10 threads, then (unless you have more than 10 CPUs) doubling the number of threads means that each thread will take twice as long to complete, which would account for your observations.

I guess this is a synthetic workload, but in real life it's common to run the same transformation on many source documents, and in that situation you can get a big performance benefit by only compiling the stylesheet once.

If your monitoring tools show any lock contention within Saxon, then it would be useful to know about that.

Saxon10 on .NET is still supported, because .NET Framework 4.8 is still supported. However, it is becoming increasingly unlikely that we will produce any further maintenance releases for Saxon10, and we wouldn't do so for a performance issue unless it makes the product completely unusable.

Actions #3

Updated by Arif Yayalar 2 days ago

Hi Michael,

Thanks for replying back and deleting the file. Unfortunately we cannot cache/compile once. We have quite of few of them (in the 1000s). I don't know if you had a chance to run the solution locally but just 10 concurrent calls taking 30+ seconds each. As far as CPU utilization is around %10 but I see spikes up to %50-60. Memory utilization it starts around 500mb but consistently climbs up. I see numbers around 3.8 GB five minutes into the test. A lot of GC activity as well (see attached screen shot)

As far as the version of the software, isn't SaxonEE10-9N latest? If not could you point me where I can download the latest?

I am also curios if we're using the library efficiently. Do you see any issues in the code base we should be aware of?

Thank you,

Arif

Actions #4

Updated by Michael Kay 2 days ago

Saxon10.9 is the latest version that runs on .NET Framework, but for .NET Core (or simply .NET) you should be using SaxonCS 12.6. Both are built from the Java source code, but in very different ways, so performance characteristics might be different. I don't know if your license will cover the latest version.

I haven't had much chance to look at your code. The XSLT code is Altova-generated code so it's pretty verbose and not easy to follow the logic. My immediate reaction would be that 30 seconds seems very slow. I tried it with SaxonJ from the command line (I don't have a Windows machine handy) and it took 977ms. So there's clearly something going wrong somewhere.

What timing do you get running a single transformation from the command line?

I looked at your C# driver code and that looks fine.

Actions #5

Updated by Arif Yayalar 2 days ago

just a single tranformation takes between 3-4 seconds

Actions #6

Updated by Michael Kay 2 days ago

OK, that's consistent with my Java figure, because the .NET product is often around a factor of 3-4 slower than Java (for reasons we have never fully understood). If it's taking 3 seconds in a single thread and 30 seconds with 10 threads that suggests that there's no real multithreading going on at all. What does the Console log look like?

Actions #7

Updated by Arif Yayalar 2 days ago

We run the requests in Parallel.

Parallel.For(0, 10, _ =>
{
    saxon.RunTransformerEntEd();
});

From the console output, it looks like it is not running in parallel. Looks like it is executing serially. I wonder if there is a locking in Saxon API prevents these calls to execute in parallel.

Console output: Compile time: 1193 Compile time: 1193 Compile time: 1193 Compile time: 1193 Compile time: 1185 Compile time: 1193 Compile time: 1193 Compile time: 1193 Compile time: 1193 Transform time: 35974 Transform time: 38518 Transform time: 38747 Transform time: 38758 Transform time: 38772 Transform time: 38798 Transform time: 38864 Transform time: 38865 Transform time: 38872 Transform time: 38878

Actions #8

Updated by Michael Kay 2 days ago

Well, there's clearly some parallelism there, because it's running all the compiles, then all the transformations. But it does look rather like some lock is holding things back.

Actions #9

Updated by Arif Yayalar 1 day ago

Is there anything we can do the increase the performance?

Actions #10

Updated by Michael Kay 1 day ago

I haven't identified anything yet, but will work on it.

Actions #11

Updated by Michael Kay 1 day ago

I've now run your test case against SaxonCS 13.0 (under development) on .NET 6.0 on Mac. I know that's not your target environment but it's easiest for me to investigate on my current platform rather than setting up something new.

I'm getting the timings

Compile time: 395
Compile time: 395
Compile time: 395
Compile time: 395
Compile time: 396
Compile time: 394
Compile time: 395
Compile time: 395
Compile time: 395
Compile time: 396
Transform time: 19375
Transform time: 19386
Transform time: 19400
Transform time: 19405
Transform time: 19414
Transform time: 19428
Transform time: 19429
Transform time: 19432
Transform time: 19435
Transform time: 19475
Completed...

If I change the number of threads to 4, I get

Compile time: 277
Compile time: 277
Compile time: 277
Compile time: 277
Transform time: 5643
Transform time: 5643
Transform time: 5643
Transform time: 5645

and with 1 thread I get

Compile time: 236
Transform time: 2403
Actions #12

Updated by Michael Kay 1 day ago

I added a trace at the point where the transform starts, and I'm getting

Compile time for : 4 = 289
Compile time for : 1 = 289
Compile time for : 0 = 289
Compile time for : 3 = 289
Compile time for : 2 = 289
Start transform : 0
Start transform : 2
Start transform : 3
Start transform : 4
Start transform : 1
Transform time for : 2 = 7776
Transform time for : 3 = 7776
Transform time for : 4 = 7776
Transform time for : 1 = 7777
Transform time for : 0 = 7779

So it's pretty clear that they are running concurrently rather than sequentially.

I'll come back to this after the weekend.

Actions #13

Updated by Michael Kay 1 day ago

Sorry, I wasn't changing the number of threads, I was changing the total number of transformations performed. What we're seeing essentially is that running them in parallel takes more or less the same total amount of time as running them sequentially.

Actions #14

Updated by Arif Yayalar 1 day ago

if I run them serially, I see better results. However, this is not ideal.

Compile time: 67
Transform time: 3535
Compile time: 61
Transform time: 1964
Compile time: 51
Transform time: 1980
Compile time: 49
Transform time: 2436
Compile time: 33
Transform time: 2268
Compile time: 53
Transform time: 1890
Compile time: 46
Transform time: 2434
Compile time: 58
Transform time: 1909
Compile time: 51
Transform time: 1851
Compile time: 51
Transform time: 1842
Compile time: 47
Transform time: 1949
Compile time: 47
Transform time: 2310
Compile time: 49
Transform time: 2451
Compile time: 48
Transform time: 1974
Compile time: 45
Actions #15

Updated by Michael Kay 1 day ago

What are you trying to achieve? Are you trying to optimize throughput or latency?

Actions #16

Updated by Arif Yayalar 1 day ago

Both. We get 100s of requests per sec that we need to keep the xslt compile/transform as efficient as possible.

Actions #17

Updated by Michael Kay about 23 hours ago

Well, it's going to be a trade-off, isn't it?

I've tried running 20 transformations with different number of threads, and the figures (threads / av. transform time / total time / throughput):

2 / 2998 /31943 /  0.62
3 / 4218 / 31025 / 0.65
4 / 5347 / 28704 / 0.69
5 / 7763 / 32240 / 0.62
6 / 9948 / 34707 / 0.57

10 / 20636 / 42734 / 0.46

I'm running this on a machine with 4 cores, so as you would expect for a CPU-dominated workload, the throughput is highest with a concurrency of 4, but the latency is smaller with a lower number of threads. There is no benefit in having more threads than you have processors available, This is all very much what one would expect.

I don't know how many threads Parallel.For allocates by default. I'm sure there are more sophisticated ways of tuning the scheduling of parallel workloads on .NET, but the basic pattern is likely to remain the same.

Actions #18

Updated by Michael Kay about 22 hours ago

I thought it might be interesting to look at the actual XSLT code, so I profiled it (on Java) using -TP:profile.html All this reveals is that nearly all the time is spent in the xsl:template match="/" root template, but that's hardly surprising since that accounts for 99% of the stylesheet. With monolithic code like this, the profiling isn't much use.

It's 10,000 lines of very turgid generated code, so no real prospect of examining it by hand.

A Java profile reveals nothing untoward.

I now there have been problems with performance with very large templates/functions like this in the past, but IIRC they were compile-time problems rather than run-time, and the compile time for the stylesheet looks OK.

Actions #19

Updated by Michael Kay about 21 hours ago

OK, let's stand back.

You say you need to handle hundreds of requests a second, and each request is taking two or three seconds of CPU to execute, when run in a single thread. That's only going to be viable if you distribute the workload across hundreds of CPUs.

I strongly suspect the XSLT code could be made to go much faster. On the face of it it looks like a fairly uncomplicated transformation, Because it is generated code, it's doing a lot of unnecessary work, for example there are far more local variables than are needed. Saxon gives up on trying to optimize a template/function as big as this, because the cost of optimization increases exponentially with the size of the function or template. I suspect that means it's not even going to do trivial optimizations like replacing the expression xs:double(xs:decimal('2')) (line 11742) by the constant 2.0e0. But rewriting it by hand is a fairly daunting prospect (especially as you suggest this is just one of hundreds of different stylesheets), However, I would think it's feasible for a hand-written stylesheet to do the job in under 100ms.

Perhaps, because it's generated code, it's regular enough that you could consider attempting some automated rewrites? I've no idea what kind of improvement you might expect but it might be worth experimenting.

The only alternative seems to be to buy lots of hardware (or cloud capacity) and put in some load-sharing infrastructure to distribute the workload.

Running on Java rather than .NET might give you a 3-4 times improvement, which would be a good start.

Actions #20

Updated by Michael Kay about 20 hours ago

Running on Java rather than .NET might give you a 3-4 times improvement, which would be a good start.

In fact on Java it's running in about 117ms, which is a much more reasonable time for this kind of transformation.

Please register to edit this issue

Also available in: Atom PDF