Project

Profile

Help

Race condition / dead lock?

Added by Anonymous about 13 years ago

Legacy ID: #9088168 Legacy Poster: Tobias Freudenreich (sirrichie)

Hi, we are using SAXON 9.3.0.4 to transform thousands of messages per second on a server with 8 cores and enabled Hyper Threading (so a total of 16 cores). We observe the following behavior: At some point, the server's CPU utilization jumps up to 100 on all cores and operation virtually stops. Sometimes however, this does not happen at all (at least not for 30 minutes). Doing some investigation on this, we found that this is more likely to happen and happens earlier with higher load (more xml transformations per second). We have two Templates instances and share these among threads. We create a new Transformer for each transformation. We believe that this might be a glitch, though suggestions on how to tackle this are always welcome. Thanks, Tobias


Replies (14)

Please register to reply

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9088280 Legacy Poster: Michael Kay (mhkay)

(1) Has this workload been run successfully on an earlier release? Is there any evidence that the problem is something new in Saxon 9.3? (2) My instinct with concurrency problems under high throughput is to suspect the NamePool. It would be useful to know whether the number of names in the NamePool is stable, or whether it is still increasing at the time of the failure. Some activities are much more intensive in their NamePool use than others, for example using TinyTrees causes very few synchronisation requests on the NamePool, whereas using DOM trees can impose a very heavy load. So it's important to understand the workload. If you can get some instrumentation of NamePool activity, that will certainly be helpful. Synchronization however doesn't really seem consistent with your observation that all cores are busy; it would normally lead to them all waiting. But it could still be NamePool related, because as the NamePool becomes more full, its performance declines.

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9088524 Legacy Poster: Tobias Freudenreich (sirrichie)

Hi again and thanks for the quick response, (1) No. We are "new" to Saxon and 9.3 is the first version of Saxon we ever use. We tried 9.2 as well, with a similar result (except that only 8 of the 16 cores were "busy" (see below)) (2) What exactly do you mean by NamePool? We are not deeply involved with Saxon itself, as we are just using the XSLT/XPath part of it (using Java's official API). So basically, we use Saxon to transform rather short xml documents (~50 lines) into other xml documents (~50 lines). We observed our server with the unix tool htop (http://en.wikipedia.org/wiki/Htop) where all cores were at 100%. However, you might be correct that the cores were actually waiting, rather than being busy, as the temperature did not increase (even though being run for a solid 45 minutes at 100%).

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9088773 Legacy Poster: Michael Kay (mhkay)

Well, I'm impressed that you've got as far as you have when you are new to the product and have no knowledge of its internals! I'll advise you upfront that when it comes to helping maximize performance, we're prepared to give advice and information about how Saxon behaves, but collecting the workload information needed to solve the problem is your responsibility. Even when people have bought the Enterprise Edition, this comes into the area of chargeable consultancy. Saxon turns all names used in queries and stylesheets, and all names used in instance documents, into integer codes. The mapping between codes and QNames is held in the NamePool, which belongs to the Configuration, which belongs to the TrasnformerFactory in the case where you are using JAXP interfaces. This is shared across all threads. There is some caching to reduce contention - but this has less impact when transforming small documents than when transforming large ones. Try writing a thread that uses the same TransformerFactory as the live workload, and once a minute or so examine the NamePool contents. To do this, navigate from the net.sf.saxon.TransformerFactory to the Configuration to the NamePool, and call its statistics() method. The output goes to System.err. This will show if the number of names is growing unreasonably. Check to see whether you are using a DOMSource unnecessarily. Using the DOM rather than Saxon's optimized TinyTree is the cause of a great number of Saxon performance problems. Michael Kay Saxonica

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9088806 Legacy Poster: Tobias Freudenreich (sirrichie)

Hi, we are not exactly looking into optimizing the performance, we are trying to get Saxon to work reliably :) I will look at the NamePool tomorrow or throughout the weekend to see if the number of names grows considerably. But getting back to the problem: the real problem is, that CPU utilization is at about 30% constantly and then suddenly jumps to 100%, locking out any other operations (we are sure that Saxon causes this, as commenting out the transform call on the transformer causes the problem to disappear). Again, this "suddenly" is no fixed point in time, sometimes Saxon successfully processes over 200000 xml documents, sometimes it "stalls" after about 5000. We are using StreamSource exclusively, so neither a DOMSource nor a TinyTree.

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9105940 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

At some point, the server's CPU utilization jumps up to 100 on all cores and operation virtually stops. > Sometimes however, this does not happen at all (at least not for 30 minutes). This, in fact, can be a manifestation of GC in progress. Please run GC profiler to verify this. -- Vladimir Nesterovsky http://www.nesterovsky-bros.com/

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9110870 Legacy Poster: Tobias Freudenreich (sirrichie)

I did some profiling, but it is not the GC. In fact, CPU utilization stays at 100% for at least 45 minutes. The profiling also shows that the GC runs several times during operation, showing slight peaks in utilization, but with no major impact. We still suspect that there's a deadlock or livelock somewhere within Saxon.

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111239 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

CPU utilization stays at 100% for at least 45 minutes. This does not look as a lock, neither as infinitive loop. My guess is the problem is either not in Saxon, or in a stylesheet itself (it's not too hard to build such slow xslt). > We still suspect that there's a deadlock or livelock somewhere within Saxon. You can add logging (e.g. before and after transformation), xsl:message(es). This will help you to pin point the scope of problem. -- Vladimir Nesterovsky http://www.nesterovsky-bros.com/

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111263 Legacy Poster: Tobias Freudenreich (sirrichie)

We already added logging and once CPU utilization jumps to 100%, it stops. If it was a slow stylesheet, then I would expect the same behavior every time, but as I already said, the point in time at which operation stops, is different for each run. Plus the same xml-data is transformed thousands of times. Another indicator is that if we synchronize (with Java's synchronize statement) the calls to Saxon, the problem disappears (but we have extremely poor performance). The server's temperature does not increase significantly when it runs at 100% either, which we believe is due to NOPs executed in the waiting cycle. Thus, we believe that it might be a bug in Saxon and wanted to help sort it out.

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111357 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Well, if it's the lock then you can add a guard to interrupt execution and print stack trace. This stack will help to investigate the problem. That's an example of such a guard: private static void lockTest() throws Exception { final Thread thread = Thread.currentThread(); Future<?> feature = executor.schedule( new Runnable() { @Override public void run() { thread.interrupt(); } }, 10, // This is operation timeout. TimeUnit.SECONDS); try { doWork(); } catch(InterruptedException e) { // Log it. e.printStackTrace(); } feature.cancel(true); } private static void doWork() throws Exception { // A logic with lock. } private static final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1); -- Vladimir Nesterovsky http://www.nesterovsky-bros.com/

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111388 Legacy Poster: Michael Kay (mhkay)

Well, if it's the lock then you can add a guard to interrupt execution and print stack trace. Thanks for that suggestion, Vladimir, which sounds very constructive. I knew it cound be done because I've seen other users do something similar, but I wasn't sure exactly how it was done. It certainly provides a possibility of getting some useful diagnostics, which is what we're missing so far on this case.

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111470 Legacy Poster: Tobias Freudenreich (sirrichie)

Hi, thanks for the replies. I tried your code Vladimir. To make sure I use it correctly, I first tried a very small test-method and basically the doWork method had a while (true) loop. I also put a system out to the Runnable, making sure to see if it gets called. Here is what happens: After 5 seconds (I reduced the timeout to 5 seconds) the run method gets called, but no stack trace follows and doWork() continues executing. Guess I am doing something wrong there?

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111487 Legacy Poster: Tobias Freudenreich (sirrichie)

Sorry for double-posting, I forgot to add my test code: import java.util.Date; import java.util.concurrent.Future; import java.util.concurrent.ScheduledThreadPoolExecutor; import java.util.concurrent.TimeUnit; public class Test { public static void lockTest() throws Exception { final Thread thread = Thread.currentThread(); Future<?> feature = executor.schedule(new Runnable() { @Override public void run() { System.out.println("interrupting!"); thread.interrupt(); } }, 5, // This is operation timeout. TimeUnit.SECONDS); try { doWork(); } catch (InterruptedException e) { // Log it. e.printStackTrace(); } feature.cancel(true); } private static void doWork() throws Exception { Date begin = new Date(); int i = 0; while (true) { i++; if (i % 100000 == 0) { Date now = new Date(); System.out.println(now.getTime() - begin.getTime()); } } } private static final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor( 1); public static void main(String[] args) throws Exception { lockTest(); } }

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111545 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

According to the help: "If this thread is blocked in an invocation of the wait(), wait(long), or wait(long, int) methods of the Object class, or of the join(), join(long), join(long, int), sleep(long), or sleep(long, int), methods of this class, then its interrupt status will be cleared and it will receive an InterruptedException." If wait() is somehow involved than you will recieve an exception. In case of wild loop Thread just marks some flag. Anyway, you can get a stack snapshot without interruption: @Override public void run() { thread.interrupt(); for(StackTraceElement frame: thread.getStackTrace()) { System.out.println("\tat " + frame); } } -- Vladimir Nesterovsky http://www.nesterovsky-bros.com/

RE: Race condition / dead lock? - Added by Anonymous about 13 years ago

Legacy ID: #9111692 Legacy Poster: Tobias Freudenreich (sirrichie)

Hi again and thanks for the help. I used Vladimirs guard to see the thread is waiting, which doesn't seem to be the case, because no stacktrace is printed out (even though the symptoms described above appear). I re-verified that commenting out the call to transform on the Transformer instance, causes the symptoms not to appear. So I used the executor to just print out the stack trace after 6 minutes (by which the symptoms have usually shown). However, nothing is printed out. I really appreciate your effort to sort this out and we would be glad to help fix this potential bug (or try to find the cause for the strange behavior). You might be right that it might not be a deadlock, but something is causing this behavior. @mhkay: Sadly, we cannot provide too many details about what we are doing in public, because it is a research project. However, we would be glad to confer via email (or some form of more private communication), which would allow us to be more detailed about what we are doing - if you're interested.

    (1-14/14)

    Please register to reply