Project

Profile

Help

Bug #1844

Out-of-memory due to non-synchronisation of TinyTree statistics

Added by Michael Kay about 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2013-07-16
Due date:
% Done:

100%

Spent time:
-
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Found in version:
9.5
Fixed in version:
9.5.1.2

Description

Wolfgang Hoschek reports

At company X sometimes (perhaps after a week or so of heavy duty data crunching) we'd see a seemingly random OutOfMemoryError with Saxon in production. After studying the heap dumps I suspected a race condition in TinyTree memory allocation. Examination of the source code confirmed the problem.

The underlying issue is that TinyTree.updateStatistics() updates these 64 bit global vars without any synchronisation:

private static int treesCreated = 5;
private static double averageNodes = 4000.0;
private static double averageAttributes = 100.0;
private static double averageNamespaces = 20.0;
private static double averageCharacters = 4000.0;

If multiple threads are running (unrelated) TinyTree.updateStatistics() at the same time it can happen that averageNodes and cousins behave somewhat like a random number. Java does not guarantee atomic updates to 64 bit variables, only to 32 bit vars. averageNodes is a double, i.e. has 64 bit. So every once in a while a parallel thread sees the "old" low 32 bits and the "new" high 32 bits, or the other way round, and the resulting view of averageNodes is correspondingly off. This way averageNodes might be "visible" as a huge number, which causes "nodes" to be huge number in the TinyTree constructor, which in turn causes an OOM (next = new int[nodes]).

A workaround is to make TinyTree.updateStatistics() synchronized on the class, or to use AtomicDouble, or similar. Back then we applied that fix and our production OOMs disappeared completely.

I'm considering using Saxon again at company Y, but I'm concerned about the race condition leading to random blow ups in production. I looked at the source code of 8.5.1.1 [9.5.1.1? - MK] and the same problem is still present there in there. Any chance this could be fixed for good, e.g. with a bit of synchronization or AtomicDouble or similar?

History

#1 Updated by Michael Kay about 3 years ago

Note also, we could try and be smarter about the way the memory allocation learns from previous experience. We could keep separate stats for different places in the code that cause tree creation, for example a call on doc() or construction of a temporary variable. The current system doesn't allow for the fact that there may be more than one pattern of allocation, with the average values being a poor choice for any of these patterns.

#2 Updated by Wolfgang Hoschek about 3 years ago

Keeping track of the stats per "Processor" instance might be nice, although for my particular use case that doesn't seem necessary.

Here is what I had in mind around using AtomicDouble:

private static final AtomicDouble averageNodes = new AtomicDouble();
double avg = ((averageNodes.get() * n0) + numberOfNodes) / n1;
if (avg < 10.0) {
avg = 10.0;
}
averageNodes.set(avg);

Same pattern would apply for the other static double vars in updateStatistics(). There are probably also many other ways of handling this.

#3 Updated by Michael Kay about 3 years ago

I've added a simple synchronization on the class for 9.5. There doesn't seem to be an AtomicDouble in the JDK, and I'm fed up with the jobsworth lawyers who see it as their role to complain whenever I use some new third-party component. For 9.6 I'm looking at keeping the stats partitioned for different kinds of document (source docs, temporary trees, etc).

#4 Updated by Michael Kay about 3 years ago

  • Status changed from In Progress to Resolved

For 9.6 I've introduced a Statistics object to hold the stats; in the cases where the caller supplies estimated sizes it now supplies its own Statistics object; in other cases one of a number of static Statistics objects is used, e.g. one for source documents and one for temporary trees. The stats are updated with synchronisation on the Statistics object. They are still held in static but we could move to allocating them per Configuration.

#5 Updated by O'Neil Delpratt almost 3 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in version set to 9.5.1.2

Bug fix applied in the Saxon maintenance release 9.5.1.2

Also available in: Atom PDF