Bug #1844

Out-of-memory due to non-synchronisation of TinyTree statistics

Added by Michael Kay almost 11 years ago. Updated over 10 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Performance
Sprint/Milestone: -
Start date: 2013-07-16
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:
Description

Wolfgang Hoschek reports:

At company X we would sometimes (perhaps after a week or so of heavy-duty data crunching) see a seemingly random OutOfMemoryError with Saxon in production. After studying the heap dumps I suspected a race condition in TinyTree memory allocation, and examination of the source code confirmed the problem.

The underlying issue is that TinyTree.updateStatistics() updates these static fields (four of which are 64-bit doubles) without any synchronisation:

private static int treesCreated = 5;
private static double averageNodes = 4000.0;
private static double averageAttributes = 100.0;
private static double averageNamespaces = 20.0;
private static double averageCharacters = 4000.0;

If multiple threads run (unrelated) TinyTree.updateStatistics() at the same time, averageNodes and its cousins can behave somewhat like random numbers. Java guarantees atomic reads and writes only for 32-bit variables; for non-volatile 64-bit variables such as long and double it does not (JLS §17.7). averageNodes is a double, i.e. 64 bits wide. So every once in a while a parallel thread sees the "old" low 32 bits together with the "new" high 32 bits, or the other way round, and its resulting view of averageNodes is correspondingly off. In this way averageNodes may become "visible" as a huge number, which makes nodes a huge number in the TinyTree constructor, which in turn causes an OOM (next = new int[nodes]).
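To make the failure mode concrete, here is a small self-contained sketch (illustration only, not Saxon code; the two values are arbitrary) that deterministically reconstructs the kind of torn bit pattern a racing reader is permitted to observe under JLS §17.7:

// TornDoubleDemo.java -- illustration only, not Saxon code.
// JLS 17.7 allows a non-volatile long/double read to return 32 bits
// from one write and 32 bits from another. This rebuilds such a
// "torn" bit pattern by hand.
public class TornDoubleDemo {
    public static void main(String[] args) {
        double before = 1.0 / 3.0; // value the field held before the update
        double after = 1.0e9;      // value a writer thread is storing

        long beforeBits = Double.doubleToRawLongBits(before);
        long afterBits = Double.doubleToRawLongBits(after);

        // A reader that sees the new high word combined with the old low word:
        long torn = (afterBits & 0xFFFFFFFF00000000L)
                  | (beforeBits & 0x00000000FFFFFFFFL);

        System.out.println("before: " + before);
        System.out.println("after:  " + after);
        System.out.println("torn:   " + Double.longBitsToDouble(torn));
    }
}

The "torn" line prints a double that neither thread ever wrote; a statistic corrupted this way is what the TinyTree constructor then uses to size its arrays.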

A workaround is to make TinyTree.updateStatistics() synchronized on the class, to use an AtomicDouble, or similar. Back then we applied that fix and our production OOMs disappeared completely.
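For illustration, here is a minimal sketch of the synchronized workaround (hypothetical class and method shapes, and a placeholder averaging formula; not Saxon's actual code). Note that AtomicDouble is not part of the JDK but comes from libraries such as Guava; a JDK-only alternative is an AtomicLong holding Double.doubleToLongBits(value).

// Sketch of the suggested workaround -- illustrative only.
// A static synchronized method locks the Class object, so the whole
// read-modify-write of the shared statistics becomes atomic and the
// full 64-bit values are published safely to other threads.
public class TinyTreeStatistics {
    private static int treesCreated = 5;
    private static double averageNodes = 4000.0;
    // ... averageAttributes, averageNamespaces, averageCharacters likewise

    public static synchronized void updateStatistics(int nodes) {
        // Placeholder running-average formula; the point is that the
        // entire update happens under one lock.
        averageNodes = (averageNodes * treesCreated + nodes) / (treesCreated + 1);
        treesCreated++;
    }

    // Readers (e.g. the TinyTree constructor sizing its arrays) must
    // take the same lock, or they can still see a torn 64-bit value.
    public static synchronized double getAverageNodes() {
        return averageNodes;
    }
}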

I'm considering using Saxon again at company Y, but I'm concerned about this race condition leading to random blow-ups in production. I looked at the source code of 8.5.1.1 [9.5.1.1? - MK] and the same problem is still present there. Any chance this could be fixed for good, e.g. with a bit of synchronization, an AtomicDouble, or similar?

