Bug #4507
closedTinyTree used for assertion checking during schema validation includes excessive NamespaceBinding objects
Added by Michael Kay over 4 years ago. Updated about 4 years ago.
100%
Description
Reported by private email: a validation run, which appears to have an assertion defined on the root element node, is consuming excessive memory. Heap analysis shows a large number of redundant NamespaceBinding objects, accounting for a large proportion of the tree size.
This is Saxon 9.9 on .NET.
The data is confidential medical data so we don't have a repro.
Updated by Michael Kay over 4 years ago
I've been doing some experiments.
First: validating a document of the form
<x:root xmlns:x="http://xxxxxxxxxxx.com/ns">
<a/>
(many repetitions)
<a/>
</x:root>
with a single assertion defined on the root element.
Under 10.0, the tree used for assertion checking contains a single NamespaceMap
object, as one might expect.
With 9.9, the tree contains 4 NamespaceBinding objects, which is rather wierd. Two are for the "x" namespace; two for the "xml" namespace. All four are associated with the root element. This happens because the validator is issuing the namespace events for the root element twice: once from ValidationStack#263, when the root element is constructed, and once from StartTagBuffer.startContent()
. This is rather peculiar, but it doesn't seem to cause any problems, and it doesn't account for the thousands of namespace binding objects observed in the heap dump. In particular, the number of NamespaceBinding
objects doesn't increase with the number of <a/>
elements.
I then tried a document of the form:
<root xmlns="http://xxxxxxxxxxx.com/ns">
<a xmlns:x="http://xxxxxxxxxxx.com/ns"/>
(repeated)
<a xmlns:x="http://xxxxxxxxxxx.com/ns"/>
</root>
In 10.0, the resulting TinyTree contained just two NamespaceMap
objects: one holding the single binding xmlns="http://xxxxxxxxxxx.com/ns"
, the other holding two bindings: xmlns="http://xxxxxxxxxxx.com/ns" xmlns:x="http://xxxxxxxxxxx.com/ns"
. The code has successfully shared the NamespaceMap
for different elements with the same in-scope namespaces.
In 9.9, the tree holds 2+n NamespaceBinding
objects, where n
is the number of <a/>
elements. There is no attempt to common these up.
Conclusion so far: the 10.0 design handles this case a lot better than 9.9; but 9.9 is (with a minor glitch, which doesn't seem relevant to the problem) working as designed. The large number of NamespaceBinding
objects observed in the heap dump might simply be due to the presence of a large number of namespace declarations in the source document.
Updated by Michael Kay over 4 years ago
- Status changed from New to In Progress
I now have a sample document to work with, containing anonymised data.
Initial observations:
-
The root element is
MedicalAgreement
, which has an assertion defined, therefore we are building a tree containing the entire document. -
The top level assertion only looks at attribute nodes: if we were able to detect this, we would avoid the need to build this tree.
-
Lots of lower level element also have assertions, so we are creating a lot of VirtualCopy objects
-
There are only two namespace declarations in the document (, and the number of NamespaceBinding objects in the tree is therefore very low.
-
Memory requirement is about 585Mb in 9.9; no improvement in 10.0
Updated by Michael Kay over 4 years ago
I've now drilled down a bit using the IDE debugger and JProfiler heap analysis. This is on the Java platform; we'll go on to confirm the results on .NET.
Everything here seems to be working as designed. Because there's an assertion on the root element, we are building the entire tree in memory. Saxon on the command line reports total memory used as 550Mb. JProfiler gives a somewhat lower figure. The bulk of the memory is occupied by a single TinyTree instance, which is created when the first assertion is encountered (on the root element).
I haven't been able to reproduce the measurement (supplied by the client) than shows a very large number (9m) of NamespaceBinding objects in the heap. In fact, I've added a counter and we're only creating 23 NamespaceBinding objects in the course of the validation.
Updated by Michael Kay over 4 years ago
We now have a new schema file and data file to work with, and I have been doing measurements on this. On the Java platform I'm seeing a validation time of around 20s and memory usage about 1100Mb as reported when run from the command line. The number of NamespaceBinding
objects created is modest (with the schema as written, it's about 10; with a modified schema that drops the "global" assertions, it increases to 1611775, because we are now building lots of small trees rather than one large one). This is still well short of the number observed in the .NET heap dump.
The number of NamespaceBinding
objects in the .NET heap dump (9,285,844) is roughly the same as the number of elements in the source document (9,324,026). This suggests to me that we may be storing one namespace per element in the constructed tree - perhaps a copy of the namespace declared on the root element, or perhaps the XML namespace. Since this accounts for 531Mb of the space occupancy, it's certainly something we need to investigate.
I found no significant difference when changing the assertion on line 50 to StartDate le ChangeDate or true()
, so that there are no validation errors. Removing the "global" assertions (on types MedicalAgreementType
and ProvidersType
) gives a memory reduction to 388Mb, as expected because we are now building lots of small trees for assertion testing rather than one large one.
I also ran on 10.0, unfortunately this seems to increase both the memory usage and the elapsed time. This is worth investigating.
Updated by Michael Kay over 4 years ago
We've been running this through the debugger on .NET, and we found that when using the Microsoft pull parser it is indeed adding a namespace declaration for the default namespace xmlns="http://www.ahsa.com.au/schema/medicalagreements"
to every element.
The code that does this is in StartTagBuffer.startElement()
which is the entry point to the validation pipeline, but it only does it if the flag NAMESPACES_OK is not set. With a push parser, the ReceivingContentHandler
sets this flag, but with a pull parser, the PullPushTee
which translates pull events to push events is not setting it. We are now experimenting by setting this flag and seeing if it causes any problems.
Note: workaround - use the JAXP parser instead of the Microsoft parser, which will happen for example if you use SetSource(Stream, Uri)
in preference to SetSource(XmlReader)
.
Updated by Michael Kay over 4 years ago
Did some further testing on 10.0 and the extra time and memory used relative to 9.9 seems to have been a measurement glitch.
I found that on both 9.9 and 10.0, the trace() calls in WatchManager are showing up in the Java profile: although the trace() call does nothing, its arguments are being evaluated, and this involves fairly expensive QName-to-string conversions. I've therefore commented the trace() calls out.
I found that on 9.9, validation with a pull parser on Java (specifically, Woodstox) shows the same effect as using the Microsoft pull parser on .NET, namely 9324037 excess NamespaceBinding objects on the assertion tree; further that the proposed patch in PullPushTee.java fixes this, and causes no obvious regression. (Running Saxon with a StAX parser, however, is woefully under-tested).
Updated by Michael Kay over 4 years ago
Running on 9.9, I found that running with byte code generation disabled (--generateByteCode:off) was faster by 8-10%, but made no difference to memory usage.
The Java profile showed a measurable cost for the HashMap update at XSIAttributeHandler#117, which maintains a list of all namespaces encountered for element and attribute names, and also for the namespaceStatus HashMap maintained by the EnterpriseConfiguration. In both cases we are doing a lot of updates to these HashMaps, most of which don't actually change the existing value; and we are doing very few retrievals. One suspects there is a better way of doing this. But it's a 1% effect, so perhaps not worth bothering with.
The space occupied by line and column numbers, used for diagnostics when assertions fail, is quite large: two arrays of 16,388,096 integers each; a total of about 128Mb.
The .NET heap dump shows 28 instances of IntHashMap occupying a total of 200Mb. I'm having trouble identifying all of these, but the dominant one appears to be the knownBaseUris
table, which contains 1,847,098 entries, but has space for int[8388608] plus Object[8388608]. This is surely a nonsense because all the entries contain the same base URI.
Updated by Michael Kay over 4 years ago
Looking at the knownBaseUris
table, an entry is added to this table when we compute the base URI of an element node; the value is retained because computing it is potentially expensive. The reason we are computing the base URI is not (although it might be) because any of the assertions wants to know the base URI; it is because when we create a VirtualCopy
of a subtree, the system ID of the virtual copy is set to the base URI of the corresponding root node in the real tree. So we should be able to avoid this by computing the system ID of the virtual copy lazily.
It still seems a nonsense that if a million elements in a document have the same base URI, we store a table with a million entries. Of course, the reason we do this is that we don't actually know that they all have the same base URI. But it's easy enough to determine that this must be the case: if all the nodes in a tree have the same system ID (ie, no external entities and no XInclude directives), and if there are no xml:base
attributes, then all the nodes necessarily have the same base URI. It would seem easy enough to check for these two conditions during document building.
Updated by Vladimir Nesterovsky over 4 years ago
The number of NamespaceBinding objects created is modest (with the schema as written, it's about 10; with a modified schema that drops the "global" assertions, it increases to 1611775, because we are now building lots of small trees rather than one large one).
It is interesting to note that this problem calls for another forum thread Saxon Performance. If to define a global pool of NamespaceBinding instances then the number of living instance will be smaller. This is good for single processing within single or multiple trees, and as well to serves if there are multiple processing in parallel.
Updated by Michael Kay over 4 years ago
Validimir, in this instance the redundancy is within a single document, and reusing declarations within a single document (or reusing sets of declarations, as we do in 10.0) is quite sufficient to solve the problem. But thanks for reminding me about that other thread, it does have some interesting ideas.
Updated by Michael Kay over 4 years ago
- Status changed from In Progress to Resolved
- Priority changed from Low to Normal
- Applies to branch 10 added
- Fix Committed on Branch 10, 9.9 added
Resolved with the following changes/improvements.
-
(9.9 only) When using a pull parser (both Java and .NET) we avoid creating the redundant namespace declarations
-
(10 only) The TinyBuilder detects when all elements in the tree have a uniform base URI, and this short-circuits the way getBaseURI() on an element is computed, saving significantly on both time and space.
-
(10 only) The systemId of a VirtualCopy is now computed lazily, avoiding the potentially expensive call on getBaseURI() on the base tree if the system ID is not actually needed. (The benefit of this is reduced now that (2) is implemented, but it's still worth doing).
Updated by O'Neil Delpratt over 4 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 10.1 added
Bug fix committed in the Saxon 10.1 maintenance release.
Updated by O'Neil Delpratt about 4 years ago
- Status changed from Resolved to Closed
- Fixed in Maintenance Release 10.2, 9.9.1.8 added
- Fixed in Maintenance Release deleted (
10.1)
Bug fix applied on the Saxon 9.9.1.8 maintenance release.
Please register to edit this issue