Bug #4507: TinyTree used for assertion checking during schema validation includes excessive NamespaceBinding objects - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #4507

closed

TinyTree used for assertion checking during schema validation includes excessive NamespaceBinding objects

Added by Michael Kay over 4 years ago. Updated about 4 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2020-03-30

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

10, 9.9

Fix Committed on Branch:

10, 9.9

Fixed in Maintenance Release:

10.2, 9.9.1.8

Platforms:

Description

Reported by private email: a validation run, which appears to have an assertion defined on the root element node, is consuming excessive memory. Heap analysis shows a large number of redundant NamespaceBinding objects, accounting for a large proportion of the tree size.

This is Saxon 9.9 on .NET.

The data is confidential medical data so we don't have a repro.

Actions

Copy link

Updated by Michael Kay over 4 years ago

I've been doing some experiments.

First: validating a document of the form

<x:root xmlns:x="http://xxxxxxxxxxx.com/ns">
  <a/>
  (many repetitions)
  <a/>
</x:root>

with a single assertion defined on the root element.

Under 10.0, the tree used for assertion checking contains a single NamespaceMap object, as one might expect.

With 9.9, the tree contains 4 NamespaceBinding objects, which is rather wierd. Two are for the "x" namespace; two for the "xml" namespace. All four are associated with the root element. This happens because the validator is issuing the namespace events for the root element twice: once from ValidationStack#263, when the root element is constructed, and once from StartTagBuffer.startContent(). This is rather peculiar, but it doesn't seem to cause any problems, and it doesn't account for the thousands of namespace binding objects observed in the heap dump. In particular, the number of NamespaceBinding objects doesn't increase with the number of <a/> elements.

I then tried a document of the form:

<root xmlns="http://xxxxxxxxxxx.com/ns">
  <a xmlns:x="http://xxxxxxxxxxx.com/ns"/>
  (repeated)
  <a xmlns:x="http://xxxxxxxxxxx.com/ns"/>
</root>

In 10.0, the resulting TinyTree contained just two NamespaceMap objects: one holding the single binding xmlns="http://xxxxxxxxxxx.com/ns", the other holding two bindings: xmlns="http://xxxxxxxxxxx.com/ns" xmlns:x="http://xxxxxxxxxxx.com/ns". The code has successfully shared the NamespaceMap for different elements with the same in-scope namespaces.

In 9.9, the tree holds 2+n NamespaceBinding objects, where n is the number of <a/> elements. There is no attempt to common these up.

Conclusion so far: the 10.0 design handles this case a lot better than 9.9; but 9.9 is (with a minor glitch, which doesn't seem relevant to the problem) working as designed. The large number of NamespaceBinding objects observed in the heap dump might simply be due to the presence of a large number of namespace declarations in the source document.

Actions

Copy link

Updated by Michael Kay over 4 years ago

Status changed from New to In Progress

I now have a sample document to work with, containing anonymised data.

Initial observations:

The root element is MedicalAgreement, which has an assertion defined, therefore we are building a tree containing the entire document.
The top level assertion only looks at attribute nodes: if we were able to detect this, we would avoid the need to build this tree.
Lots of lower level element also have assertions, so we are creating a lot of VirtualCopy objects
There are only two namespace declarations in the document (, and the number of NamespaceBinding objects in the tree is therefore very low.
Memory requirement is about 585Mb in 9.9; no improvement in 10.0

Actions

Copy link

Updated by Michael Kay over 4 years ago

I've now drilled down a bit using the IDE debugger and JProfiler heap analysis. This is on the Java platform; we'll go on to confirm the results on .NET.

Everything here seems to be working as designed. Because there's an assertion on the root element, we are building the entire tree in memory. Saxon on the command line reports total memory used as 550Mb. JProfiler gives a somewhat lower figure. The bulk of the memory is occupied by a single TinyTree instance, which is created when the first assertion is encountered (on the root element).

I haven't been able to reproduce the measurement (supplied by the client) than shows a very large number (9m) of NamespaceBinding objects in the heap. In fact, I've added a counter and we're only creating 23 NamespaceBinding objects in the course of the validation.

Actions

Copy link

Updated by Michael Kay over 4 years ago

We now have a new schema file and data file to work with, and I have been doing measurements on this. On the Java platform I'm seeing a validation time of around 20s and memory usage about 1100Mb as reported when run from the command line. The number of NamespaceBinding objects created is modest (with the schema as written, it's about 10; with a modified schema that drops the "global" assertions, it increases to 1611775, because we are now building lots of small trees rather than one large one). This is still well short of the number observed in the .NET heap dump.

The number of NamespaceBinding objects in the .NET heap dump (9,285,844) is roughly the same as the number of elements in the source document (9,324,026). This suggests to me that we may be storing one namespace per element in the constructed tree - perhaps a copy of the namespace declared on the root element, or perhaps the XML namespace. Since this accounts for 531Mb of the space occupancy, it's certainly something we need to investigate.

I found no significant difference when changing the assertion on line 50 to StartDate le ChangeDate or true(), so that there are no validation errors. Removing the "global" assertions (on types MedicalAgreementType and ProvidersType) gives a memory reduction to 388Mb, as expected because we are now building lots of small trees for assertion testing rather than one large one.

I also ran on 10.0, unfortunately this seems to increase both the memory usage and the elapsed time. This is worth investigating.

Actions

Copy link

Updated by Michael Kay over 4 years ago

We've been running this through the debugger on .NET, and we found that when using the Microsoft pull parser it is indeed adding a namespace declaration for the default namespace xmlns="http://www.ahsa.com.au/schema/medicalagreements" to every element.

The code that does this is in StartTagBuffer.startElement() which is the entry point to the validation pipeline, but it only does it if the flag NAMESPACES_OK is not set. With a push parser, the ReceivingContentHandler sets this flag, but with a pull parser, the PullPushTee which translates pull events to push events is not setting it. We are now experimenting by setting this flag and seeing if it causes any problems.

Note: workaround - use the JAXP parser instead of the Microsoft parser, which will happen for example if you use SetSource(Stream, Uri) in preference to SetSource(XmlReader).

Actions

Copy link

Updated by Michael Kay over 4 years ago

Did some further testing on 10.0 and the extra time and memory used relative to 9.9 seems to have been a measurement glitch.

I found that on both 9.9 and 10.0, the trace() calls in WatchManager are showing up in the Java profile: although the trace() call does nothing, its arguments are being evaluated, and this involves fairly expensive QName-to-string conversions. I've therefore commented the trace() calls out.

I found that on 9.9, validation with a pull parser on Java (specifically, Woodstox) shows the same effect as using the Microsoft pull parser on .NET, namely 9324037 excess NamespaceBinding objects on the assertion tree; further that the proposed patch in PullPushTee.java fixes this, and causes no obvious regression. (Running Saxon with a StAX parser, however, is woefully under-tested).

Actions

Copy link

#10

Updated by Michael Kay over 4 years ago

Running on 9.9, I found that running with byte code generation disabled (--generateByteCode:off) was faster by 8-10%, but made no difference to memory usage.

The Java profile showed a measurable cost for the HashMap update at XSIAttributeHandler#117, which maintains a list of all namespaces encountered for element and attribute names, and also for the namespaceStatus HashMap maintained by the EnterpriseConfiguration. In both cases we are doing a lot of updates to these HashMaps, most of which don't actually change the existing value; and we are doing very few retrievals. One suspects there is a better way of doing this. But it's a 1% effect, so perhaps not worth bothering with.

The space occupied by line and column numbers, used for diagnostics when assertions fail, is quite large: two arrays of 16,388,096 integers each; a total of about 128Mb.

The .NET heap dump shows 28 instances of IntHashMap occupying a total of 200Mb. I'm having trouble identifying all of these, but the dominant one appears to be the knownBaseUris table, which contains 1,847,098 entries, but has space for int[8388608] plus Object[8388608]. This is surely a nonsense because all the entries contain the same base URI.

Actions

Copy link

#11

Updated by Michael Kay over 4 years ago

Looking at the knownBaseUris table, an entry is added to this table when we compute the base URI of an element node; the value is retained because computing it is potentially expensive. The reason we are computing the base URI is not (although it might be) because any of the assertions wants to know the base URI; it is because when we create a VirtualCopy of a subtree, the system ID of the virtual copy is set to the base URI of the corresponding root node in the real tree. So we should be able to avoid this by computing the system ID of the virtual copy lazily.

It still seems a nonsense that if a million elements in a document have the same base URI, we store a table with a million entries. Of course, the reason we do this is that we don't actually know that they all have the same base URI. But it's easy enough to determine that this must be the case: if all the nodes in a tree have the same system ID (ie, no external entities and no XInclude directives), and if there are no xml:base attributes, then all the nodes necessarily have the same base URI. It would seem easy enough to check for these two conditions during document building.

Actions

Copy link

#12

Updated by Vladimir Nesterovsky over 4 years ago

The number of NamespaceBinding objects created is modest (with the schema as written, it's about 10; with a modified schema that drops the "global" assertions, it increases to 1611775, because we are now building lots of small trees rather than one large one).

It is interesting to note that this problem calls for another forum thread Saxon Performance. If to define a global pool of NamespaceBinding instances then the number of living instance will be smaller. This is good for single processing within single or multiple trees, and as well to serves if there are multiple processing in parallel.

Actions

Copy link

#13

Updated by Michael Kay over 4 years ago

Validimir, in this instance the redundancy is within a single document, and reusing declarations within a single document (or reusing sets of declarations, as we do in 10.0) is quite sufficient to solve the problem. But thanks for reminding me about that other thread, it does have some interesting ideas.

Actions

Copy link

#14

Updated by Michael Kay over 4 years ago

Status changed from In Progress to Resolved
Priority changed from Low to Normal
Applies to branch 10 added
Fix Committed on Branch 10, 9.9 added

Resolved with the following changes/improvements.

(9.9 only) When using a pull parser (both Java and .NET) we avoid creating the redundant namespace declarations
(10 only) The TinyBuilder detects when all elements in the tree have a uniform base URI, and this short-circuits the way getBaseURI() on an element is computed, saving significantly on both time and space.
(10 only) The systemId of a VirtualCopy is now computed lazily, avoiding the potentially expensive call on getBaseURI() on the base tree if the system ID is not actually needed. (The benefit of this is reduced now that (2) is implemented, but it's still worth doing).

Actions

Copy link

#15

Updated by O'Neil Delpratt over 4 years ago

% Done changed from 0 to 100
Fixed in Maintenance Release 10.1 added

Bug fix committed in the Saxon 10.1 maintenance release.

Actions

Copy link

#16

Updated by O'Neil Delpratt about 4 years ago

Status changed from Resolved to Closed
Fixed in Maintenance Release 10.2, 9.9.1.8 added
Fixed in Maintenance Release deleted (~~10.1~~)

Bug fix applied on the Saxon 9.9.1.8 maintenance release.

Please register to edit this issue

Actions

Send by e-mail Copy link

Also available in: Atom PDF

Project

Profile

Help

Saxon

Bug #4507

TinyTree used for assertion checking during schema validation includes excessive NamespaceBinding objects

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Vladimir Nesterovsky over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by Michael Kay over 4 years ago

Updated by O'Neil Delpratt over 4 years ago

Updated by O'Neil Delpratt about 4 years ago