Project

Profile

Help

Bug #4982

Loading xml schemas which are stored inside a zip archive is very slow compared to Xerces

Added by Tomas Vanhala 6 months ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
2021-05-04
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
10, trunk
Fix Committed on Branch:
10, trunk
Fixed in Maintenance Release:
Platforms:

Description

We have an in-house application which, prior to processing an xml document, validates it against the xml schema. The xml schemas are stored in zip archives, which we obtain "out of band" from associated parties. We make the schemas available to our application by copying the zip archives to an appropriate location (file path).

We have used Xerces for validation, but now we wish to move to Saxon-EE. We have discovered that Saxon is very slow when loading the xml schema files. We are using version 10.3.

I have attached a small demo application that measures the time it takes for the Xerces and Saxon implementations to load a set of xml schema files. (You will need to adjust the paths to the zip file and the license file.)

Apart from Saxon being very slow, we also observe that Saxon calls LSResourceResolver more often than Xerces.

Can the performance of Saxon be improved?

SaxonBugDemo.java (8.69 KB) SaxonBugDemo.java small demo app Tomas Vanhala, 2021-05-04 10:01
SchemaValidatorImplTest_zipped_schemas.zip (31.9 KB) SchemaValidatorImplTest_zipped_schemas.zip schemas used by the demo app Tomas Vanhala, 2021-05-04 10:02
SaxonBugDemo.zip (29.8 MB) SaxonBugDemo.zip The ZIP archive containing refined demo code with all dependencies included. Tomas Vanhala, 2021-06-07 16:45
SaxonStack.pdf (493 KB) SaxonStack.pdf Tomas Vanhala, 2021-07-29 16:16
SaxonBugDemov2.zip (29.9 MB) SaxonBugDemov2.zip Demo code with improved packaging Tomas Vanhala, 2021-08-02 16:36
demo.log (44.9 KB) demo.log Log created by running the demo code Tomas Vanhala, 2021-08-02 16:45

History

#1 Updated by Michael Kay 6 months ago

Thanks, we'll take a look at this.

From a very quick first glance, my immediate reactions are:

(a) do you really need to set the MULTIPLE_SCHEMA_IMPORTS option? Because this is going to do what it says: read the same schema document multiple times. You should only need it if you have several schema documents with the same target namespace, and that's not really good practice.

(b) there are a number of instances of maxOccurs="99", or "999", or even "9999". The classic algorithm for building a finite state machine with such rules is very expensive (it's exponential in both time and space). Saxon tries to optimise it when it can by using counters, but it's not always possible and I will check to see how these cases are being handled. (Xerces has the same problem, and I think that it sometimes gives up and treats the constraint as if it were maxOccurs="unbounded"). If we do find a problem here, the best solution might be to replace the maxOccurs with an xs:assert.

#2 Updated by Tomas Vanhala 6 months ago

Thank you for the initial comments.

  1. The schema documents have been authored (by an associated party) as follows: Each xsd file which has the filename prefix "NctsDme_FITransit" (13 files) defines an "xml message", and these files share the same target namespace.

Each one of the mentioned 13 files includes and imports the same set of xsd files.

We need to set MULTIPLE_SCHEMA_IMPORTS because due to the schema design, the same xsd files are imported multiple times.

  1. About maxOccurs: We are not able to influence the maxOccurs values. The main reason we wish to move to Saxon is because of this optimisation you mention.

#3 Updated by Michael Kay 6 months ago

I've been trying to get this to run without success. I don't think I understand the strategy for URI resolution. Should all schema documents be found within the ZIP file, or is the external directory also relevant (perhaps it's just a copy of what's in the ZIP file?)

I'm pretty sure the fact that the files are in a ZIP archive isn't relevant to the problem, and just complicates the repro.

Is there a single root schema document that includes/imports all the others? It seems to start by supplying a long list of independent schema documents.

#4 Updated by Michael Kay 6 months ago

  • Status changed from New to AwaitingInfo

#5 Updated by Tomas Vanhala 5 months ago

The demo application which we created tries to load the schema documents from within the ZIP file.

I assume that for Saxon-EE the fact that the files reside in a ZIP archive is not relevant. However, the ZIP archive could make loading a file more "expensive". This could in part explain the significantly increased loading time compared to Xerces.

The demo we provided earlier had some dependencies to our internal code. Please find an uploaded ZIP archive containing refined demo code with all dependencies included. To invoke it:

javac -cp ".:./saxon-ee-10.3.jar:./rt.jar:" SaxonBugDemo.java java -cp ".:./saxon-ee-10.3.jar:./rt.jar:./xercesImpl-2.12.1.jar" SaxonBugDemo

You need to modify the path to the license file before compiling.

About the content of the ZIP archive containing the xsd files:

There are indeed a number of root schema documents. Inside the ZIP archive, they are in the directory external/ncts/dme/v1_6. The filenames of the root schema documents start with "NctsDme_FI". An example of such a root schema document is:

NctsDme_FITransitDeclaration.xsd

It contains the following:

<xs:include schemaLocation="NctsDme_QualifiedType.xsd"/>
<xs:import namespace="http://tulli.fi/schema/external/common/dme/v1_2/qdt" schemaLocation="../../../common/dme/v1_2/Dme_QualifiedType.xsd"/>
<xs:import namespace="http://tulli.fi/schema/external/common/dme/v1_2/cdt" schemaLocation="../../../common/dme/v1_2/Dme_CodeListType.xsd"/>
<xs:import namespace="http://tulli.fi/schema/external/common/dme/v1_2/udt" schemaLocation="../../../common/dme/v1_2/Dme_UnqualifiedType.xsd"/>

In this sample ZIP archive, in total there are 13 root schema documents, each one of them does the same includes/imports.

As I noted earlier, the number of times Saxon calls LSResourceResolver (due to includes/imports) is much greater than Xerces,

#7 Updated by Michael Kay 5 months ago

  • Status changed from AwaitingInfo to In Progress

#8 Updated by Tomas Vanhala 3 months ago

While waiting for further progress, we did some investigation by using a debugger. Looking at the call stack, we can see that the SchemaReader class is called recursively with the same parameters.

Perhaps some caching could be added to the SchemaReader or some other class upper in the call stack? The caching would have the purpose that for an included/imported xsd file, it would not be necessary to resolve the physical location of the file reread it repetitively.

#9 Updated by Michael Kay 3 months ago

As I explained before, I think it is reading the same schema document repeatedly because of the MULTIPLE_SCHEMA_IMPORTS option. This has the effect of overriding some of the caching that is normally being done.

My previous attempt to investigate this didn't get very far because I wasn't able to simulate what you are actually trying to do. If you could make another attempt to explain clearly what I need to do to reproduce the effect you are seeing, that would be very helpful.

#10 Updated by Tomas Vanhala 3 months ago

Perhaps we have not fully understood MULTIPLE_SCHEMA_IMPORTS. We shall test whether the demo code which we provided to you also works with the provided sample data when MULTIPLE_SCHEMA_IMPORTS set to false.

We shall then attempt to give better instructions on how to run the demo code.

The purpose of the demo code is to simulate a module of our application. Our application processes xml "messages" (documents), and one processing step is validation against the xml schema of the message.

In an earlier update I wrote that the sample data provided contains a "root" schema document named NctsDme_FITransitDeclaration.xsd, and 12 other such documents.

When our application receives a message with the root element "FITransitDeclaration" in the namespace "http://tulli.fi/schema/external/ncts/dme/v1", then it will validate the xml against the schema defined in the file NctsDme_FITransitDeclaration.xsd.

We do not have any issues with the validation itself. Our issue is that when using Saxon-EE, our application is extremely slow to start up due to the amount of time it takes to load the xml schemas for the various types of "messages" which we must be ready to process.

#11 Updated by Michael Kay 3 months ago

You should only need to use MULTIPLE_SCHEMA_IMPORTS if you have more than one schema document with the same targetNamespace. My recommendation would be to have a one-to-one mapping between targetNamespace and schema document; if you can't achieve that, then structure it so that for each targetNamespace, there is one document (the "master" document for the namespace) that references the others using xs:include, and when you use xs:import for a namespace, always give the location of the master document. If you do that, you shouldn't need MULTIPLE_SCHEMA_IMPORTS, and Saxon can then know that if it's already imported a namespace, it has loaded all the schema documents for that namespace.

#12 Updated by Tomas Vanhala 3 months ago

Thank you for the guidance on this. We shall verify whether the schemas which we need to work with match this criteria.

Can you comment on the sample data which we provided to you in the file SchemaValidatorImplTest_zipped_schemas.zip.

The zip archive contains the file "external/ncts/dme/v1_6/FITransitDeclaration.xsd", the target namespace of which is "http://tulli.fi/schema/external/ncts/dme/v1".

The zip archive also contains other files, e.g. "external/ncts/dme/v1_6/FITransitUnloadReport.xsd, the target namespace of which is likewise ""http://tulli.fi/schema/external/ncts/dme/v1".

These are both "parallel" master documents for the namespace. They could be replaced with one single master document, but the schema author has chosen this approach.

Is this approach compatible or incompatible with MULTIPLE_SCHEMA_IMPORTS ?

#13 Updated by Tomas Vanhala 3 months ago

With reference to the previous, we did some testing. With the sample schemas which are attached to this issue, we are indeed able to set MULTIPLE_SCHEMA_IMPORTS to false. Thank you for suggesting that.

When setting MULTIPLE_SCHEMA_IMPORTS to false, the xml schemas which are loaded by our application code generate quite many warnings like this:

Warning at xs:import on line 5 column 142 of NctsDme_QualifiedType.xsd: SXWN9009 The schema document at ../../../common/dme/v1_2/Dme_QualifiedType.xsd is not being read because schema components for this namespace are already available. To force the schema document to be read, set --multipleSchemaImports:on

This can be demonstrated by the sample schemas and the output produced by our demo. When running the demo, the first master document is loaded:

Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI null, publicId null, systemId NctsDme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd

As the master document is loaded, all imports are processed by loading files as expected (per schemaLocation).

Then, the next master document is loaded:

Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI null, publicId null, systemId NctsDme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitError.xsd

Now for each imported xsd, a SXWN9009 warning is generated.

Based on the previous discussion, this warning can perhaps be considered to be a positive confirmation that the caching of already loaded xsd files in Saxon is working as intended. We also note that the time required for Saxon to load the sample schemas was reduced by approximately 20% compared to MULTIPLE_SCHEMA_IMPORTS set to true.

However, we are still having a serious performance drawback compared to the Xerces implementation which we have been using up to now.

Our sample code first loads the sample schemas using the Xerces implementation, and on my test run does that in 1946 milliseconds.

It then proceeds to load the same sample schemas using the Saxon implementation, and on my test run does that in 8814 milliseconds.

With our real application and a much larger set of schemas, we are facing a situation where the performance disadvantage is much more severe than 4x. When using the Saxon implementation, we cannot successfully start up our application in our integration environment. On each attempt, we have had to terminate our application after a few hours of startup time. We are also looking into whether Saxon might consume significantly more heap than Xerces, and we therefore could be too tight on Java heap or server RAM.

We would appreciate if you could compare the Saxon and Xerces JAXP implementations and analyse Saxon’s significant performance disadvantage. We hope our demo code can be of help.

How to run our demo code:

  1. Put these files into the same directory:

SaxonBugDemov2.zip SchemaValidatorImplTest_zipped_schemas.zip saxon-license.lic

  1. Excecute the following commands:

unzip SaxonBugDemov2.zip cd SaxonBugDemo javac -cp ".:./saxon-ee-10.3.jar:./rt.jar:" SaxonBugDemo.java java -cp ".:./saxon-ee-10.3.jar:./rt.jar:./xercesImpl-2.12.1.jar" SaxonBugDemo

#14 Updated by Tomas Vanhala 3 months ago

We have done some further work on our side. Our application can now be deployed successfully into our Integration Test environment when using the Saxon JAXP implementation.

We now have MULTIPLE_SCHEMA_IMPORTS set to false, which did improve Saxon performance. However, there still remains significant impact on the application startup time compared to Xerces.

In our Integration Test environment, during startup our application loads 70 xml schema sets similar to the one included with our demo code (attachment SchemaValidatorImplTest_zipped_schemas.zip).

When our applicaton starts up using the Xerces JAXP implementation, loading of the 70 xml schema sets takes 2 minutes.

When our application starts up using the Saxon JAXP implementation, loadling of the same 70 xml schema sets takes 16 minutes 35 seconds.

We have made the selection of the JAXP selection a configurable parameter. The above figures have been obtained using the same software build, on the same server.

We would be delighted, if someone with access to Saxon sources analysed why Xerces is so much more efficient in this regard.

#15 Updated by Michael Kay 3 months ago

I'm now running the repro successfully, thanks.

You're populating schemaFactory.newSchema() with an array containing 18 StreamSource objects referring to schema documents in the ZIP file. Between them these files contain 13 xs:include directives and 46 xs:imports.

Xerces calls the resolver just 4 times.

I wrote a subclass of ByteArrayInputStream that monitors when each of the streams is closed. For Xerces this shows that the four resolver calls actually occur while processing the first schema document:

Closing zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/ncts/dme/v1, publicId null, systemId NctsDme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolved xsd file content: publicId null systemId zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/qdt, publicId null, systemId ../../../common/dme/v1_2/Dme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/udt, publicId null, systemId Dme_UnqualifiedType.xsd, baseURI zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_UnqualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/cdt, publicId null, systemId Dme_CodeListType.xsd, baseURI zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_CodeListType.xsd baseURI null
Closing zip:/external/ncts/dme/v1_6/NctsDme_FITransitError.xsd
Closing zip:/external/ncts/dme/v1_6/NctsDme_FITransitRelease.xsd
... etc (closing the remaining files)

The first schema document processed is NctsDme_FITransitArrivalNotification.xsd, and the four calls on the resolver correspond to the four xs:include and xs:import directives in that document.

The second schema document processed is NctsDme_FITransitError.xsd. This has exactly the same xs:include and xs:import directives. Presumably Xerces resolves these from cache, and likewise for all subsequent includes and imports, therefore making no further calls on the resolver.

By contrast, Saxon calls the resolver 101 times. It makes 10 calls while processing the first document:

Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI null, publicId null, systemId NctsDme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolved xsd file content: publicId null systemId zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/qdt, publicId null, systemId ../../../common/dme/v1_2/Dme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/udt, publicId null, systemId Dme_UnqualifiedType.xsd, baseURI zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_UnqualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/cdt, publicId null, systemId Dme_CodeListType.xsd, baseURI zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_CodeListType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/udt, publicId null, systemId Dme_UnqualifiedType.xsd, baseURI zip:/external/common/dme/v1_2/Dme_CodeListType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_UnqualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/cdt, publicId null, systemId ../../../common/dme/v1_2/Dme_CodeListType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_CodeListType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/udt, publicId null, systemId ../../../common/dme/v1_2/Dme_UnqualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_QualifiedType.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_UnqualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/qdt, publicId null, systemId ../../../common/dme/v1_2/Dme_QualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_QualifiedType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/cdt, publicId null, systemId ../../../common/dme/v1_2/Dme_CodeListType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_CodeListType.xsd baseURI null
Resolver called with parameters type http://www.w3.org/2001/XMLSchema, namespaceURI http://tulli.fi/schema/external/common/dme/v1_2/udt, publicId null, systemId ../../../common/dme/v1_2/Dme_UnqualifiedType.xsd, baseURI zip:/external/ncts/dme/v1_6/NctsDme_FITransitArrivalNotification.xsd
Resolved xsd file content: publicId null systemId zip:/external/common/dme/v1_2/Dme_UnqualifiedType.xsd baseURI null

Clearly many of these requests are duplicates. So why isn't Saxon finding the files in cache?

#16 Updated by Michael Kay 3 months ago

It turns out Saxon is first calling the resolver to get a document, then it is checking to see whether the document is already in cache.

I suspect the original reasoning for this was that the LSResourceResolver interface supplies the (relative) systemId and the baseURI as separate arguments, implying that the LSResourceResolver is not obliged to return the same resource every time the same absolutised URI is called for. However, in cases where the LSResourceResolver is actually fetching a document from disk or from the network, this is clearly a very inefficient way of handling duplicates.

The problem only arises with a user-written LSResourceResolver, because if Saxon's standard SchemaURIResolver is used, this checks in the cache before accessing any external resources. (This avoids the redundancy of absolutising the URI twice, once before calling the resolver and once within the resolver implementation. But this is a fairly minor saving.)

In fact it looks as if the original design assumed that caching was the responsibility of the resolver, not of Saxon itself.

#17 Updated by Michael Kay 3 months ago

Fixing the code for xs:import and xs:include to look in the cache before calling the resolver makes surprisingly little difference. We still have 95 resolver calls, and it still takes 1600ms (compared with 613ms for Xerces).

We now have 4 resolver calls while processing the first schema document, matching those in Xerces.

When we process the second schema document (supplied in the array of 18), includes and imports are not found in the cache. The cache is held within a com.saxonica.ee.schema.SchemaCompiler object, and a new one of these is created for each of the 18 supplied documents. It's not clear whether there is any good reason for this.

#18 Updated by Michael Kay 3 months ago

Fixing this (rather crudely for the time being) so that we use the same SchemaCompiler throughout brings the number of calls to the resolver down to 4, and reduces the total compile time to 1204ms - still slower than the Xerces time, but hopefully now more acceptable.

In fact a lot of this time is Java warm-up time. If we run the whole thing 100 times, the time for later runs is just 125ms.

There are probably many improvements that can be made to compile-time schema performance, but I don't think they come under this issue.

#19 Updated by Michael Kay 3 months ago

Running Xerces and Saxon alternating, 100 times each to eliminate JVM warm-up effects, I'm now seeing typical timings of 127ms for Xerces and 163ms for Saxon. I'll now regression-test this change, and if no problems are found, deem it solved.

#20 Updated by Michael Kay 3 months ago

The XSD test suite is now showing

Result: 39350 passed, 43 failed

so more investigation is needed.

False alarm. I was running the tests with -v1.0, i.e. expecting XSD 1.0 behaviour, but Saxon 10.x no longer supports XSD 1.0 behaviour where it is different (most of these are tests that are expected to fail under 1.0 and succeed under 1.1).

With -v1.1 I get

Result: 41537 passed, 0 failed

None of these tests, of course, is testing the API which is where the changes have been made.

#21 Updated by Michael Kay 3 months ago

  • Status changed from In Progress to Resolved
  • Applies to branch 10, trunk added
  • Fix Committed on Branch 10, trunk added

#22 Updated by Tomas Vanhala 3 months ago

After the changes, your testing show typical timings where Saxon consumes only moderately more time than Xerces (approximately 28%, 163ms vs 127ms). We find the improved performance level to be fully acceptable for our purposes.

Thank you very much for the work. Will this be included in the next Saxon 10.x maintenance release?

#23 Updated by Michael Kay 3 months ago

Yes, these fixes will be in the next 10.x maintenance release. We don't yet have a schedule for that, but the pressure is building up.

#24 Updated by O'Neil Delpratt about 1 month ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 10.6 added

Bug fix applied in the Saxon 10.6 maintenance release

Please register to edit this issue

Also available in: Atom PDF