Bug #4982

Loading XML schemas stored inside a ZIP archive is very slow compared to Xerces

Added by Tomas Vanhala 13 days ago. Updated 2 days ago.

Status: AwaitingInfo
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date: 2021-05-04
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:

Description

We have an in-house application which, before processing an XML document, validates it against its XML schema. The schemas are stored in ZIP archives, which we obtain "out of band" from associated parties. We make the schemas available to our application by copying the ZIP archives to an appropriate location (file path).
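For anyone reproducing this setup, here is a minimal sketch of compiling one schema entry straight out of a ZIP archive with the JDK's built-in (Xerces-based) SchemaFactory. The file and entry names are invented, and the jar: systemId is one conventional way to let relative xs:include/xs:import references resolve against other entries in the same archive:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class ZipSchemaLoad {

    /**
     * Compile one schema document from inside a ZIP archive. The stream carries
     * the document itself; the jar: systemId gives the parser a base URI so that
     * relative xs:include/xs:import references resolve inside the archive.
     */
    static Schema load(Path zip, String entry) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try (ZipFile zipFile = new ZipFile(zip.toFile());
             InputStream in = zipFile.getInputStream(zipFile.getEntry(entry))) {
            String systemId = "jar:" + zip.toUri() + "!/" + entry;
            return factory.newSchema(new StreamSource(in, systemId));
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a throwaway ZIP holding one trivial schema, then compile it.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                   + "<xs:element name='doc' type='xs:string'/></xs:schema>";
        Path zip = Files.createTempFile("schemas", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("doc.xsd"));
            out.write(xsd.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        Schema schema = load(zip, "doc.xsd");
        if (schema == null) throw new AssertionError("schema not compiled");
        System.out.println("compiled schema from " + zip);
    }
}
```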

We have been using Xerces for validation, but we now wish to move to Saxon-EE. We have discovered that Saxon is very slow when loading the XML schema files. We are using version 10.3.

I have attached a small demo application that measures the time it takes for the Xerces and Saxon implementations to load a set of XML schema files. (You will need to adjust the paths to the ZIP file and the license file.)
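The timing comparison can be sketched with a minimal stdlib-only harness like the one below. It times the JDK's built-in Xerces-derived factory; the Saxon-EE side is only indicated in a comment because it needs the Saxon-EE jar and a license on the classpath, and the factory class name shown there is an assumption, not taken from the attached demo:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class SchemaLoadTimer {

    /** Compile the given schema text with the given factory; return elapsed nanoseconds. */
    static long timeCompile(SchemaFactory factory, String xsd) throws Exception {
        long start = System.nanoTime();
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        if (schema == null) throw new AssertionError("schema compilation failed");
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                   + "<xs:element name='doc' type='xs:string'/></xs:schema>";

        // JDK built-in (Xerces-derived) implementation.
        SchemaFactory xerces = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        System.out.println("Xerces load: " + timeCompile(xerces, xsd) + " ns");

        // For Saxon-EE the factory would be obtained along these lines
        // (assumed class name; requires Saxon-EE and a license on the classpath):
        // SchemaFactory saxon = SchemaFactory.newInstance(
        //         XMLConstants.W3C_XML_SCHEMA_NS_URI,
        //         "com.saxonica.jaxp.SchemaFactoryImpl", null);
    }
}
```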

Apart from Saxon being very slow, we also observe that Saxon calls LSResourceResolver more often than Xerces.
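The resolver-call counts can be observed with a small stdlib-only harness such as the following (the schema content and names are invented). It wires a counting LSResourceResolver into the JDK's Xerces-based SchemaFactory; the same resolver could be attached to Saxon's JAXP factory to compare the two counts:

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;

public class ResolverCallCounter {

    /** Compile a root schema whose import must go through the resolver; return the call count. */
    public static int countResolverCalls() throws Exception {
        String imported = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
                        + " targetNamespace='urn:example:b'>"
                        + "<xs:element name='b' type='xs:string'/></xs:schema>";
        String root = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                    + "<xs:import namespace='urn:example:b' schemaLocation='b.xsd'/>"
                    + "<xs:element name='a' type='xs:string'/></xs:schema>";

        // DOMImplementationLS is the standard way to manufacture LSInput objects.
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        AtomicInteger calls = new AtomicInteger();

        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setResourceResolver((type, namespaceURI, publicId, systemId, baseURI) -> {
            calls.incrementAndGet();                 // count every resolution request
            LSInput input = ls.createLSInput();
            input.setCharacterStream(new StringReader(imported));
            input.setSystemId(systemId);
            return input;
        });
        factory.newSchema(new StreamSource(new StringReader(root)));
        return calls.get();
    }

    public static void main(String[] args) throws Exception {
        int calls = countResolverCalls();
        if (calls < 1) throw new AssertionError("resolver never invoked");
        System.out.println("resolver calls: " + calls);
    }
}
```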

Can the performance of Saxon be improved?

Attachments:
SaxonBugDemo.java (8.69 KB): small demo app. Tomas Vanhala, 2021-05-04 10:01
SchemaValidatorImplTest_zipped_schemas.zip (31.9 KB): schemas used by the demo app. Tomas Vanhala, 2021-05-04 10:02

History

#1 Updated by Michael Kay 13 days ago

Thanks, we'll take a look at this.

From a very quick first glance, my immediate reactions are:

(a) do you really need to set the MULTIPLE_SCHEMA_IMPORTS option? Because this is going to do what it says: read the same schema document multiple times. You should only need it if you have several schema documents with the same target namespace, and that's not really good practice.

(b) there are a number of instances of maxOccurs="99", or "999", or even "9999". The classic algorithm for building a finite state machine with such rules is very expensive (it's exponential in both time and space). Saxon tries to optimise it when it can by using counters, but it's not always possible and I will check to see how these cases are being handled. (Xerces has the same problem, and I think that it sometimes gives up and treats the constraint as if it were maxOccurs="unbounded"). If we do find a problem here, the best solution might be to replace the maxOccurs with an xs:assert.
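As a sketch of that last suggestion (hypothetical type and element names; note that xs:assert is XSD 1.1 only, which Saxon-EE supports but the JDK's built-in Xerces does not):

```xml
<!-- Before: maxOccurs="9999" can force a very large finite state machine -->
<xs:complexType name="Items">
  <xs:sequence>
    <xs:element name="item" type="xs:string" minOccurs="0" maxOccurs="9999"/>
  </xs:sequence>
</xs:complexType>

<!-- After: unbounded occurrence, with the limit enforced by an XSD 1.1 assertion -->
<xs:complexType name="Items">
  <xs:sequence>
    <xs:element name="item" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:assert test="count(item) le 9999"/>
</xs:complexType>
```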

#2 Updated by Tomas Vanhala 13 days ago

Thank you for the initial comments.

  1. The schema documents have been authored (by an associated party) as follows: each XSD file whose filename begins with the prefix "NctsDme_FITransit" (13 files) defines an "XML message", and these 13 files share the same target namespace.

Each of these 13 files includes and imports the same set of XSD files.

We need to set MULTIPLE_SCHEMA_IMPORTS because, owing to this schema design, the same XSD files are imported multiple times.

  2. About maxOccurs: we are not able to influence the maxOccurs values. The optimisation you mention is in fact the main reason we wish to move to Saxon.

#3 Updated by Michael Kay 5 days ago

I've been trying to get this to run, without success. I don't think I understand the strategy for URI resolution. Should all schema documents be found within the ZIP file, or is the external directory also relevant (perhaps it's just a copy of what's in the ZIP file)?

I'm pretty sure the fact that the files are in a ZIP archive isn't relevant to the problem, and just complicates the repro.

Is there a single root schema document that includes/imports all the others? It seems to start by supplying a long list of independent schema documents.

#4 Updated by Michael Kay 2 days ago

  • Status changed from New to AwaitingInfo

