Project

Profile

Help

Bug #1617

closed

Slow validation of a schema hierarchy

Added by Adrian Buza about 12 years ago. Updated almost 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2012-08-28
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Validating the XML instance: "Relation_Image_List13Aug12 .xml" or the XML schema: "relations_image.xsd" (from the attached ZIP file) takes around 1 minute with Saxon-EE (9.4 and 9.3), but validates instantaneously (<1 second) with Xerces. We've made this comparison within Oxygen, but Saxon-EE exhibits the same performance in the command line.


Files

CytometryML_Schemas14Aug12.zip (295 KB) CytometryML_Schemas14Aug12.zip Adrian Buza, 2012-08-28 14:16
Actions #1

Updated by Michael Kay about 12 years ago

  • Status changed from New to In Progress
  • Assignee set to Michael Kay

Many thanks for reporting this. We will investigate. We're always pleased to have examples of real jobs whose performance needs to be improved.

Actions #2

Updated by Michael Kay about 12 years ago

The problem is in the handling of this type, defined in solution.xsd:

    <complexType name="Reagents_Type">

	<sequence minOccurs="1" maxOccurs="100">

		<element name="Solute" type="solution:Solute_Type" minOccurs="0" maxOccurs="100"/>

		<element name="Solvent" type="solution:Solvent_Type" minOccurs="0" maxOccurs="10"/>

	</sequence>

</complexType>

Because of the two nested maxOccurs values, this is a very demanding content model, and the classic textbook algorithms perform very poorly on it (they use a lot of time and a lot of memory). Saxon uses a smarter approach with counters for simple non-nested maxOccurs models, but that doesn't work in a case like this which is "weakly ambiguous". For example the model allows a sequence of 50 Solute elements followed by a Solvent element followed by 350 Solute elements, and matching this involves potential back-tracking. Henry Thomson has published an algorithm for content models with nested counters which can handle more cases efficiently than Saxon can, but it still breaks down on some "weakly ambiguous" models - I'm not sure whether it will handle this one.

I've no idea how Xerces handles this case!

I think the biggest improvement we could make with such schemas is to compile the finite state machine for a complex type lazily. At present the cost is being incurred whether or not the type is actually used in a given validation episode. One disadvantage of lazy compilation is that we detect UPA ambiguities as part of the process of building the finite state machine, so UPA violations would not be detected in a complex type unless it is actually used. We could perhaps get around this with a switch, similar to the one in Xerces, whereby complete validation of the schema is performed only if requested.

Actions #3

Updated by Michael Kay over 10 years ago

I suspect the major cost here is in generating the finite state machine (which doesn't call into the class where Saxon is able to use counters). Handling it the same way we now handle regular expressions, by a crude backtracking mechanism, would avoid the compile time cost, and would probably perform reasonably at run-time except in worst-case scenarios. In fact it's very tempting to translate validation against this content model into a regular expression match; the main disadvantage would be that with the FSM approach, diagnostics are much better when the instance is invalid.

Actions #4

Updated by Michael Kay about 10 years ago

We should consider handling such content models the way the regex engine handles them, by recursive backtracking - not using an FSM at all. The main disadvantage is that it isn't streamable; also it's more difficult to prove subsumption.

Actions #5

Updated by O'Neil Delpratt almost 10 years ago

  • Status changed from In Progress to Closed

We are closing this for now and opening a new bug issue on our future release project to ensure this does not get forgotten.

Please register to edit this issue

Also available in: Atom PDF