Bug #3348
closedRegression in Saxon 9.7 XSLT when normalizing namespaces on large XML
Added by Nick Nunes over 7 years ago. Updated over 7 years ago.
100%
Description
I have a stylesheet designed to put an XML file in namespace normal form and to standardize some prefix bindings. When using this stylesheet on a 15mb XML file with Saxon 9.7 it appears to go into an infinite loop. Running the same transformation in 9.5 and 9.6 have execution times of approximately 13s, with 9.7 it's still running after 5m. Turning on tracing in Saxon 9.7 causes execution to finish in about 1m. I did not notice any problems using this stylesheet with 9.7 when input file sizes were smaller (up to 6mb).
Unfortunately, I can't offer much more information to diagnose with and am unable to distill this down to a toy case that I can share publicly. If this isn't a known problem is there a way I can share the relevant xsl and xml privately?
Files
openfilesandports.txt (2.87 KB) openfilesandports.txt | Nick Nunes, 2017-07-20 23:00 |
Updated by Michael Kay over 7 years ago
Switching tracing on causes certain optimizations to be suppressed, so my first impression would be that it's some optimization failure. Try running with ~~opt:0 to see if that makes a difference. Or try with Saxon 9.8 ~~ it will all help to create more data points.
You're welcome to send us the files privately to support at saxonica dot com.
Updated by Nick Nunes over 7 years ago
The failure is in both 9.7 and 9.8. Either using -opt:0 or removing -sa makes the problem go away. I'll zip up some files and send them your way.
Updated by Michael Kay over 7 years ago
Running this without any URIResolver, I am getting the following -t output:
Saxon 9.7
Saxon-EE 9.7.0.18J from Saxonica
Java version 1.8.0_121
Using license serial number V005192
Generating byte code...
Stylesheet compilation time: 3.06722s (3067.220257ms)
Processing file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml
Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/mathml/mathml3/mathml3.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Tree built in 16.016704s (16016.704256ms)
Tree size: 772774 nodes, 5400351 characters, 254915 attributes
Execution time: 16.45997s (16459.970458ms)
Memory used: 163329640
Saxon 9.8
Saxon-EE 9.8.0.3J from Saxonica
Java version 1.8.0_121
Using license serial number V005192
Stylesheet compilation time: 3.285539s (3285.539679ms)
Processing file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml
Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/mathml/mathml3/mathml3.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Tree built in 16.017399s (16017.399229ms)
Tree size: 772774 nodes, 5400351 characters, 254915 attributes
Execution time: 16.431265s (16431.265782ms)
Memory used: 86,904,920
The 16s here is consistent with the time you are seeing for 9.5 and 9.6, and since it's all tree-building time (with a document that's only 15Mb, which isn't huge nowadays) I suspect the time is spent fetching DTD resources from the web.
So basically, I'm not seeing the "infinite loop" you report with 9.7.
Can you reproduce this effect with the -t option set? Does the "infinite loo" occur before or after tree building finishes?
Is the URI resolver relevant?
Updated by Michael Kay over 7 years ago
- Category set to Performance
- Status changed from New to In Progress
- Assignee set to Michael Kay
Updated by Michael Kay over 7 years ago
- Status changed from In Progress to AwaitingInfo
Updated by Nick Nunes over 7 years ago
- File openfilesandports.txt openfilesandports.txt added
Sorry for the slow response, I've been on holiday (and will continue to be through the beginning of August).
With the -t option set, it actually did complete albeit very slowly. Here's the output:
Saxon-EE 9.7.0.19J from Saxonica
Java version 1.8.0_65
Using license serial number S005550
Using parser org.apache.xml.resolver.tools.ResolvingXMLReader
Generating byte code...
Stylesheet compilation time: 348.214409ms
Processing file:/Users/nunes/Desktop/normalize-ns-bug/9781617052439_BitsXML.xml
Building tree for file:/Users/nunes/Desktop/normalize-ns-bug/9781617052439_BitsXML.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 897.54913ms
Tree size: 772774 nodes, 5400351 characters, 254915 attributes
Execution time: 5m 5.044158s (305044.158618ms)
Memory used: 691137248
If it helps I've attached the list of open files and ports for the java process before it terminated, but after tiny tree building was complete.
Updated by Nick Nunes over 7 years ago
Without the resolver included, I'm seeing a similar timing profile:
Saxon-EE 9.7.0.19J from Saxonica
Java version 1.8.0_65
Using license serial number S005550
URIResolver.resolve href="library.normalize.namespaces.xsl" base="file:/Users/nunes/Desktop/normalize-ns-bug/normalize.namespaces.xsl"
Generating byte code...
Stylesheet compilation time: 362.688749ms
Processing file:/Users/nunes/Desktop/normalize-ns-bug/9781617052439_BitsXML.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/nunes/Desktop/normalize-ns-bug/9781617052439_BitsXML.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/mathml/mathml3/mathml3.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Tree built in 6.080676s (6080.676534ms)
Tree size: 772774 nodes, 5400351 characters, 254915 attributes
Execution time: 5m 5.480136s (305480.136328ms)
Memory used: 784554776
Updated by Michael Kay over 7 years ago
- Status changed from AwaitingInfo to In Progress
I've just realised I was running library.normalize.namespaces.xsl rather than normalize.namespaces.xsl. I have now reproduced the problem, so it should be possible to diagnose it.
Updated by Michael Kay over 7 years ago
The first thing I always do with a problem like this is to run with ~~TP:profile.html ~~ but this makes the problem go away (as you mentioned). Next step is to get a Java-level profile.
Updated by Michael Kay over 7 years ago
Java profiling shows all the time being spent in com.saxonica.ee.optim.MultiIndex.buildIndex, which is constructing an index to do join optimization, and I suspect the culprit is this expression:
for $ns in ($doc//namespace::*[. eq $ns]) ...
(I don't think the fact that you have two variables called $ns should cause a problem, but it's worth checking!)
Updated by Michael Kay over 7 years ago
It seems that $doc//namespace::* evaluates to a sequence of 2173290 namespace nodes.
The good news is that we only build this index once. The bad news is that it takes a long time.
13:07:29 BuildIndex com.saxonica.ee.optim.IndexedValue@7d9d1a19 at 91
13:12:20 Finished BuildIndex
13:12:20 BuildIndex com.saxonica.ee.optim.IndexedValue@589838eb at 150
13:12:20 Finished BuildIndex
13:12:20 BuildIndex com.saxonica.ee.optim.IndexedValue@42dafa95 at 150
13:12:20 Finished BuildIndex
13:12:20 BuildIndex com.saxonica.ee.optim.IndexedValue@6500df86 at 150
13:12:20 Finished BuildIndex
I think the problem is that the index structure we are using doesn't cope at all well when there are large numbers of duplicate keys. For each distinct key value, we create an index entry for that key, where the value is an IntArraySet holding the positions of the items with that key; an IntArraySet is basically a sorted list of integers. This means that adding a new entry when there are many duplicates is potentially expensive.
I'm pretty sure we only ever append to the IntArraySet in monotonically increasing order, so clearly a better data structure could be used. In fact I think the only functionality we get from IntArraySet that we couldn't get from a simple list of integers is the ability to form the union of two sets by a merge operation. The cases where this is needed are probably unusual anyway.
Updated by Michael Kay over 7 years ago
I've implemented a variant of IntArraySet that only allows appending of new integers monotonically at the end, and this dramatically improves the timings. Running with -repeat:10 to eliminate Java warmup time, I'm now seeing tree building time of 74ms plus execution time averaging 4.14 seconds.
Without the repeat option I get
Saxon-EE 9.8.0.3J from Saxonica
Java version 1.8.0_121
Using license serial number V005192
URIResolver.resolve href="library.normalize.namespaces.xsl" base="file:/Users/mike/bugs/2017/nunes3348/normalize.namespaces.xsl"
Stylesheet compilation time: 681.370339ms
Processing file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml
Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/bugs/2017/nunes3348/BitsXML.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/mathml/mathml3/mathml3.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Tree built in 14.92303s (14923.030059ms)
Tree size: 772774 nodes, 5400351 characters, 254915 attributes
Execution time: 20.046453s (20046.453766ms)
Memory used: 1,058,588,880
I will now apply the same fix to the 9.7 branch, and regression test.
I haven't tried to study how this executed in 9.5 and 9.6. The code in the relevant area looks unchanged. I suspect that it wasn't attempting indexing at all for this expression (which implies that the indexing is not actually achieving anything useful: that's not surprising since there's probably a very small number of distinct namespaces, and therefore finding them by serial search is likely to be just as fast. But we can't know that...
Updated by Michael Kay over 7 years ago
- Status changed from In Progress to Resolved
- Applies to branch 9.7, 9.8 added
- Fix Committed on Branch 9.7, 9.8 added
Updated by O'Neil Delpratt over 7 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 9.7.0.20 added
Bug fix applied in the Saxon 9.7.0.20 maintenance release.
Updated by O'Neil Delpratt over 7 years ago
- Status changed from Resolved to Closed
- Fixed in Maintenance Release 9.8.0.4 added
Bug fix applied in the 9.8.0.4 maintenance release.
Please register to edit this issue