Bug #2271 — AIOOBE with large xml file
Status: Closed
% Done: 100%
Description
Hi,
Saxon gives me an ArrayIndexOutOfBoundsException when I try to process a large file, and this happens even with an empty stylesheet. I can understand that it might not work, but then I would expect an exception saying out of memory, not an AIOOBE.
I'm using Saxon 9.6.0.3 (I tried with some older versions, same problem) with java 1.8.0_25:
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Below is the trace.
All the best,
Tomaž
PS: I can send the file if it would help.
$ du -h blog.bug.xml
2,8G blog.bug.xml
$ java -jar /usr/local/bin/saxon9he.jar -xsl:empty.xsl blog.bug.xml > bug.vert
java.lang.ArrayIndexOutOfBoundsException: -32768
at net.sf.saxon.tree.tiny.LargeStringBuffer.append(LargeStringBuffer.java:90)
at net.sf.saxon.tree.tiny.TinyTree.appendChars(TinyTree.java:405)
at net.sf.saxon.tree.tiny.TinyBuilder.makeTextNode(TinyBuilder.java:380)
at net.sf.saxon.tree.tiny.TinyBuilder.characters(TinyBuilder.java:362)
at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:544)
at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:435)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
at net.sf.saxon.event.Sender.send(Sender.java:171)
at net.sf.saxon.Controller.transform(Controller.java:1690)
at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
at net.sf.saxon.Transform.processFile(Transform.java:1056)
at net.sf.saxon.Transform.doTransform(Transform.java:659)
at net.sf.saxon.Transform.main(Transform.java:80)
Fatal error during transformation: java.lang.ArrayIndexOutOfBoundsException: -32768
Updated by Michael Kay about 10 years ago
- Status changed from New to In Progress
Thanks for reporting it. The relevant code has been unchanged for quite a while. Two immediate theories come to mind: (a) it's something to do with Java 1.8 (which we haven't tested on), and (b) it's some rarely-occurring boundary condition like an end tag occurring with an offset that's an exact multiple of 1Mb. The first theory can be tested by running on a different Java version, the second by sending us the source file.
Looking at the code, I can see it crashing if the total amount of character data in a document (after expanding entities) exceeds 2^31 characters. Could that be the case here?
I've looked at the problem of crossing the 2Gb threshold and it's not going to be easy (except in streaming mode) - there's no support in Java for strings or arrays beyond this size, so we would be unable to represent the string value of a document node using a Java String or CharSequence, which is a pretty daunting prospect. With Saxon-EE streaming you can process a source document above 2Gb and split it into multiple document trees, and that may be the best we can do.
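For illustration, a minimal sketch of the kind of streamable Saxon-EE stylesheet this suggests. The element name "record" and the output file names are hypothetical stand-ins, and whether a particular stylesheet is actually streamable depends on its own structure:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <!-- Declare the default mode streamable: the source is read in a single
       forward pass instead of being built as an in-memory tree -->
  <xsl:mode streamable="yes"/>
  <!-- Burst the document into one output file per repeating child element.
       "record" is a hypothetical name for whatever element the real source repeats -->
  <xsl:template match="/*">
    <xsl:for-each select="record">
      <xsl:result-document href="part-{position()}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

This would be run with the EE jar, e.g. java -jar saxon9ee.jar -xsl:split.xsl blog.bug.xml (the jar and stylesheet names here are placeholders).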
Updated by Tomaž Erjavec almost 10 years ago
Thanks for the quick answer. I don't think we have the old Java on the machine any more (it was our sys gurus who just made the major annual upgrade), so I can't test it myself (but maybe I will find a machine here that does have it).
I put the source files on http://nl.ijs.si/et/tmp/saxon/ (just please let me know when you have them, so I can remove them).
As you note, it's probably not a big thing, as it wouldn't work on such a big file anyway; still...
I will try with streaming mode, as soon as I figure out how to do it :)
Best,
Tomaž
PS: just had another crash reported with saxon on a web service - it looks like the version bundled with java 1.8 has problems with java 1.8. But that is probably something to report to Java folks, not you!
Updated by Michael Kay almost 10 years ago
- Found in version changed from 1.8.0_25 to 9.6
I've reproduced the problem, and it's nothing to do with Java 8. It's simply hitting the limit of 1G characters allowed in text nodes in a TinyTree document. I'm committing a patch that makes it fail cleanly when this limit is reached.
(The source document is 2.9Gb, and it appears to consist largely of text with very little markup).
It's possible that you would get further with the linked tree model (I tried it and ran out of memory). In theory you would be able to create the tree successfully, and would only hit problems if you try and get the string value of the root node (which would blow the Java limit for a String or CharSequence).
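For anyone wanting to repeat that experiment: the tree model can be chosen on the command line, so something like the following (a sketch assuming the -tree:linked option, applied to the original invocation; the heap size is illustrative) selects the linked tree:

java -Xmx40g -jar /usr/local/bin/saxon9he.jar -tree:linked -xsl:empty.xsl blog.bug.xml > bug.vert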
As memory gets larger we're probably going to have to think about how to handle larger source documents. It won't be easy unless Java gives us some help: we would have to avoid data structures involving large strings or arrays, and we would have to change some APIs, which would all be rather painful.
Updated by Tomaž Erjavec almost 10 years ago
Thanks for running the tests and making the patch. OK, so the file is simply too big; still, it's nice that a misleading error is no longer reported.
I tried processing with the linked tree model - after one hour my machine also ran out of heap space.
So the moral is to work with smaller files, at least until Java handles larger structures.
All the best,
Tomaž
On 22.12.2014 at 18:37, Saxonica Developer Community wrote:
Updated by Michael Kay almost 10 years ago
- Status changed from In Progress to Resolved
I've been thinking a little about how one might tackle this, without relying on any Java changes.
It would be easy to change the LargeStringBuffer to support 2^31 segments of 2^16 characters, instead of 2^15 segments as at present. It wouldn't be possible to implement CharSequence properly, because CharSequence uses int offsets; but there's actually no great need for LargeStringBuffer to implement CharSequence. We would also need a variant of the TinyTree that uses longs instead of ints for the offsets into the buffer. (Actually, I don't think there's a really great need for the string value of the document to be held in pseudo-contiguous storage at all, but it would reduce the changes needed to keep it that way).
If someone actually asks for the string value of the root node, which will be longer than 2^31 characters, we can return a StringValue that contains pointers into the LargeStringBuffer. That's no great problem at the XPath level. At the Java API level we would need to make changes to NodeInfo.getStringValue(), but I think we could do this by retaining this method and having it throw an exception if the string value is too long, and providing an alternative method for getting the string value without the 2^31 limit.
In 9.6 we've started making greater use of the interface UnicodeString which provides direct addressing into strings using Unicode codepoint counting rather than 16-bit char counting. This also has the advantage that strings using only Latin-1 characters only need 8 bits per character. We could easily extend this interface to use longs rather than ints for codepoint addressing, and we could underpin it with something like the LargeStringBuffer data structure to bypass Java limits on string and array sizes. So I think it's do-able.
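To make the shape of that idea concrete, here is a minimal illustrative sketch (not Saxon source code; the class name, segment size, and method names are all invented for the example) of a segmented character buffer addressed with long offsets rather than ints:

import java.util.ArrayList;
import java.util.List;

public class SegmentedCharBuffer {
    private static final int BITS = 16;                  // 2^16 chars per segment (arbitrary choice)
    private static final int SEGMENT_SIZE = 1 << BITS;
    private static final int MASK = SEGMENT_SIZE - 1;

    private final List<char[]> segments = new ArrayList<>();
    private long length = 0;

    // Append characters, allocating a new fixed-size segment whenever the current one fills up
    public void append(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            int seg = (int) (length >> BITS);             // segment index still fits in an int
            if (seg == segments.size()) {
                segments.add(new char[SEGMENT_SIZE]);
            }
            segments.get(seg)[(int) (length & MASK)] = s.charAt(i);
            length++;
        }
    }

    // Random access by long offset - deliberately not the CharSequence contract,
    // which is limited to int positions
    public char charAt(long index) {
        return segments.get((int) (index >> BITS))[(int) (index & MASK)];
    }

    public long length() {
        return length;
    }
}

The point is that the segment indexes stay within int range (so ordinary Java arrays and lists work underneath), while the externally visible offsets are longs, moving the 2^31 ceiling out of the way.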
(Just been looking at the specs for current MacBooks and I'm actually slightly surprised that they're not very much higher than my early-2011 model. Perhaps things are reaching a plateau? Who knows.)
Updated by Tomaž Erjavec almost 10 years ago
I won't pretend to understand the Saxon data structures and functions that would need to be modified but, yes, for me it would certainly be nice if I could feed large files to XSLT. I'm working with corpora, where there are lots of conversions to be done, and working on the whole corpus / file is of course very convenient (it's not such a big thing to split it, but it's one more layer of complication). And my server has 47G of memory, so no problem there.
So, very glad that it seems doable, and fingers crossed that a dark cold winter day comes along to provide the opportunity to do it :)
All the best,
Tomaž
On 24.12.2014 at 0:06, Saxonica Developer Community wrote:
Updated by O'Neil Delpratt almost 10 years ago
- % Done changed from 0 to 100
- Fixed in version set to 9.6.0.4
Bug fix applied in the Saxon 9.6.0.4 maintenance release.
Updated by O'Neil Delpratt almost 10 years ago
- Status changed from Resolved to Closed
Updated by Ryan Baumann over 9 years ago
Is this really fixed, or is it actually "won't fix"? When I run an XSLT transformation against a document which previously resulted in this error, under Saxon-HE 9.6.0.5 I now just get the error:
java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes
at net.sf.saxon.tree.tiny.LargeStringBuffer.addSegment(LargeStringBuffer.java:58)
at net.sf.saxon.tree.tiny.LargeStringBuffer.append(LargeStringBuffer.java:120)
at net.sf.saxon.tree.tiny.TinyTree.appendChars(TinyTree.java:405)
at net.sf.saxon.tree.tiny.TinyBuilder.makeTextNode(TinyBuilder.java:380)
at net.sf.saxon.tree.tiny.TinyBuilder.characters(TinyBuilder.java:362)
at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:544)
at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:435)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1781)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
at net.sf.saxon.event.Sender.send(Sender.java:171)
at net.sf.saxon.Controller.transform(Controller.java:1692)
at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
at net.sf.saxon.Transform.processFile(Transform.java:1056)
at net.sf.saxon.Transform.doTransform(Transform.java:659)
at net.sf.saxon.Transform.main(Transform.java:80)
Fatal error during transformation: java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes
This is reproducible by running even an identity transform against a large XML file (e.g. enwiktionary-latest-pages-meta-current.xml.bz2 from https://dumps.wikimedia.org/enwiktionary/latest/).
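The identity transform referred to is the standard one, which copies every node through unchanged; even this minimal stylesheet fails, because the source tree is built in full before any templates run:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Copy every attribute and node through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>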
Updated by O'Neil Delpratt over 9 years ago
After reading comment #5, this looks like a "won't fix" issue. I will ask Mike to confirm.
Updated by Michael Kay over 9 years ago
It was 'resolved/fixed" in the sense that we now produce an error message saying that Saxon limits have been exceeded, rather than an AIOOBE exception. Anything more than this would be a significant enhancement, not a simple bug fix.
As reported, I did look at the possibility of supported larger documents beyond the 2G limit. With strings and arrays in Java limited to 32-bit int addressing, it's not at all easy. The limits for a LinkedTree would probably be rather higher than for the TinyTree, which tries to reduce the number of Java objects and therefore ends up with objects whose size appriximates to the document size. It might be worth seeing how far you get with the Linked Tree: the actual amount of memory used will be higher, but I don't think you'll hit the same limits.
Updated by O'Neil Delpratt about 9 years ago
- Sprint/Milestone set to 9.6.0.4
- Applies to branch 9.6 added
- Fix Committed on Branch 9.6 added
- Fixed in Maintenance Release 9.6.0.4 added
Updated by O'Neil Delpratt about 9 years ago
- Sprint/Milestone changed from 9.6.0.4 to 9.6.0.3
- Fixed in Maintenance Release 9.6.0.3 added
- Fixed in Maintenance Release deleted (9.6.0.4)
Updated by O'Neil Delpratt about 9 years ago
- Sprint/Milestone changed from 9.6.0.3 to 9.6.0.4
- Fixed in Maintenance Release 9.6.0.4 added
- Fixed in Maintenance Release deleted (9.6.0.3)