Bug #2271

closed

AIOOBE with large xml file

Added by Tomaž Erjavec about 10 years ago. Updated about 9 years ago.

Status: Closed
Priority: Normal
Category: Internals
Sprint/Milestone:
Start date: 2014-12-21
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 9.6
Fix Committed on Branch: 9.6
Fixed in Maintenance Release:
Platforms:

Description

Hi,

Saxon throws an ArrayIndexOutOfBoundsException (AIOOBE) when I try to process a large file, and this happens even with an empty stylesheet. I could understand it failing with an out-of-memory error, but not with an AIOOBE.

I'm using Saxon 9.6.0.3 (I tried with some older versions, same problem) with java 1.8.0_25:

Java(TM) SE Runtime Environment (build 1.8.0_25-b17)

Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Below is the trace.

All the best,

Tomaž

PS: I can send the file if it would help.

$ du -h blog.bug.xml
2,8G blog.bug.xml

$ java -jar /usr/local/bin/saxon9he.jar -xsl:empty.xsl blog.bug.xml > bug.vert
java.lang.ArrayIndexOutOfBoundsException: -32768
    at net.sf.saxon.tree.tiny.LargeStringBuffer.append(LargeStringBuffer.java:90)
    at net.sf.saxon.tree.tiny.TinyTree.appendChars(TinyTree.java:405)
    at net.sf.saxon.tree.tiny.TinyBuilder.makeTextNode(TinyBuilder.java:380)
    at net.sf.saxon.tree.tiny.TinyBuilder.characters(TinyBuilder.java:362)
    at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:544)
    at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:435)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
    at net.sf.saxon.event.Sender.send(Sender.java:171)
    at net.sf.saxon.Controller.transform(Controller.java:1690)
    at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
    at net.sf.saxon.Transform.processFile(Transform.java:1056)
    at net.sf.saxon.Transform.doTransform(Transform.java:659)
    at net.sf.saxon.Transform.main(Transform.java:80)
Fatal error during transformation: java.lang.ArrayIndexOutOfBoundsException: -32768

Actions #1

Updated by Michael Kay about 10 years ago

  • Status changed from New to In Progress

Thanks for reporting it. The relevant code has been unchanged for quite a while. Two immediate theories come to mind: (a) it's something to do with Java 1.8 (which we haven't tested on), and (b) it's some rarely-occurring boundary condition like an end tag occurring with an offset that's an exact multiple of 1Mb. The first theory can be tested by running on a different Java version, the second by sending us the source file.

Looking at the code, I can see it crashing if the total amount of character data in a document (after expanding entities) exceeds 2^31 characters. Could that be the case here?

I've looked at the problem of crossing the 2Gb threshold and it's not going to be easy (except in streaming mode) - there's no support in Java for strings or arrays beyond this size, so we would be unable to represent the string value of a document node using a Java String or CharSequence, which is a pretty daunting prospect. With Saxon-EE streaming you can process a source document above 2Gb and split it into multiple document trees, and that may be the best we can do.
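The 2^31 theory fits the reported index. As an arithmetic sketch only (this is not Saxon source code): assuming character offsets are held in a 32-bit int and each segment holds 2^16 characters, as described later in this thread, an offset that wraps past 2^31 - 1 maps to segment index -32768. The class and method names below are hypothetical.

```java
// Arithmetic sketch only; not Saxon source code.
class OffsetOverflowDemo {
    // Assumption: 2^16 characters per segment.
    static final int SEGMENT_BITS = 16;

    /** Segment index for a character offset held in a 32-bit int. */
    static int segmentIndex(int offset) {
        return offset >> SEGMENT_BITS; // arithmetic shift keeps the sign bit
    }

    public static void main(String[] args) {
        int lastValid = Integer.MAX_VALUE;           // offset 2^31 - 1
        int wrapped = lastValid + 1;                 // silently overflows to -2^31
        System.out.println(segmentIndex(lastValid)); // prints 32767
        System.out.println(segmentIndex(wrapped));   // prints -32768
    }
}
```

Java does not raise an error on int overflow, so without an explicit size check the first visible symptom would be the negative array index.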

Actions #2

Updated by Tomaž Erjavec about 10 years ago

Thanks for the quick answer. I don't think we have the old Java on the machine any more (our sys gurus have just done the major annual upgrade), so I can't test it myself (but maybe I will find a machine here that does have it).

I put the source files on http://nl.ijs.si/et/tmp/saxon/ (just please let me know when you have them, so I can remove them).

As you note, it's probably not a big deal, as it won't work on such a big file anyway, but still...

I will try with streaming mode, as soon as I figure out how to do it :)

Best,

Tomaž

PS: just had another crash reported with saxon on a web service - it looks like the version bundled with java 1.8 has problems with java 1.8. But that is probably something to report to Java folks, not you!

Actions #3

Updated by Michael Kay about 10 years ago

  • Found in version changed from 1.8.0_25 to 9.6

I've reproduced the problem, and it's nothing to do with Java 8. It's simply hitting the limit of 1G characters allowed in text nodes in a TinyTree document. I'm committing a patch that makes it fail cleanly when this limit is reached.

(The source document is 2.9Gb, and it appears to consist largely of text with very little markup).

It's possible that you would get further with the linked tree model (I tried it and ran out of memory). In theory you would be able to create the tree successfully, and would only hit problems if you try and get the string value of the root node (which would blow the Java limit for a String or CharSequence).

As memory gets larger we're probably going to have to think about how to handle larger source documents. It won't be easy unless Java gives us some help: we would have to avoid data structures involving large strings or arrays, and we would have to change some APIs, which would all be rather painful.
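A minimal sketch of what failing cleanly can look like, assuming a buffer built from a fixed number of fixed-size segments. This is not the committed patch: `BoundedSegmentBuffer` and its tiny limits are hypothetical, chosen so that the limit is easy to hit in a demo.

```java
// Hypothetical illustration; not Saxon's LargeStringBuffer.
class BoundedSegmentBuffer {
    private final int segmentSize;
    private final char[][] segments; // fixed-capacity table of segments
    private int used = 0;            // segments allocated so far
    private int fill = 0;            // characters stored in the newest segment

    BoundedSegmentBuffer(int segmentSize, int maxSegments) {
        this.segmentSize = segmentSize;
        this.segments = new char[maxSegments][];
    }

    void append(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (used == 0 || fill == segmentSize) {
                addSegment();
            }
            segments[used - 1][fill++] = text.charAt(i);
        }
    }

    private void addSegment() {
        // Check the limit explicitly instead of letting an index run off the table.
        if (used == segments.length) {
            throw new IllegalStateException(
                "Source document too large: more than " + capacity() + " characters in text nodes");
        }
        segments[used++] = new char[segmentSize];
        fill = 0;
    }

    long capacity() {
        return (long) segmentSize * segments.length;
    }

    long length() {
        return used == 0 ? 0 : (long) (used - 1) * segmentSize + fill;
    }
}
```

With `new BoundedSegmentBuffer(4, 2)`, appending eight characters succeeds and the ninth raises `IllegalStateException` with a message that names the limit.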

Actions #4

Updated by Tomaž Erjavec about 10 years ago

Thanks for running the tests and making the patch. OK, so the file is simply too big; still, it is nice, I guess, that a misleading error is no longer reported.

I tried processing with the linked tree model - after one hour my machine also ran out of heap space.

So, the moral is to work with smaller files, at least until Java handles larger structures.

All the best,

Tomaž
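For working with smaller files, one option outside Saxon is to split the input with the JDK's StAX API before transforming it. The sketch below is not Saxon-EE streaming. It assumes the document is a single root element with many sibling children (for example one element per text in a corpus), and `XmlChunkSplitter` and its parameters are hypothetical names. Each chunk wraps a group of children in a copy of the root start tag.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.function.IntFunction;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper; plain StAX event copying, no Saxon classes involved.
class XmlChunkSplitter {
    /**
     * Copies the children of the root element into chunk documents of at most
     * perChunk children each. sinkFor supplies the Writer for chunk number i.
     * Returns the number of chunks written. Text, comments and processing
     * instructions directly under the root are dropped.
     */
    static int split(Reader in, int perChunk, IntFunction<Writer> sinkFor)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        StartElement root = null;
        XMLEventWriter writer = null;
        int depth = 0, inChunk = 0, chunks = 0;
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                depth++;
                if (depth == 1) {            // remember the root, do not copy it yet
                    root = e.asStartElement();
                    continue;
                }
                if (depth == 2 && writer == null) {  // first child of a new chunk
                    writer = outputs.createXMLEventWriter(sinkFor.apply(chunks++));
                    writer.add(root);
                }
            }
            if (depth >= 2) {
                writer.add(e);               // copy everything inside a child
            }
            if (e.isEndElement()) {
                depth--;
                boolean chunkFull = depth == 1 && ++inChunk == perChunk;
                boolean inputDone = depth == 0 && writer != null;
                if (chunkFull || inputDone) {
                    QName n = root.getName();
                    writer.add(events.createEndElement(
                        n.getPrefix(), n.getNamespaceURI(), n.getLocalPart()));
                    writer.flush();
                    writer.close();          // does not close the underlying Writer
                    writer = null;
                    inChunk = 0;
                }
            }
        }
        return chunks;
    }
}
```

In real use `sinkFor` would open one file per chunk, and the caller would close those file writers afterwards, since closing the event writer leaves the underlying stream open.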

Actions #5

Updated by Michael Kay about 10 years ago

  • Status changed from In Progress to Resolved

I've been thinking a little about how one might tackle this, without relying on any Java changes.

It would be easy to change the LargeStringBuffer to support 2^31 segments of 2^16 characters, instead of 2^15 segments as at present. It wouldn't be possible to implement CharSequence properly, because CharSequence uses int offsets; but there's actually no great need for LargeStringBuffer to implement CharSequence. We would also need a variant of the TinyTree that uses longs instead of ints for the offsets into the buffer. (Actually, I don't think there's a really great need for the string value of the document to be held in pseudo-contiguous storage at all, but it would reduce the changes needed to keep it that way).
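A rough sketch of the idea in this paragraph: a segmented character buffer addressed by long offsets, which deliberately does not implement CharSequence. `LongSegmentedBuffer` is a hypothetical class, not Saxon's LargeStringBuffer; only the 2^16 segment size is taken from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of long-addressed segmented storage.
class LongSegmentedBuffer {
    private static final int SEGMENT_BITS = 16;
    private static final int SEGMENT_SIZE = 1 << SEGMENT_BITS; // 65536 chars
    private static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    private final List<char[]> segments = new ArrayList<>();
    private long length = 0; // total characters, not limited to int range

    void append(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            int within = (int) (length & SEGMENT_MASK);
            if (within == 0) {
                segments.add(new char[SEGMENT_SIZE]); // start a new segment
            }
            segments.get(segments.size() - 1)[within] = text.charAt(i);
            length++;
        }
    }

    char charAt(long offset) {
        if (offset < 0 || offset >= length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int segment = (int) (offset >>> SEGMENT_BITS);
        return segments.get(segment)[(int) (offset & SEGMENT_MASK)];
    }

    long length() {
        return length;
    }

    /** Extracts a range as a String; the range itself must fit int addressing. */
    String substring(long start, long end) {
        if (start > end || end - start > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("bad range " + start + ".." + end);
        }
        StringBuilder sb = new StringBuilder((int) (end - start));
        for (long i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

A tree built on such a buffer would store long start offsets for its text nodes, which is the TinyTree variant the comment mentions.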

If someone actually asks for the string value of the root node, which will be longer than 2^31 characters, we can return a StringValue that contains pointers into the LargeStringBuffer. That's no great problem at the XPath level. At the Java API level we would need to make changes to NodeInfo.getStringValue(), but I think we could do this by retaining this method and having it throw an exception if the string value is too long, and providing an alternative method for getting the string value without the 2^31 limit.
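The API shape described here could look roughly like the sketch below. `BigTextNode`, `getStringLength` and `copyStringValueTo` are hypothetical names, not part of Saxon's NodeInfo interface, and `RepeatedTextNode` is only a stand-in that reports a very long value without allocating it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical interface; not Saxon's NodeInfo.
interface BigTextNode {
    /** Length of the string value in chars; may exceed Integer.MAX_VALUE. */
    long getStringLength();

    /** Alternative accessor with no int limit: streams the value to a sink. */
    void copyStringValueTo(Appendable out) throws IOException;

    /** Retained method: fails with a clear message if the value is too long. */
    default String getStringValue() {
        long n = getStringLength();
        if (n > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                "String value too long for a Java String: " + n + " characters");
        }
        StringBuilder sb = new StringBuilder((int) n);
        try {
            copyStringValueTo(sb);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}

// Stand-in node whose value is one chunk repeated many times.
class RepeatedTextNode implements BigTextNode {
    private final String chunk;
    private final long repeats;

    RepeatedTextNode(String chunk, long repeats) {
        this.chunk = chunk;
        this.repeats = repeats;
    }

    @Override
    public long getStringLength() {
        return chunk.length() * repeats; // computed in long arithmetic
    }

    @Override
    public void copyStringValueTo(Appendable out) throws IOException {
        for (long i = 0; i < repeats; i++) {
            out.append(chunk);
        }
    }
}
```

Existing callers of `getStringValue()` keep working for ordinary documents, and only code that needs oversized values has to move to the streaming accessor.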

In 9.6 we've started making greater use of the interface UnicodeString which provides direct addressing into strings using Unicode codepoint counting rather than 16-bit char counting. This also has the advantage that strings using only Latin-1 characters only need 8 bits per character. We could easily extend this interface to use longs rather than ints for codepoint addressing, and we could underpin it with something like the LargeStringBuffer data structure to bypass Java limits on string and array sizes. So I think it's do-able.

(Just been looking at the specs for current MacBooks and I'm actually slightly surprised that they're not very much higher than my early-2011 model. Perhaps things are reaching a plateau? Who knows.)

Actions #6

Updated by Tomaž Erjavec about 10 years ago

I won't pretend to understand the Saxon data structures and functions that need to be modified but, yes, for me it would certainly be nice if I could feed large files to XSLT. I'm working with corpora, where there are lots of conversions to be done, and working on the whole corpus / file is of course very convenient (it's not such a big thing to split it, but it's one more layer of complication). And my server has 47G of memory, so no problem there.

So, very glad that it seems doable, and fingers crossed that a dark cold winter day comes along to provide the opportunity to do it :)

All the best,

Tomaž

Actions #7

Updated by O'Neil Delpratt about 10 years ago

  • % Done changed from 0 to 100
  • Fixed in version set to 9.6.0.4

Bug fix applied in the Saxon 9.6.0.4 maintenance release.

Actions #8

Updated by O'Neil Delpratt about 10 years ago

  • Status changed from Resolved to Closed

Actions #9

Updated by Ryan Baumann over 9 years ago

Is this really fixed, or is it actually "won't fix"? When I run XSLT against a document which previously resulted in this error under Saxon-HE 9.6.0.5, I now just get the error:

java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes
        at net.sf.saxon.tree.tiny.LargeStringBuffer.addSegment(LargeStringBuffer.java:58)
        at net.sf.saxon.tree.tiny.LargeStringBuffer.append(LargeStringBuffer.java:120)
        at net.sf.saxon.tree.tiny.TinyTree.appendChars(TinyTree.java:405)
        at net.sf.saxon.tree.tiny.TinyBuilder.makeTextNode(TinyBuilder.java:380)
        at net.sf.saxon.tree.tiny.TinyBuilder.characters(TinyBuilder.java:362)
        at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:544)
        at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:435)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1781)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
        at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
        at net.sf.saxon.event.Sender.send(Sender.java:171)
        at net.sf.saxon.Controller.transform(Controller.java:1692)
        at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
        at net.sf.saxon.Transform.processFile(Transform.java:1056)
        at net.sf.saxon.Transform.doTransform(Transform.java:659)
        at net.sf.saxon.Transform.main(Transform.java:80)
Fatal error during transformation: java.lang.IllegalStateException: Source document too large: more than 1G characters in text nodes

This is reproducible by running even an identity transform against a large XML file (e.g. enwiktionary-latest-pages-meta-current.xml.bz2 from https://dumps.wikimedia.org/enwiktionary/latest/).

Actions #10

Updated by O'Neil Delpratt over 9 years ago

This looks like a won't fix issue after reading comment #5. I will ask Mike to confirm.

Actions #11

Updated by Michael Kay over 9 years ago

It was "resolved/fixed" in the sense that we now produce an error message saying that Saxon limits have been exceeded, rather than an AIOOBE. Anything more than this would be a significant enhancement, not a simple bug fix.

As reported, I did look at the possibility of supporting larger documents beyond the 2G limit. With strings and arrays in Java limited to 32-bit int addressing, it's not at all easy. The limits for a LinkedTree would probably be rather higher than for the TinyTree, which tries to reduce the number of Java objects and therefore ends up with objects whose size approximates to the document size. It might be worth seeing how far you get with the Linked Tree: the actual amount of memory used will be higher, but I don't think you'll hit the same limits.

Actions #12

Updated by O'Neil Delpratt about 9 years ago

  • Sprint/Milestone set to 9.6.0.4
  • Applies to branch 9.6 added
  • Fix Committed on Branch 9.6 added
  • Fixed in Maintenance Release 9.6.0.4 added

Actions #13

Updated by O'Neil Delpratt about 9 years ago

  • Sprint/Milestone changed from 9.6.0.4 to 9.6.0.3
  • Fixed in Maintenance Release 9.6.0.3 added
  • Fixed in Maintenance Release deleted (9.6.0.4)

Actions #14

Updated by O'Neil Delpratt about 9 years ago

  • Sprint/Milestone changed from 9.6.0.3 to 9.6.0.4
  • Fixed in Maintenance Release 9.6.0.4 added
  • Fixed in Maintenance Release deleted (9.6.0.3)
