Streaming a document with a very large text node

Added by Michael Schäfer about 4 years ago

I'm using Saxon-EE 9.7.0.21 with Java 1.8.0_161.

Streaming a document of a little less than 2GB fails with either an OutOfMemoryError or a NegativeArraySizeException; which one occurs varies a bit with the max heap size setting (I can allocate up to 8GB).

This is the output when running Saxon with -t:

Saxon-EE 9.7.0.21J from Saxonica
Java version 1.8.0_161
Using license serial number S005664
Generating byte code...
Stylesheet compilation time: 827.3421ms
Processing file:/C:/Temp/elster/Zerleger/etc/../xml/ESTGS.ec3543kqamrf0432c90t00gag0rreo7m.20191220093719.a50743b4-fb90-4683-ada3-999c2e29b8ee.xml
Streaming file:/C:/Temp/elster/Zerleger/etc/../xml/ESTGS.ec3543kqamrf0432c90t00gag0rreo7m.20191220093719.a50743b4-fb90-4683-ada3-999c2e29b8ee.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
...left out custom messages
Loading net.sf.saxon.option.local.Numberer_de
...left out custom messages
Writing to file:/C:/Temp/elster/Zerleger/xml/DS-ESTGS1-20200131-113204.492-B5-base64.txt
java.lang.NegativeArraySizeException
        at net.sf.saxon.event.ReceivingContentHandler.characters(ReceivingContentHandler.java:479)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.characters(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:451)
        at net.sf.saxon.event.Sender.send(Sender.java:179)
        at net.sf.saxon.Controller.transformStream(Controller.java:2534)
        at net.sf.saxon.Controller.transform(Controller.java:1878)
        at net.sf.saxon.s9api.Xslt30Transformer.applyTemplates(Xslt30Transformer.java:565)
        at net.sf.saxon.Transform.processFile(Transform.java:1253)
        at net.sf.saxon.Transform.doTransform(Transform.java:795)
        at net.sf.saxon.Transform.main(Transform.java:77)
Fatal error during transformation: java.lang.NegativeArraySizeException:  (no message)

According to this, the document is streamed. I can provide the source code if necessary.

The transform's purpose is to extract two base64 payloads from the document; one is small and the other accounts for 99% of the document size.

My hope was that streaming would make this possible, so my question is whether the size of the text node constitutes a show-stopper here, and if so, whether there are ways to work around it.

Thanks in advance,

Michael


Replies (6)

RE: Streaming a document with a very large text node - Added by Vladimir Nesterovsky about 4 years ago

The transform's purpose is to extract two base64 payloads from the document; one is small and the other accounts for 99% of the document size.

I would dare to say that there is a chance that XSLT is not the best tool for such a task.

It might be better to use a SAX handler or even a plain text reader here.

RE: Streaming a document with a very large text node - Added by Michael Schäfer about 4 years ago

We started using non-streaming transforms for simplicity and with small payloads. Performance is not an issue, as we receive the documents at monthly intervals. But now the sender has switched to transmitting large files and refuses to go back, as it supposedly makes their workflow simpler.

So the hope was that streaming might be a solution. And possibly it could be, if the text node were created lazily, only when touched by a stylesheet instruction, and otherwise passed through to the output without loading the complete content into memory. But this is only an assumption of mine; if it holds, it might be an interesting optimisation.

RE: Streaming a document with a very large text node - Added by Michael Kay about 4 years ago

I was wondering how long it would be before we started hitting the 32-bit size limits in Java for strings and arrays...

Java has 32-bit limits all over the place, and Saxon doesn't make any systematic attempt to circumvent them. You're hitting one of them here: the contents of a single text node are assembled into a single char[] array, and that's blowing the 32-bit addressing limit on arrays. I'm afraid streaming doesn't help with that.

I had been hoping that Java would come up with workarounds for the 32 bit limits before they became a problem, but I don't see much evidence of that happening.

We do have a LargeStringBuffer object that enables us to hold the string value of a document when that exceeds 2GB; but we don't use it for assembling a single text node. We could, but we don't. The main reason is that there's no point tackling this piecemeal: if we're going to remove these limits then we need to do it comprehensively, which (a) is very hard to test, (b) requires extensive API changes, and (c) is likely to have an adverse impact on the 99% of users who don't need the feature. So we've been postponing the day.

RE: Streaming a document with a very large text node - Added by Michael Kay about 4 years ago

You might be able to make progress by splitting large text nodes into pieces. For example, you could insert a SAX filter between the XML parser and Saxon which inserts occasional empty processing instructions to prevent text nodes getting too large. You could then filter these out on the output pipeline. Or your filter could divert the text node content into some "out of band" reservoir, and replace it with a processing instruction that identifies its temporary location.
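A minimal sketch of such a filter, assuming a plain JAXP/SAX setup (the class name and the chunk size here are illustrative, not anything Saxon provides):

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    // Sketch: breaks up very large runs of character data by emitting an
    // empty processing instruction every CHUNK characters, so no single
    // text node downstream can approach the 2^31 char[] limit.
    public class TextSplittingFilter extends XMLFilterImpl {

        private static final int CHUNK = 16 * 1024 * 1024; // characters per piece
        private long charsSinceSplit = 0;

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            super.characters(ch, start, length);
            charsSinceSplit += length;
            if (charsSinceSplit >= CHUNK) {
                // The empty PI ends the current text node; remember to strip
                // these PIs out again on the output side.
                super.processingInstruction("split", "");
                charsSinceSplit = 0;
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            charsSinceSplit = 0;
            super.startElement(uri, localName, qName, atts);
        }
    }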

With the requirement being to extract a base64 chunk held in a monster text node, I think a SAX filter is probably the way to go. You've still got a challenge doing anything with the Base64 - most base64 decoders are probably going to have the same 32-bit limit on the input size (and perhaps on the output size as well). A streaming Base64 decoder sounds feasible - perhaps https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64InputStream.html does the job. But it works in pull mode (it reads an InputStream), whereas SAX works in push mode, so your problems aren't over. Writing a push mode streaming base64 decoder is probably only a few days work...
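A rough sketch of the push-mode decoding idea, using only java.util.Base64 from the JDK (all class and method names here are invented). It relies on the fact that Base64 is stateless across complete 4-character groups, so each complete group can be decoded as soon as it arrives, with any partial group carried over to the next call:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Base64;

    // Sketch of a push-mode streaming Base64 decoder: feed it character
    // chunks (e.g. from SAX characters() events); it decodes complete
    // 4-character groups as they arrive and carries the remainder over.
    public class PushBase64Decoder {

        private final OutputStream out;
        private final StringBuilder carry = new StringBuilder(4);
        private final Base64.Decoder decoder = Base64.getDecoder();

        public PushBase64Decoder(OutputStream out) {
            this.out = out;
        }

        public void push(char[] ch, int start, int length) throws IOException {
            StringBuilder buf = new StringBuilder(carry);
            for (int i = start; i < start + length; i++) {
                char c = ch[i];
                if (!Character.isWhitespace(c)) {   // payloads are often line-wrapped
                    buf.append(c);
                }
            }
            int whole = (buf.length() / 4) * 4;     // only decode complete quadruples
            out.write(decoder.decode(buf.substring(0, whole)));
            carry.setLength(0);
            carry.append(buf, whole, buf.length()); // keep the partial group
        }

        public void finish() throws IOException {
            if (carry.length() > 0) {               // trailing unpadded group, if any
                out.write(decoder.decode(carry.toString()));
            }
            out.flush();
        }
    }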

RE: Streaming a document with a very large text node - Added by Michael Schäfer about 4 years ago

Thank you very much for your valuable suggestions. I already feared that my thinking was overly simplistic. I'll try the SAX filter, combined with the decoder. Currently, decoding is done using shell commands after extracting the payload (on a Linux server). If we use a SAX filter to divert the payload, it makes sense to add decoding there as well, provided it's only a moderate extra effort.
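For the record, wiring such a filter between the parser and Saxon might look roughly like this (TextSplittingFilter is the hypothetical class sketched above, and the file names are placeholders):

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamSource;
    import net.sf.saxon.s9api.*;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class ExtractPayload {
        public static void main(String[] args) throws Exception {
            Processor proc = new Processor(true);            // EE, licensed copy
            XsltExecutable exec = proc.newXsltCompiler()
                    .compile(new StreamSource(new File("extract.xsl")));

            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);                     // required for XSLT input
            XMLReader parser = spf.newSAXParser().getXMLReader();

            TextSplittingFilter filter = new TextSplittingFilter();  // sketch above
            filter.setParent(parser);                        // parser -> filter -> Saxon

            SAXSource input = new SAXSource(filter, new InputSource("big-input.xml"));
            Serializer out = proc.newSerializer(new File("payload-base64.txt"));
            exec.load30().applyTemplates(input, out);
        }
    }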

By the way, the data are from a tax authority, so it's not really surprising they hit some limits ;)
