Streaming output

Added by Hans-Jürgen Rennau about 1 month ago

Upfront: apologies if this is not the right place to ask my question.

I am a dummy in streaming matters, and my habitual approach of just reading the spec is discouraged by the mass and (apparently) difficult nature of the material. So perhaps someone can help me out with a tip.

Simple question: how to achieve streaming output, that is, output a document too large for memory? For example, one might want to annotate a huge input document, adding some attributes.

I had hoped for a feature of xsl:result-document, but could not find one. Perhaps the output must be generated within some streaming instruction?

Tips much appreciated! Thank you.

PS: Is there somewhere - or would it make sense to create - a substantial streaming tutorial, perhaps comparable with the W3C's XSD primer? Fairly comprehensive, enabling genuine understanding and practical expertise without reading the spec?


Replies (10)

RE: Streaming output - Added by Michael Kay about 1 month ago

Saxon will always attempt to stream the output when it can: you don't need to do anything special, other than ensuring that you choose a streamed output destination for the transformation (for example a JAXP StreamResult or a s9api Serializer). This is true (and has always been true) regardless of whether the input is streamed.

There are some things that will prevent output being streamed, the main one being try/catch (xsl:try has to buffer output in memory so it can be discarded in the event of a failure). The only other examples that come to mind are xsl:on-empty / xsl:on-non-empty and xsl:fork.

There were changes in the 10.x release to improve the output buffering of constructs that can generate large single text nodes, for example fn:concat(), fn:string-join(), fn:unparsed-text(), fn:xml-to-json(). These functions are now capable of executing in "push mode", writing their output incrementally to the serialized result tree, rather than accumulating it in memory.
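As a rough illustration of what "a streamed output destination" can look like from Java, here is a minimal sketch using only the standard JAXP API. The class name, method name, and the inline XSLT 1.0 stylesheet are hypothetical and chosen so the example runs with whatever JAXP processor is on the classpath. Whether the output is actually written incrementally depends on the processor; the sketch only shows how a StreamResult is supplied as the destination.

```java
import java.io.OutputStream;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StreamedOutputSketch {

    // Hypothetical stylesheet, kept at XSLT 1.0 so any JAXP processor can run it.
    static final String STYLESHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<wiki><xsl:for-each select='*/page'>"
      + "<page title='{substring(title, 1, 3)}'/>"
      + "</xsl:for-each></wiki>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms the given XML and serializes the result to the supplied stream.
    static void transform(String xml, OutputStream out) throws TransformerException {
        // Which implementation is returned depends on the classpath (assumption: any will do here).
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // A StreamResult asks for serialized output on a stream,
        // as opposed to an in-memory result tree such as a DOMResult.
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // For a large result, a FileOutputStream would be the natural choice instead of System.out.
        transform("<wiki><page><title>Streaming</title></page></wiki>", System.out);
    }
}
```

The destination is the only part that matters here: the stylesheet and input are deliberately tiny and are read from strings purely to keep the sketch self-contained.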

RE: Streaming output - Added by Michael Kay about 1 month ago

As regards tutorials, I think Abel Braaksma has published some good conference papers.

RE: Streaming output - Added by Hans-Jürgen Rennau about 1 month ago

Thank you very much, Michael. You wrote - "a streamed output destination for the transformation". Can I achieve this using the command-line interface, so that the tool is pure XSLT?

RE: Streaming output - Added by Michael Kay about 1 month ago

Sure, when running from the command line the output document is always serialized, so this always applies.

RE: Streaming output - Added by Hans-Jürgen Rennau about 1 month ago

Ah - good to know, thank you! Remark: when running within the Oxygen IDE, the output is apparently not streamed. I suppose the Oxygen team can say something about configuration options required for streamed output.

RE: Streaming output - Added by Martin Honnen about 1 month ago

As for an introduction to streaming in the form of a primer, does the section https://www.saxonica.com/html/documentation/sourcedocs/streaming/ in the Saxon documentation help perhaps? Or did you already read it and it did not help at all?

RE: Streaming output - Added by Hans-Jürgen Rennau about 1 month ago

Thank you for this tip, Martin - this is certainly a good start, which I had overlooked!

[A thorough tutorial might nevertheless make sense and fill what appears to me to be a gap.]

RE: Streaming output - Added by Hans-Jürgen Rennau about 1 month ago

Attempting a trivial transformation, I get this error:

Error on line 48837274 column 47 of foo.xml:
  SXXP0003   Error reported by XML parser: JAXP00010004: The accumulated size of entities
  is "50,000,001" that exceeded the "50,000,000" limit set by
  "FEATURE_SECURE_PROCESSING".: JAXP00010004: The accumulated size of
  entities is "50,000,001" that exceeded the "50,000,000" limit set by
  "FEATURE_SECURE_PROCESSING".
org.xml.sax.SAXParseException; systemId: file:/C:/products/saxon/result.xml; lineNumber: 48837274; columnNumber: 47; JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

FEATURE_SECURE_PROCESSING? Does anyone have a recipe for what to do?

Using SaxonEE10-3J, command-line interface. Stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output indent="true"/>
    <xsl:mode streamable="yes" on-no-match="deep-skip"/>
    
    <xsl:template match="/">   
        <wiki>
            <xsl:iterate select="*/page">
                <page title="{substring(title, 1, 3)}"/>
            </xsl:iterate>
        </wiki>
    </xsl:template>
    
</xsl:stylesheet>

RE: Streaming output - Added by Michael Kay about 1 month ago

This page:

https://github.com/elastic/stream2es/issues/65

suggests setting the system property "jdk.xml.totalEntitySizeLimit (works for me using Java 8) or just totalEntitySizeLimit if that doesn't work".

I'm afraid XML parsers and their limits are not something we have any control over.

RE: Streaming output - Added by Hans-Jürgen Rennau about 1 month ago

Thank you very much, Michael - this solved the problem.

To give a concrete example: I added this parameter to the command-line call:

-Djdk.xml.totalEntitySizeLimit=921000000
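When the transformation is launched from Java code rather than from the command line, the same system property can presumably be set programmatically. The sketch below assumes that the JDK's built-in parser reads the property when the parser is created, so it sets the value first. The class and method names are hypothetical, and the limit value is simply the one quoted above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class EntityLimitSketch {

    // Property name as quoted in the thread.
    static final String LIMIT_PROPERTY = "jdk.xml.totalEntitySizeLimit";

    // Sets the limit and returns the previous value (null if it was not set).
    static String setTotalEntitySizeLimit(long limit) {
        return System.setProperty(LIMIT_PROPERTY, Long.toString(limit));
    }

    // Parses the document with a freshly created SAX parser and discards all events.
    static void parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                     new DefaultHandler());
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the property must be in place before the parser is created.
        setTotalEntitySizeLimit(921_000_000L);
        parse("<wiki><page><title>Streaming</title></page></wiki>");
        System.out.println(LIMIT_PROPERTY + "=" + System.getProperty(LIMIT_PROPERTY));
    }
}
```

Setting the property on the command line, as shown above, has the advantage that no code needs to change; the programmatic form is only relevant when the JVM arguments are not under your control.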
