XSLT 3.0 grouping sample using streaming runs out of memory while non-streaming code does not

Added by Martin Honnen about 8 years ago

I am experimenting with streaming and tried to write an XSLT 3.0 stylesheet using streaming and @xsl:for-each-group group-by@ to identify and output those groups which only contain one item. To do that I use @xsl:iterate select="current-group()"@ and @xsl:break@ if @position@ is not equal to @1@:




	
	
	
	
	
	
	
		
			
				
					
						
						
							
						
						
							
								
									
								
							
							
								
							
						
					
				
			
		
	
	

The stylesheet compiles and runs fine for smaller input samples. For larger input samples, I had hoped that using @xsl:iterate select="current-group()"@ with @xsl:break@ would avoid assembling the complete groups of more than one item in memory, so that streamed processing would let me handle large documents.

So I created an input sample that favors the above test, using the following XQuery:


declare variable $n as xs:integer external := 100000;

let $n1 as xs:integer := $n
return 
{
for $i in 1 to $n1
return (
  data {$i}, 1
  ,
  
  data {$i}, 2
  ,
  
    data {$i}u
   
)
},
{for $j in 1 to $n1
return 
  for $k in 1 to 100
  return 
  data {$j}, e, {$k}

}

This creates an input sample of nearly 700MB.

I then tried to run Saxon-EE 9.7.0.8J from the command line against the sample and ran out of memory, even when I used the option @-Xmx10g@ to give the Java VM 10GB of heap space.

The output is:

Processing file:/E:/XSLT/Streaming/input-100000.xml
Streaming file:/E:/XSLT/Streaming/input-100000.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at net.sf.saxon.tree.tiny.TinyTree.<init>(TinyTree.java:186)
        at net.sf.saxon.tree.tiny.TinyBuilder.open(TinyBuilder.java:117)
        at net.sf.saxon.event.ProxyReceiver.open(ProxyReceiver.java:89)
        at com.saxonica.ee.stream.watch.WatchManager.startElement(WatchManager.java:217)
        at net.sf.saxon.event.StartTagBuffer.startContent(StartTagBuffer.java:236)
        at com.saxonica.ee.stream.ContentDetector.flush(ContentDetector.java:97)
        at com.saxonica.ee.stream.ContentDetector.startElement(ContentDetector.java:33)
        at net.sf.saxon.event.Stripper.startElement(Stripper.java:111)
        at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:305)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:451)
        at net.sf.saxon.event.Sender.send(Sender.java:179)
        at net.sf.saxon.Controller.transformStream(Controller.java:2490)
        at net.sf.saxon.Controller.transform(Controller.java:1840)
        at net.sf.saxon.s9api.Xslt30Transformer.applyTemplates(Xslt30Transformer.java:553)
        at net.sf.saxon.Transform.processFile(Transform.java:1239)
        at net.sf.saxon.Transform.doTransform(Transform.java:795)
        at net.sf.saxon.Transform.main(Transform.java:77)

I then tried to run a non-streaming XSLT 3.0 stylesheet




	
	
	
	
	
	
	
		
			
				
					
				
			
		
	
	

against the same input with the same memory settings, and it succeeded in building the tree, running the code and writing the output without problems.

So for that particular code it seems the streaming sample consumes more memory than a non-streaming sample.

Is there anything I can change in the streaming code to continue to pass the streamability analysis but consume less memory?


Replies (1)

RE: XSLT 3.0 grouping sample using streaming runs out of memory while non-streaming code does not - Added by Michael Kay about 8 years ago

With xsl:fork/xsl:for-each-group/@group-by, Saxon will always build the groups in memory. So the only benefit of streaming this code is if the total space occupied by the groups is significantly smaller than the space occupied by the source document as a whole, which is not true of this example.

Since you are only interested in the first item in each group, I would think the approach here is to incrementally build a map identifying the keys that have been encountered so far, and to process a new item only if its key is not in the map. This could probably be done either using xsl:iterate or using accumulators. Something like:


  
  
     
     
       
     
  
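The bookkeeping behind this suggestion can be illustrated outside XSLT. The following is only a minimal, language-neutral sketch in Python, not the code from this reply and not a Saxon API: the function name @first_per_key@ and the sample records are hypothetical, and a set plays the role of the incrementally built map. It makes a single pass over the items and retains only the keys seen so far, so memory grows with the number of distinct keys rather than with the total size of the groups.

```python
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def first_per_key(items: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield each item whose key has not been seen before, in input order."""
    seen = set()  # stands in for the map of keys encountered so far
    for item in items:
        k = key(item)
        if k not in seen:
            # First occurrence of this key: remember it and process the item.
            seen.add(k)
            yield item
        # Later items with the same key are skipped and never stored.


# Hypothetical records of the form (grouping key, payload).
records = [
    ("data 1", "a"),
    ("data 2", "b"),
    ("data 1", "c"),
    ("data 3", "d"),
    ("data 2", "e"),
]
firsts = list(first_per_key(records, key=lambda r: r[0]))
print(firsts)
```

Because the function is a generator, each item can be handled and discarded as soon as it is read, which is the property a streamed xsl:iterate with a map parameter would be relying on.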
