Processing Lots of Input Docs Efficiently

Added by Anonymous about 16 years ago

Legacy ID: #4938142 Legacy Poster: W. Eliot Kimber (drmacro)

I'm trying to process a very large collection of DITA topics (tens of thousands), ultimately organized by a single root DITA map. I'm using the normal DITA Open Toolkit dita2html transform, which operates on a single input topic and generates a single result topic. Using just the normal Open Toolkit Ant scripts, the process would blow up with any amount of memory I could allocate (max 2 GB on a 32-bit machine). I am using Saxon 9 (even though the current Open Toolkit style sheets are all XSLT 1).

I finally realized that the XSLT itself is simple and shouldn't cause any memory problems. Digging into the details of the processing, I realized that the Toolkit first builds a list of topics to be processed (by processing the input DITA maps) and then uses that list in an Ant <xslt> task like so:

<xslt processor="trax" basedir="${dita.temp.dir}" destdir="${output.dir}" includes="${fullditatopiclist}" style="${args.xsl}">

where ${fullditatopiclist} is, in my case, a very long list indeed. So obviously the memory problem is actually a side effect of using Ant in this way. Hmph.

My question is: given that I have my own code that can get the list of topics and do whatever I want with those from my own Java code, which can include running a different form of Ant task, calling the transformer directly, or running a command-line app, what is the best way to process this large set of files so as to minimize the memory required and also maximize performance?

Or maybe the better question is: why does Ant have a memory problem here? My naive guess would be that it's creating a new Transformer instance for each file. Does the Saxon-specific Ant task avoid that? (I would assume it does, but I haven't been able to try it yet.) Or should I be managing the use of the Transformer in my own Java code? (Currently I'm just using Ant from my Java code to invoke the out-of-the-box Toolkit Ant scripts.)

Thanks,
Eliot


Replies (3)

RE: Processing Lots of Input Docs Efficiently - Added by Anonymous about 16 years ago

Legacy ID: #4938216 Legacy Poster: Michael Kay (mhkay)

>Or maybe the better question is: why does Ant have a memory problem here? My naive guess would be that it's creating a new Transformer instance for each file. Does the Saxon-specific Ant task avoid that (I would assume it does but I haven't been able to try it yet).

I think it's more likely to be the other way around: they are reusing the same Transformer instance. Generally this is good practice with Xalan and bad practice with Saxon. Saxon assumes that if you reuse the Transformer, this is probably because you want to hold on to the resources associated with the Transformer, in particular the set of documents that are loaded in memory. If you want to release the resources, by far the best way is to create a new Transformer for each transformation. (Of course you should always reuse the Templates object, which represents the compiled stylesheet.)

Clearly, if you drive the transformations from your own Java code, then you have much more control over such things.

Michael Kay
Saxonica
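
For readers who want to see the shape of that advice in code, here is a minimal sketch using only the standard JAXP interfaces: compile the stylesheet once into a Templates object, then create a fresh Transformer for each input file. The class name BatchTransform, the method transformAll, and the rule for naming output files are assumptions made for illustration; they come from neither the Open Toolkit nor Saxon. TransformerFactory.newInstance() returns whichever JAXP implementation is configured on the classpath.

```java
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.util.List;

// Hypothetical helper class; not part of the DITA Open Toolkit or Saxon.
public class BatchTransform {

    /**
     * Transforms every input file with the same compiled stylesheet,
     * using a new Transformer per file. Returns the number of files written.
     */
    public static int transformAll(File stylesheet, List<File> inputs, File outputDir)
            throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Compile the stylesheet once; Templates is the reusable compiled form.
        Templates templates = factory.newTemplates(new StreamSource(stylesheet));

        int count = 0;
        for (File input : inputs) {
            // A fresh Transformer per document, so nothing held by the
            // previous transformation is carried over to the next one.
            Transformer transformer = templates.newTransformer();

            // Assumed naming rule: replace the input extension with ".html".
            String baseName = input.getName().replaceFirst("\\.[^.]*$", "");
            File output = new File(outputDir, baseName + ".html");

            transformer.transform(new StreamSource(input), new StreamResult(output));
            count++;
        }
        return count;
    }
}
```

The list of input files would come from whatever code already resolves the topics from the root map. Whether this is enough to keep memory flat for a given collection is something to measure rather than assume.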

RE: Processing Lots of Input Docs Efficiently - Added by Anonymous about 16 years ago

Legacy ID: #4939029 Legacy Poster: W. Eliot Kimber (drmacro)

Thanks for the tip--I'll explore doing the processing myself.

Would it be possible to add a note about the memory management characteristics of Transform to the Javadocs? I looked there but didn't see any obvious guidance there.

Thanks,
Eliot

RE: Processing Lots of Input Docs Efficiently - Added by Anonymous about 16 years ago

Legacy ID: #4940773 Legacy Poster: Michael Kay (mhkay)

>Would it be possible to add a note about the memory management characteristics of Transform to the Javadocs? I looked there but didn't see any obvious guidance there.

It's very easy to add such a note, but it's very hard to put it somewhere where it will be found by people who need to see it. This is particularly true, I think, when you're implementing a method defined in some external API: people look at the Javadoc for the interfaces and not for their implementing classes.

There's currently information at http://www.saxonica.com/documentation/javadoc/net/sf/saxon/Controller.html#clearDocumentPool()

Michael Kay
