parsing big documents with Saxon
Added by Anonymous over 19 years ago
Legacy ID: #3120839 Legacy Poster: sLabriki (slabriki)
Hi, first I want to apologize for my English. I have downloaded the evaluation copy from saxonica.com. For my project, I need to parse files of more than 800 MB. I tried this command: java -jar saxon8.jar source.xml transform.xslt > result.xml But it doesn't work. The error message is: Exception in thread "main" java.lang.OutOfMemoryError I also tried: java -jar -ms256m -mx256m (and with other values), but when I retry the Saxon command the error message appears again. So how can I parse big documents with Saxon? What is the solution? Thank you in advance. Salim Labriki Switzerland
Replies (3)
RE: parsing big documents with Saxon - Added by Anonymous over 19 years ago
Legacy ID: #3120874 Legacy Poster: Michael Kay (mhkay)
Usually you can get the transformation to succeed if you allocate memory five times the size of the source document. In your case that would be 4 Gbytes, which I suspect is not feasible. I've seen people process up to 150 Mb successfully, but 800 Mb is certainly stretching the limits. The only way of tackling this is a preprocessing step that splits the document into smaller pieces and then transforms each piece separately. That usually means writing a SAX filter in Java. Another approach might be to use STX, or to write the whole thing as a SAX application - it depends very much on the transformation you need to perform. I'm hoping over the next few weeks to experiment with the new version of Sun's pull parser to see if this kind of preprocessing can be automated. If you are not in a desperate hurry and would like to work with me on testing this, please get in touch off-list. Michael Kay
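[Editor's note: the splitting step described above might be sketched as follows. This is a minimal, hypothetical example, not Michael Kay's actual code: it assumes the document's top-level children are independent records that can be transformed separately, and the class name, output file names, and escaping choices are all made up.]

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.*;

// Hypothetical SAX splitter: streams the big document and writes each
// child element of the root to its own file (part0.xml, part1.xml, ...),
// so each part can then be transformed on its own with a small tree.
public class Splitter extends DefaultHandler {
    private Writer out;
    private int depth = 0;   // 0 = outside root, 1 = at root level, 2+ = inside a part
    private int part = 0;

    // Re-escape characters that must not appear literally when re-serializing.
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        try {
            if (depth == 1) {   // each child of the root starts a new part file
                out = new BufferedWriter(new FileWriter("part" + (part++) + ".xml"));
            }
            if (depth >= 1) {
                out.write("<" + qName);
                for (int i = 0; i < atts.getLength(); i++)
                    out.write(" " + atts.getQName(i) + "=\"" + esc(atts.getValue(i)) + "\"");
                out.write(">");
            }
            depth++;
        } catch (IOException e) { throw new SAXException(e); }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        try {
            depth--;
            if (depth >= 1) out.write("</" + qName + ">");
            if (depth == 1) { out.close(); out = null; }   // part complete
        } catch (IOException e) { throw new SAXException(e); }
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        try {
            if (depth > 1) out.write(esc(new String(ch, start, len)));
        } catch (IOException e) { throw new SAXException(e); }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File(args[0]), new Splitter());
    }
}
```

Because SAX delivers events as the parser reads, memory use stays roughly constant regardless of input size; the resulting part files can then be run through saxon8.jar one at a time.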
RE: parsing big documents with Saxon - Added by Anonymous over 19 years ago
Legacy ID: #3123123 Legacy Poster: sLabriki (slabriki)
Thank you for your answer. I am just a student and I don't think my help would be useful to you, but I thank you for your offer. I tried this in one command line: java -jar -Xmx1400m -Xms1400m -jar saxon8.jar source.xml generate.xslt >result.xml The JVM memory behaved normally (no OutOfMemory error message), but other errors appear, related to the validity of XML characters. These documents were tested before (with XMLSPY 2004) and we worked with them in TAMINO. Everything seems to be well-formed except with the Saxon program and the eXist database. These are big files and I can't modify them directly (by double-clicking, editing, and saving); I must use a program. Is there any functionality in Saxon that corrects files and makes them W3C conformant?
RE: parsing big documents with Saxon - Added by Anonymous over 19 years ago
Legacy ID: #3123145 Legacy Poster: Michael Kay (mhkay)
If there's a problem with the validity of XML characters, this is being reported by the XML parser, not by Saxon. I would write a little SAX application that checks that the file can be processed by the XML parser before attempting to transform it - that will separate the encoding problems from the memory problems, because you won't be building a tree. It could actually be an invalid character, or it could be a problem with the file encoding: the parser may be trying to decode the bytes while assuming the wrong encoding. If necessary, you might even have to write a little Java program that reads the file directly, looking for where the problem characters occur. Unless you specify otherwise, Saxon 8.x uses the default XML parser in the JVM, which is Crimson for JDK 1.4 and Xerces for JDK 1.5. Michael Kay
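[Editor's note: a minimal version of the check described above might look like this. It is a sketch, not Michael Kay's code: the class name CheckWF is made up, and it relies only on the standard JAXP SAX API. Because no handlers do any work, no tree is built and memory use stays flat however large the input is; on failure, the parser reports the line and column of the first offending character.]

```java
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.File;

// Hypothetical well-formedness checker: stream-parses the file with SAX
// and reports where the first parse error occurs, without building a tree.
public class CheckWF {
    /** Returns null if the file is well-formed, or a message locating the first error. */
    static String check(File f) {
        try {
            // DefaultHandler does nothing with the events; we only want
            // the parser to read and decode every character.
            SAXParserFactory.newInstance().newSAXParser().parse(f, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String err = check(new File(args[0]));
        System.out.println(err == null ? "well-formed" : "error at " + err);
    }
}
```

Running this first tells you whether the failure is an encoding/character problem (it will fail here too, with a location) or a memory problem (it will pass here but fail in the transformation).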