transform performance for large xml files
Added by Altuğ Bayram over 10 years ago
Hi,
I am using net.sf.saxon.s9api for Schematron validation. My source XML documents are 100 MB to 200 MB. Loading an XML file into an XdmNode takes too long: approximately 5 minutes.
The transformation also takes minutes.
Is it possible to make this faster?
My source code is below.
XdmNode xmlsource = proc.newDocumentBuilder().build(new StreamSource(bios));
ByteArrayOutputStream svbaos = new ByteArrayOutputStream();
Serializer out = proc.newSerializer(svbaos);
transformer.setDestination(out);
proc.writeXdmValue(xmlsource , transformer);
Thanks and regards,
Altug
Replies (13)
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
Neither the parsing nor the transformation should take anything near to that long.
It's impossible to see why it's taking that long without more information.
Are you fetching resources such as DTDs from remote sites perhaps?
RE: transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
The document is already loaded into a ByteArrayInputStream and passed the XSD check successfully before the Schematron validation. You can see the first part of the XML below.
href="../xslt/kebir.xslt"?>
<edefter:defter xmlns:edefter="http://www.edefter.gov.tr" xmlns:xades="http://uri.etsi.org/01903/v1.3.2#" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.edefter.gov.tr ../xsd/edefter.xsd">
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:iso639="http://www.xbrl.org/2005/iso639" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:gl-bus="http://www.xbrl.org/int/gl/bus/2006-10-25" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gl-cor="http://www.xbrl.org/int/gl/cor/2006-10-25" xmlns:gl-plt="http://www.xbrl.org/int/gl/plt/2006-10-25" xmlns:iso4217="http://www.xbrl.org/2003/iso4217">
<link:schemaRef xlink:href="../xsd/2006-10-25/plt/case-c-b/gl-plt-2006-10-25.xsd" xlink:type="simple"/>
<xbrli:context id="ledger_context">
<xbrli:entity>
<xbrli:identifier scheme="http://www.gib.gov.tr">1234567890</xbrli:identifier>
</xbrli:entity>
<xbrli:period>
<xbrli:instant>2014-05-13</xbrli:instant>
</xbrli:period>
</xbrli:context>
<xbrli:unit id="try">
<xbrli:measure>iso4217:TRY</xbrli:measure>
</xbrli:unit>
<xbrli:unit id="countable">
<xbrli:measure>xbrli:pure</xbrli:measure>
</xbrli:unit>
<gl-cor:accountingEntries>
<gl-cor:documentInfo>
<gl-cor:entriesType contextRef="ledger_context">ledger</gl-cor:entriesType>
<gl-cor:uniqueID contextRef="ledger_context">KEB2014000001</gl-cor:uniqueID>
<gl-cor:language contextRef="ledger_context">iso639:tr</gl-cor:language>
<gl-cor:creationDate contextRef="ledger_context">2014-05-13</gl-cor:creationDate>
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
Sorry, but it's very rarely possible to solve performance issues just by looking at small parts of the code. There are two ways of proceeding: either create a test case that allows us to reproduce the problem ourselves, or do some performance investigation yourself to see where the time is going (e.g. by running under VisualVM). Preferably both.
It would also be good to separate the problems more clearly. Is the problem in XML parsing performance or in transformation performance? Solving performance problems always involves a process of divide & conquer, so let's try and answer that question first.
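As a rough illustration of that divide-and-conquer step, the sketch below times parsing and transformation separately. It is not the code from this thread: it uses the JDK's built-in JAXP classes rather than s9api so that it runs without extra libraries, and the class name PhaseTimer, the method timePhases, and the tiny sample document and stylesheet are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical helper: measures the two phases independently.
final class PhaseTimer {

    /** Returns {parseNanos, transformNanos}; stylesheet compilation is excluded from both. */
    static long[] timePhases(byte[] xml, String xslt) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);

            // Phase 1: build the in-memory tree.
            long t0 = System.nanoTime();
            Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            long t1 = System.nanoTime();

            // Compile the stylesheet outside the timed regions.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));

            // Phase 2: run the transformation on the already-built tree.
            long t2 = System.nanoTime();
            transformer.transform(new DOMSource(doc), new StreamResult(new StringWriter()));
            long t3 = System.nanoTime();

            return new long[] { t1 - t0, t3 - t2 };
        } catch (Exception e) {
            throw new IllegalStateException("timing run failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entries><entry n='1'/><entry n='2'/></entries>";
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='/'><count><xsl:value-of select='count(//entry)'/></count>"
          + "</xsl:template></xsl:stylesheet>";
        long[] t = timePhases(xml.getBytes(StandardCharsets.UTF_8), xslt);
        System.out.printf("parse: %d ms, transform: %d ms%n", t[0] / 1_000_000, t[1] / 1_000_000);
    }
}
```

If one of the two figures dominates, that is the phase worth profiling further.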
RE: transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
Hi,
I have run the same XSL transformation from the command line (the net.sf.saxon.Transform class).
The XML source file is 121 MB and the XSL file is 83 KB. The transformation takes 9 minutes.
The machine: Intel Xeon 5050, 2 cores, 3 GHz, 18 GB RAM.
Is this normal for Saxon?
RE: transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
I have run the same test with the -t parameter. The result is below.
Saxon-HE 9.5.1.3J from Saxonica
Java version 1.6.0_45
Stylesheet compilation time: 1917 milliseconds
Processing file:/C:/SaxonHE9-5-1-3J/xyz.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/C:/SaxonHE9-5-1-3J/xyz.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 9667 milliseconds
Tree size: 3961749 nodes, 13264144 characters, 1376585 attributes
Execution time: 10m 41.522s (641522ms)
Memory used: 720861504
NamePool contents: 145 entries in 135 chains. 15 URIs
transform performance for large xml files - Added by Michael Kay over 10 years ago
As I tried to explain already, with performance problems, the devil is in the detail.
Your question is like asking "I've got a Java program that takes ten minutes to run, is this normal?". Answer, it depends what it is doing. Without looking at the code, I can't possibly tell.
Michael Kay Saxonica
transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
Hi Michael,
It is not my code anymore; it is the tool's own command line: http://www.saxonica.com/documentation/using-xsl/commandline.html
But if that does not help to solve the issue either, I can send the XML and XSL files. Could you please share your email address?
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
You're welcome to post your XML and XSLT code to support at saxonica dot com
RE: transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
I have sent the files to that email address.
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
I ran it with -TP:profile.html and this shows over 90% of the time spent in one template rule matching
/edefter:defter/xbrli:xbrl/gl-cor:accountingEntries/gl-cor:entryHeader/gl-cor:entryDetail
Unfortunately this template rule is over 250 lines long so that doesn't pinpoint the problem very precisely. However, a quick scan reveals this condition:
<xsl:when test="not(preceding-sibling::node()) or not(preceding-sibling::node()/gl-cor:lineNumberCounter) or not(gl-cor:lineNumberCounter) or xs:decimal(gl-cor:lineNumberCounter) >= max(preceding-sibling::node()/xs:decimal(gl-cor:lineNumberCounter))"/>
which is likely to be very expensive if the number of preceding sibling nodes is high (in particular, the performance is quadratic in the number of siblings). A quick query showed that the gl-cor:entryHeader element can have as many as 1153 children, so this is clearly your bottleneck.
Java profiling also showed much of the time being spent navigating the preceding-sibling axis and converting from string to xs:decimal.
You could convert this from a quadratic to a linear algorithm by using recursive traversal of the children of the entryHeader element, keeping track of max(lineNumberCounter) as you go (though if this is schematron-generated code, I wouldn't know how to change what schematron generates.)
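To make the quadratic-versus-linear point concrete, here is a minimal sketch in plain Java rather than XSLT. It assumes the lineNumberCounter values have already been extracted into a list in document order; the class and method names are hypothetical, and it ignores the case of siblings without a counter, which the real rule also handles.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two strategies on a plain list of counters.
final class RunningMaxCheck {

    /** Quadratic: for every item, rescan all earlier items to find their maximum. */
    static List<Integer> failingQuadratic(List<BigDecimal> counters) {
        List<Integer> failing = new ArrayList<>();
        for (int i = 1; i < counters.size(); i++) {
            BigDecimal max = counters.get(0);
            for (int j = 1; j < i; j++) {
                max = max.max(counters.get(j));
            }
            if (counters.get(i).compareTo(max) < 0) {
                failing.add(i);
            }
        }
        return failing;
    }

    /** Linear: carry the maximum seen so far from one item to the next. */
    static List<Integer> failingLinear(List<BigDecimal> counters) {
        List<Integer> failing = new ArrayList<>();
        BigDecimal max = null;
        for (int i = 0; i < counters.size(); i++) {
            BigDecimal c = counters.get(i);
            if (max != null && c.compareTo(max) < 0) {
                failing.add(i);
            }
            max = (max == null) ? c : max.max(c);
        }
        return failing;
    }

    public static void main(String[] args) {
        List<BigDecimal> counters = new ArrayList<>();
        for (String s : new String[] {"1", "5", "3", "4", "6"}) {
            counters.add(new BigDecimal(s));
        }
        System.out.println(failingQuadratic(counters)); // [2, 3]
        System.out.println(failingLinear(counters));    // [2, 3]
    }
}
```

Both methods report the same positions; the second visits each item once instead of rescanning everything before it. In XSLT the same idea would be expressed by passing the running maximum as a parameter while stepping through the children one at a time.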
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
It occurs to me that this condition
xs:decimal(gl-cor:lineNumberCounter) >= max(preceding-sibling::node()/xs:decimal(gl-cor:lineNumberCounter))
is saying that each lineNumberCounter must be >= the max of all the previous lineNumberCounters, which will only be true if each lineNumberCounter is >= the most recent one, that is, if they are monotonically increasing. So the condition can be replaced by the much more efficient
xs:decimal(gl-cor:lineNumberCounter) >= xs:decimal(preceding-sibling::*[gl-cor:lineNumberCounter][1]/gl-cor:lineNumberCounter)
This reduces the execution time to 22 seconds.
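The equivalence behind this rewrite can be checked on plain numbers. The sketch below is an illustration in Java, not the Schematron rule itself; it assumes the counters are available as an array in document order, and the class and method names are hypothetical.

```java
// Hypothetical illustration: the two rules agree on whether a whole sequence is valid.
final class NeighbourCheck {

    /** Original rule: every value is >= the maximum of all values before it. */
    static boolean allAtLeastMaxOfPreceding(long[] values) {
        long max = Long.MIN_VALUE;
        for (long v : values) {
            if (v < max) {
                return false;
            }
            max = Math.max(max, v);
        }
        return true;
    }

    /** Rewritten rule: every value is >= the value immediately before it. */
    static boolean allAtLeastPrevious(long[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i] < values[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[][] samples = { {1, 2, 2, 3}, {1, 5, 3, 4}, {}, {7} };
        for (long[] s : samples) {
            System.out.println(allAtLeastMaxOfPreceding(s) + " " + allAtLeastPrevious(s));
        }
    }
}
```

The two rules give the same overall verdict for a sequence, but they do not necessarily flag the same individual items. In 1, 5, 3, 4 the original rule flags both 3 and 4, while the rewritten rule flags only 3, so the number of reported failures may differ after the change.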
RE: transform performance for large xml files - Added by Michael Kay over 10 years ago
Also, double arithmetic is faster than decimal arithmetic, so if it's appropriate to the problem, use doubles. (I would have actually expected a lineNumberCounter to be an integer, but that's guesswork).
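Whether doubles are "appropriate to the problem" depends on whether exact decimal results matter. The sketch below uses Java's BigDecimal and double as stand-ins for xs:decimal and xs:double; that mapping is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical illustration of where decimal and double arithmetic can disagree.
final class DecimalVsDouble {

    /** Exact decimal arithmetic: a + b compared with the expected value. */
    static boolean decimalSumMatches(String a, String b, String expected) {
        return new BigDecimal(a).add(new BigDecimal(b)).compareTo(new BigDecimal(expected)) == 0;
    }

    /** Binary floating point: the same comparison using doubles. */
    static boolean doubleSumMatches(String a, String b, String expected) {
        return Double.parseDouble(a) + Double.parseDouble(b) == Double.parseDouble(expected);
    }

    public static void main(String[] args) {
        // Fractional values: only the decimal comparison holds.
        System.out.println(decimalSumMatches("0.1", "0.2", "0.3")); // true
        System.out.println(doubleSumMatches("0.1", "0.2", "0.3"));  // false
        // Small whole numbers, as a line counter would be: both agree.
        System.out.println(decimalSumMatches("1152", "1", "1153")); // true
        System.out.println(doubleSumMatches("1152", "1", "1153"));  // true
    }
}
```

For whole-number counters of this size the two representations behave identically, so the faster one is a safe choice; for fractional amounts the decimal type may still be needed.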
RE: transform performance for large xml files - Added by Altuğ Bayram over 10 years ago
Hi Michael,
We have applied your suggestion to the Schematron rules. It now runs in approximately 50 seconds, which is wonderful for us.
We are grateful for your help and guidance.
Kind regards,
Altug Bayram