Retrieving "largest" Element...

Added by Anonymous almost 14 years ago

Legacy ID: #8504717 Legacy Poster: Tobias Klevenz (tklevenz)

Hello, I'm trying to do the following: I want a list of elements from an XML file, where each element name is listed only once and only the instance with the longest string value is kept. Since the files I'll be transforming can be around a gigabyte, streaming is pretty much mandatory. Here is what I came up with:

[code]<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
    version="2.0"
    xmlns:saxon="http://saxon.sf.net/"
    extension-element-prefixes="saxon">

  <xsl:variable name="var" saxon:assignable="yes">
    <elements/>
  </xsl:variable>

  <xsl:template match="/">
    <xsl:variable name="ns" select="namespace-uri(.)"/>
    <xsl:call-template name="iterate">
      <xsl:with-param name="ns" select="$ns"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="iterate">
    <xsl:param name="ns"/>
    <saxon:iterate select="saxon:stream(doc('10000K.xml')//*[. = text()])">
      <xsl:variable name="text" select="."/>
      <xsl:variable name="length" select="string-length($text)"/>
      <xsl:variable name="elem" select="name()"/>
      <xsl:variable name="var2">
        <elements>
          <xsl:choose>
            <xsl:when test="not($var/elements/*[name() = $elem][string-length() > $length])">
              <xsl:copy-of select="."/>
              <xsl:copy-of select="$var/elements/*[not(name() = $elem)]"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:copy-of select="$var/elements/*"/>
            </xsl:otherwise>
          </xsl:choose>
        </elements>
      </xsl:variable>
      <saxon:assign name="var">
        <xsl:copy-of select="$var2"/>
      </saxon:assign>
    </saxon:iterate>
    <xsl:copy-of select="$var/elements/*"/>
  </xsl:template>
</xsl:stylesheet>[/code]

I had never used the "global variable" concept before, since I didn't know it existed, but I had always missed it in XSLT compared to other languages. I'm running this code in Oxygen with Saxon 9 EE, and with a 10 MB file it already takes about 150 s, so I'm guessing a 1 GB file would take hours. Any idea how to speed that up? Thanks, Tobias
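To make the requirement concrete (this sample input and output are illustrative only, not taken from the original post): given an input such as

[code]<root>
  <a>xx</a>
  <b>y</b>
  <a>xxxx</a>
</root>[/code]

the intended result keeps exactly one element per name, namely the instance with the longest string value:

[code]<a>xxxx</a>
<b>y</b>[/code]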


Replies (5)


RE: Retrieving "largest" Element... - Added by Anonymous almost 14 years ago

Legacy ID: #8504871 Legacy Poster: Michael Müller-Hillebrand (mmh5)

Tobias, you are most definitely on the wrong track. Your use of Saxon-specific constructs borrowed from procedural programming seems to be completely unnecessary. You just want to check each element node for the length of its text content and its name. This could be done using recursion, I guess: using <xsl:template match="*"> you could visit each element node, and using <xsl:param .../> and <xsl:with-param .../> you could pass on the names of already-seen nodes. Your chances of getting a good hint in the right direction are better on the regular xsl-list mailing list, so you should try to build a solution without Saxon extensions. - Michael
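A minimal non-streaming sketch of this idea, using standard XSLT 2.0 grouping rather than the explicit recursion Michael mentions (the leaf-only filter //*[not(*)] and the <elements> output wrapper are assumptions, not taken from the post):

[code]<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Non-streaming: the whole source document is read into memory first. -->
  <xsl:template match="/">
    <elements>
      <!-- Group the leaf elements (no element children) by name... -->
      <xsl:for-each-group select="//*[not(*)]" group-by="name()">
        <!-- ...and keep one representative per name: the longest string value. -->
        <xsl:for-each select="current-group()">
          <xsl:sort select="string-length(.)" data-type="number" order="descending"/>
          <xsl:if test="position() = 1">
            <xsl:copy-of select="."/>
          </xsl:if>
        </xsl:for-each>
      </xsl:for-each-group>
    </elements>
  </xsl:template>
</xsl:stylesheet>[/code]

As the next reply points out, this still builds the full tree in memory, so it only helps while the document fits into RAM.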

RE: Retrieving "largest" Element... - Added by Anonymous almost 14 years ago

Legacy ID: #8505035 Legacy Poster: Tobias Klevenz (tklevenz)

I wouldn't have a problem at all using standard XSLT constructs; my problem is the large file size. If I do <xsl:template match="*"> on a 1 GB file, wouldn't I still be building the whole DOM tree? I thought that would be slower than streaming, if it is possible at all with the memory available on my machine. - Tobias

RE: Retrieving "largest" Element... - Added by Anonymous almost 14 years ago

Legacy ID: #8505879 Legacy Poster: Michael Kay (mhkay)

To get a streamed transformation here, you do indeed need the saxon:iterate extension (which anticipates xsl:iterate in the XSLT 2.1 specification). However, you don't need saxon:assign - the idea is that with saxon:iterate you can set the parameter values for the next iteration using xsl:with-param within saxon:continue, and pick up these new values in an xsl:param nested within the saxon:iterate call.

However, this won't directly help with the performance problem, which is caused by the large xsl:copy-of used to copy the intermediate results on each iteration (well, on many iterations). The problem is that the intermediate results grow larger and larger.

One solution is to call out to Java and use a mutable HashMap to hold a mapping from element names to the largest element of that name encountered so far. Or you could use your current approach, but modified so that you never have more than one value for each element name in the intermediate results: instead of simply appending a new element to the results, you do a transformation on the histogram document that replaces the element holding the previous largest value with an element holding the new largest value. This way the size of the copy is bounded by the number of distinct element names in the document vocabulary, and assuming random data the number of times the structure is copied is statistically 1/1 + 1/2 + 1/3 + 1/4 + ... + 1/n, where n is the number of element instances in the source; this rises as n increases, but only slowly, so the overall performance might be effectively linear.
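Saxon's saxon:iterate/saxon:continue later became xsl:iterate/xsl:next-iteration in what is now XSLT 3.0. A minimal sketch of the bounded-accumulator idea Michael describes, written in that standardized form (the file name and the leaf-element selection //*[not(*)] are assumptions carried over from the thread, and, as the next reply notes, selecting childless elements is itself not straightforward to stream):

[code]<!-- XSLT 3.0 syntax: requires version="3.0" on the stylesheet -->
<xsl:template name="collect-largest">
  <!-- $best never holds more than one element per distinct name,
       so each copy of the accumulator stays small. -->
  <xsl:iterate select="doc('10000K.xml')//*[not(*)]">
    <xsl:param name="best" select="()" as="element()*"/>
    <xsl:on-completion>
      <xsl:copy-of select="$best"/>
    </xsl:on-completion>
    <!-- Work on a copy of the current node so no reference into the input tree is retained. -->
    <xsl:variable name="this" as="element()">
      <xsl:copy-of select="."/>
    </xsl:variable>
    <xsl:variable name="prev" select="$best[name() = name($this)]"/>
    <xsl:next-iteration>
      <!-- Replace the entry for this name only when the new value is longer. -->
      <xsl:with-param name="best" select="
        if (empty($prev)) then ($best, $this)
        else if (string-length($this) gt string-length($prev)) then (($best except $prev), $this)
        else $best"/>
    </xsl:next-iteration>
  </xsl:iterate>
</xsl:template>[/code]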

RE: Retrieving "largest" Element... - Added by Anonymous almost 14 years ago

Legacy ID: #8506352 Legacy Poster: Michael Kay (mhkay)

There's another problem here that I didn't spot yesterday. Your selection doc(...)//* includes all elements in the document, including the outermost element. When you apply the predicate [. = text()], the string value of the element has to be evaluated, which for the outermost element will be about a gigabyte in size. I think that what you are actually trying to do is to select elements that have no element children - which isn't easy in streaming mode either. But a good start might be to exclude the outermost element from the search, so the path becomes /*//*.
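For reference, a sketch of the selections side by side (using the file name from the original post; the leaf-only variant expresses the intent but, as noted above, is not straightforward to stream):

[code]<!-- original: every element, including the outermost one, whose string value is ~1 GB -->
select="doc('10000K.xml')//*[. = text()]"

<!-- excluding the outermost element from the scan -->
select="doc('10000K.xml')/*//*[. = text()]"

<!-- the actual intent: leaf elements only (no element children) -->
select="doc('10000K.xml')//*[not(*)]"[/code]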

RE: Retrieving "largest" Element... - Added by Anonymous almost 14 years ago

Legacy ID: #8506579 Legacy Poster: Tobias Klevenz (tklevenz)

Hi Michael, I spotted that last one myself when I looked at the code again yesterday. :) Thanks a lot for your insight, though. I'll try to avoid the copy-of and modify my current approach the way you suggested; since this is also kind of a proof of concept for me, I wanted to stick to XSLT. I find this simple-sounding problem quite interesting when looking at it from an XSLT point of view. - Tobias
