thinning a tree (or, optimizing in XQuery)
Added by Anonymous over 16 years ago
Legacy ID: #4992339 Legacy Poster: Chapman Flack (jcflack)
Hi, I'm trying to write a function for what may be a common task (others may already have done this - if so I would love to see another solution). Given a great big tree, and a node sequence consisting only of the leaf nodes that satisfy some interesting filter (like //leaf[@flavor eq 'eucalyptus']), I want to return a tree of the same structure but containing only the interesting leaf nodes and the necessary nonleaf nodes (the ones on paths to selected leaves). My first attempt was bottom-up and tail-recursive (reverse $selected-leaves to go through it in reverse document order, constructing new nonleaf nodes as needed; reverse document order ensuring that no nonleaf node gets constructed before any nonleaf children it needs). It worked but exhausted memory on large inputs. (The culprit may have been the mapping I had to maintain from original to newly constructed nonleaf nodes, I'm not sure.) Then I tried a straightforward top-down approach, starting with the root element and constructing a new copy that includes only its children that are in $selected-leaves or in $needed-nonleaves ($selected-leaves/ancestor::element(), which can be computed once at the outset). Its nonleaf children are constructed by ordinary non-tail recursive calls. That approach was more memory-friendly (the tree is often big but rarely deep) but much less clock-friendly. 1. I discovered that the TimedTraceListener only records calls to my user functions. Is it possible to get a little lower and see timings for different operations in my functions? Does that require writing a custom TraceListener? I noticed that I got about a 4x speed improvement by rephrasing let $children := $n/ let $nonleaf-children := $children intersect $needed-nonleaves let $leaf-children := $children intersect $selected-leaves as let $nonleaf-children := $needed-nonleaves[.. is $n] let $leaf-children := $selected-leaves[.. is $n] but it would be nice to use profile data to see what I should try hardest to rephrase. 2. I can probably save a lot of time in the recursive calls by passing only the subsequences of $selected-leaves and $needed-nonleaves that consist of descendants of the node to be processed - IF it's possible to compute those subsequences at each call with less work than is wasted now by applying [.. is $n] to the entire sequence every time, and IF I can do it without the subsequences having to be separately realized in memory. Is there an idiom that's especially efficient in Saxon to express the equivalent of $some-sequence[ancestor::node() = $some-node], that is, the elements of $some-sequence descended from $some-node? With the sequence in document order, the subsequence should be contiguous, so it could be characterized by a left and right index managed explicitly and passed to recursive calls, which could have promise as far as not using memory to realize the subsequences. But it's an awfully procedural way of thinking, and I'm wondering if there's a more XQueryish way to say it that Saxon will nevertheless carry out with similar efficiency. Thanks, Chapman Flack
Replies (3)
Please register to reply
RE: thinning a tree (or, optimizing in XQuery - Added by Anonymous over 16 years ago
Legacy ID: #4992675 Legacy Poster: Michael Kay (mhkay)
I personally find this kind of task MUCH easier in XSLT, which has built-in primitives for doing a recursive descent of the tree. It's also easier, I think, to make it efficient - the key to the performance of this kind of operation is to make sure you avoid repeated copying of subtrees, which will happen if the function calls are executed in push mode rather than pull mode. I suspect your "bottom-up" code was copying subtrees, and the "top-down" code wasn't, but without seeing your code I can't tell. 1. I discovered that the TimedTraceListener only records calls to my user functions. Is it possible to get a little lower and see timings for different operations in my functions? Does that require writing a custom TraceListener? You can adapt the TimedTraceListener to record more information, but I think you'll find that the information becomes unreliable at a finer level of granularity. Lazy evaluation makes it very hard to decipher the results - you can't easily distinguish the cost of doing an intersect operation from the cost of evaluating its operands, because the whole thing is pipelined. >Is there an idiom that's especially efficient in Saxon to express the equivalent of $some-sequence[ancestor::node() = $some-node], that is, the elements of $some-sequence descended from $some-node? I'm sure "=" for "is" (or "intersect") there was a slip... But it's a common mistake, and because it often gives the right answer people don't realize how costly it is (as well as being wrong, when given suitable data). It can result in comparison of very long string-values, which take a long time to assemble from all the descendant text nodes. The answer is no, there's no particularly good way of doing this. With some tree models, notably the TinyTree, there would be good strategies available, but Saxon generally avoids doing any optimizations that take account of which tree model is in use. (In general, this isn't even known at compile time.) The optimimum strategy may well be different depending on whether your filter is selecting 5% of leaf nodes or 95% of them. In the latter case you would be better off building a list of nodes that DON'T have any selected descendants rather than a list of those that DO.
RE: thinning a tree (or, optimizing in XQuery - Added by Anonymous over 16 years ago
Legacy ID: #4992700 Legacy Poster: Michael Kay (mhkay)
>>Is there an idiom that's especially efficient in Saxon to express the >equivalent of $some-sequence[ancestor::node() = $some-node], that is, the elements of $some-sequence descended from $some-node? One construct you could experiment with is let $start = $some-node let $end = $some-node/following::node()[1] return $some-sequence[. >> $start and not($end << .)] That should work quite well on the tinytree where "<<" and ">>" are pretty fast. (the peculiar way of coding it is so that it still works when $start is the root of the tree.) Michael Kay
RE: thinning a tree (or, optimizing in XQuery - Added by Anonymous over 16 years ago
Legacy ID: #4994250 Legacy Poster: Chapman Flack (jcflack)
> One construct you could experiment with is > let $start = $some-node > let $end = $some-node/following::node()[1] > return $some-sequence[. >> $start and not($end << .)] I'm happy to report that's exactly the construct I did experiment with yesterday while waiting for replies (though mine had an 'if' because I hadn't thought of not($end<<.) - that's nicer). It gave me a 15x speedup on my test document, on top of the 4x from replacing the 'intersect's with '.. is $n's earlier. A factor of 60 improvement makes a pretty fair answer to "how was work yesterday?". :) You were right that my = in the question was a slip (in the sense that I had never coded up and tested a version that used that) but I am glad you pointed it out because I had still not completely absorbed the fact that the general comparisons always atomize, and that 'intersect' there would have been the way to get the equivalent result based on node identity. Thanks, -Chap
Please register to reply