Project

Profile

Help

Join Performance on Saxon

Added by Anonymous over 17 years ago

Legacy ID: #4412581 Legacy Poster: Mike Jones (firenike)

Hi- I was working with a friend on this problem: We have the following document: <?xml version="1.0" encoding="UTF-8"?> <root> <class id="1"> <teacher>5</teacher> </class> <class id="2"> <teacher>9</teacher> </class> <teacher id="5"> <gender>female</gender> </teacher> <teacher id="9"> <gender>male</gender> </teacher> </root> To select all of the class taught by female teachers, we use the query: class[teacher = ../teacher[gender = 'female']/@id] This seems to work. However, if we create 1500 copies of the class with id="2" and 1500 copies of the teacher with id="9" then the query takes almost 30 seconds to complete with saxon 8 on a 2Ghz Core 2 Duo even though the query only returns 1 row. If instead of 1500 classes x 1500 teachers, there are 1500x1 or 1x1500, then the query completes in less than a few seconds. Unfortunately we have not tested this on another engine, but I was wondering if anyone knew offhand if this performance is to be expected, and if anyone knew if there was a more efficient query. This was originally posed on the mulberry list, but I ask it here since it may be performance related to the engine. Mike


Replies (4)

Please register to reply

RE: Join Performance on Saxon - Added by Anonymous over 17 years ago

Legacy ID: #4412653 Legacy Poster: Michael Kay (mhkay)

Generally speaking, Saxon-SA optimizes join performance whereas Saxon-B always uses a nested loop. In this case however, I don't think Saxon-SA will optimize it either. Saxon won't recognize that because all the class elements are siblings, the expression ../teacher[gender = 'female']/@id is actually selecting the same nodes each time. It works out that both operands of the "=" operator depend on the context item and it therefore evaluates the predicate independently for each class. Note sure whether you're using XSLT, XQuery, or standalone XPath, but whichever it is, it should be easy enough to rewrite it as something like let $femaleTeachers := teacher[gender = 'female']/@id return class[teacher = $femaleTeachers] and Saxon-SA will then do a hash join. It would be possible of course to add the logic to the optimizer to detect this case, but before recognizing a particular pattern in the optimizer one needs some confidence that it's going to benefit a worthwhile number of queries.

RE: Join Performance on Saxon - Added by Anonymous over 17 years ago

Legacy ID: #4413050 Legacy Poster: Mike Jones (firenike)

Hi- The only problem with rewriting the statement with a separate variable for each id list is that we are building something which allows a user to browse a data set by dynamically applying any number of filters, so the user may specify construct queries such as: Show every class* where: * the teacher* has a mother* who works for a company* which is french+. and * a student* has a friend* who has a mother* who works for a company* which is english+. Asterisks(*) denote ids which must be dereferenced and pluses (+) denote direct children. We are using XSLT on Saxon-B to transform the query, and if we were allowed to use the original construct, then we could easily transform the the first clause into the following string: //class[teacher=//teacher[mother=//mother[company=//company[nationality='french']/@id]/@id]/@id] and then we could just saxon:evaluate this to retrieve the results. Is there a way we could achieve something similar using the technique of generating an id list (so that we can avoid the performance hit)? It seems like we would need to do a complex double transformation that would transform the xpath above into a sequence of steps where xslt statements set an id list variable before each dereferencing using xpath. Is there a way to rewrite or restructure so that we can get saxon-sa to optimize the statement? Also, we use "joins" elsewhere that we would like to optimize- is there somewhere you can point us so that we can study what exactly are the joins that will be optimized by saxon-sa before purchasing? and just to confirm- we can purchase xslt-sa version (not the full package) to benefit from these optimizations? thanks very much.

RE: Join Performance on Saxon - Added by Anonymous over 17 years ago

Legacy ID: #4413839 Legacy Poster: Michael Kay (mhkay)

I think this construct should be optimized OK: //class[teacher=//teacher[mother=//mother[company=//company[nationality='french']/@id]/@id]/@id] Given c[teacher = ../path] Saxon doesn't index it because it doesn't recognize that "c" is a sibling node-set and that ../path is therefore constant for all c nodes. However, given c[teacher = //path] Saxon-SA does recognize that "c" is a "single-document node-set" - a set of nodes from a single document - and that //path is therefore the same for every node in c, which means it can be pulled out of the predicate. Predicates applied to an expression rooted at a document node are also particularly effective because the indexes are held along with the document node and are thus relatively reusable.

RE: Join Performance on Saxon - Added by Anonymous over 17 years ago

Legacy ID: #4413844 Legacy Poster: Michael Kay (mhkay)

Regarding the other part of the question - how can you tell which joins are optimized - it's very difficult to write down the exact rules that are followed, because the original expression goes through the optimizer a number of times, so it's not just a question of describing the expression as originally written, but the expression as it exists after previous stages of optimization. You can though use saxon:explain="yes" to see what the optimizer has done with an expression - though the output takes a little bit of deciphering (it will be much improved in the next release). Look for indications that an expression is "indexed" or that it has been replaced by a call on the key() function with a system-allocated key name. Saxon-SA does not actually recognize or optimize joins per se. What it does is to optimize filter expressions of the form A[B=C] where either B or C, but not both, are independent of the context item. It does this by creating an index over the value of A such that the items with a particular value of B can be found quickly. This is done only if the filter expression appears within a nested loop structure so that the index can be reused. This nested loop structure is the typical representation of an unoptimized join.

    (1-4/4)

    Please register to reply