Saxon EE: Caching indexes across XQ calls

Added by Anonymous about 13 years ago

Legacy ID: #10039417 Legacy Poster: David Lee (daldei)

I'm trying out Saxon EE in hopes of performance improvements. One use case I have pre-loads several large XML files, then passes the parsed documents into Saxon XQuery multiple times (as external parameters), once for each of many XQuery files. These queries make heavy use of joins. I'm seeing a slight improvement in EE over HE (about 30 seconds out of 5 minutes) but was wondering if it can be improved. One suspicion I have is that indexes created by one invocation of XQuery are not persisted, so that if I call multiple XQueries with the same parsed documents they can't make use of previously created indexes. Is this correct? If so, do you have any suggestions on how I might be able to persist the index cache from one XQuery run to another? Where are they stored? Thanks for any pointers. -David Lee


Replies (9)

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10039556 Legacy Poster: Michael Kay (mhkay)

There are two kinds of index created by the Saxon-EE optimizer: document-level indexes and variable-level indexes.

A document-level index is created when there is a construct such as /a/b/c[d=$x] where the path expression is rooted at a document node. You will be able to spot such indexes in the -explain output by the presence of a key definition and calls on the key() function using key names such as "kk:k001".

The other kind of index applies to a variable - either a variable that's explicit in the query, or one injected by the Saxon optimizer itself. This typically supports an expression such as $var[a=$x] where $var is some sequence of nodes not rooted at a document. Variable-level indexes are transient, and exist only while the variable is in scope; they will never be shared or reused across multiple query evaluations.

Document-level indexes should - at least in principle - remain in memory so long as both the document and the compiled query are in memory. Once either the document or the query is garbage collected, the index should disappear. Having said that, however, I know this works in XSLT but I'm not sure I have firm evidence that it works in XQuery. I don't think that two different queries will ever share an index, though I'd need to check the code to be sure.
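For illustration only, here is a small s9api sketch of the two expression shapes described above. The file name, element names and variable names are invented, and whether an index is actually generated is up to the optimizer; the -explain output (for example from the net.sf.saxon.Query command line with -explain) is where any generated key definitions such as kk:k001 would appear.

    import net.sf.saxon.s9api.*;

    public class IndexShapes {
        public static void main(String[] args) throws SaxonApiException {
            Processor proc = new Processor(true);   // true = use licensed (EE) features
            XQueryCompiler comp = proc.newXQueryCompiler();

            // Document-level candidate: a path rooted at a document node with a
            // comparison predicate, the /a/b/c[d=$x] shape described above.
            XQueryExecutable docLevel = comp.compile(
                "declare variable $x external; " +
                "doc('catalogue.xml')/items/item[code = $x]");

            // Variable-level candidate: a filter over a node sequence bound to a
            // variable; any index here is transient and dies with the variable.
            XQueryExecutable varLevel = comp.compile(
                "declare variable $items as element()* external; " +
                "declare variable $x external; " +
                "$items[code = $x]");
        }
    }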

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10041594 Legacy Poster: David Lee (daldei)

Thanks, that's about what I deduced (and probably what I would have implemented myself). Unfortunately this optimization strategy is in opposition to my development strategy of breaking big tasks into smaller ones, using many small XQuery programs and passing the parsed documents across many queries. I think this robs the optimizer of the obvious chances of caching indexes and other such things. On the other hand it really helps in other ways, by letting me debug smaller programs and pass pre-parsed documents to multiple modules. Using Saxon HE this strategy works extremely well; it's just that in Saxon EE it doesn't work much better.

I suggest, as food for thought for a future enhancement: imagine a context between the static and dynamic context where indexes can be cached and preserved across XSLT or XQuery calls, provided the same documents are bound to external variables or the input context. Alternatively, the documents or XDM values themselves (NodeInfo? XdmItem?) could store the index caches, so that different calls to XSLT/XPath/XQuery with reused values can share the indexing and optimization work of previous incarnations. Imagine a system that holds onto the XDM representation rather than serializing it at every operation. I know this is not the normal use case (parse / process / serialize) but I suggest it could be a very valuable one as more applications and frameworks integrate with Saxon at the XDM level instead of the serialized text level. And maybe not actually that difficult to implement (yes, I know I'm being a 'backseat quarterback' ... but maybe?).

Anyway, thanks for the answers, it gives me places to look. -David

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10044398 Legacy Poster: Michael Kay (mhkay)

You can of course organize a pipeline in which XDM instances (trees) are passed from one query to another, rather than being parsed and serialized at each stage. I would hope you are doing that, because it makes a big difference. I've been looking at the code and I think it might not be too difficult to reuse index definitions under the control of the same KeyManager. The difficulty is designing a usable system in which the KeyManager has wider scope than a single query or stylesheet, but smaller scope than an entire Configuration. I'd be reluctant to put the KeyManager at the Configuration level because it could grow indefinitely in some workload scenarios. If it can only be achieved using some extra abstract object visible in the API (say an Application object) then very few users are ever going to discover how to take advantage of it.
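As a minimal sketch of that kind of pipeline using the s9api interface (the file names, query list and the external variable name doc1 are invented here), parsing once and handing the same tree to several queries might look like this:

    import net.sf.saxon.s9api.*;
    import java.io.File;

    public class ReuseParsedDoc {
        public static void main(String[] args) throws Exception {
            Processor proc = new Processor(true);          // EE features if licensed
            XdmNode doc1 = proc.newDocumentBuilder().build(new File("file.xml"));

            XQueryCompiler comp = proc.newXQueryCompiler();
            String[] queries = { "section1.xquery", "section2.xquery" };
            for (String q : queries) {
                XQueryEvaluator eval = comp.compile(new File(q)).load();
                // The same XdmNode is handed to each query: no re-parsing and no
                // serialization between stages.
                eval.setExternalVariable(new QName("doc1"), doc1);
                XdmValue result = eval.evaluate();
                System.out.println(result);
            }
        }
    }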

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045138 Legacy Poster: David Lee (daldei)

Yes a "pipeline" is similar to what I'm doing. In my case (using xmlsh) this pattern is very common (atleast for me) xread doc1 < file.xml xread doc2 < file2.xml .... for i in *.xquery ; do output=$(xfile -b $i) xquery -f $i -v doc1 $doc1 doc2 $doc2 > ${output}.xml done This pre-parses multiple files, then reuses them across multiple xquery calls passed in as external parameters (of type document-node()) In this case producing different "sections" or "pages" of output one per xquery. It works quite well, and gets around the absence of a store() command in xquery, but also from a programmers perspective I find it easier to write and debug then one huge xquery (or xslt) file. And all in the same JVM its as efficient. Before xmlsh I would use the same pattern in 'pure java' so I suggest this is not just specific to xmlsh, its a generic use case that I suspect is in wide use by people trying to optimize xml processing. EXCEPT when I throw EE at it, as discussed, it looses any indexing optimization information collected in each iteration. Yes this KeyManager object was the idea I had first, but you are right it may be tricky to use so limited. And yes putting it in the global configuration may grow indefinitely (unless you made use of some tricky use of weak references, but still ... not great). Plus the global configuration is sharable across threads , adding dynamic data to it would break that (unless you used thread local storage, but then it would break if you tried to pass documents across threads, your screwed both ways). So my last thought, put the accumulated key data in the documents (NodeInfo) themselves ! then they accumulate as long as that document is present. However it then breaks the thread safe model of NodeInfo ... which might be patched (like configuration) using thread local storage .... but same issue, passing a NodeInfo to a new thread even if not concurrently would then lose the key cache data. Solving the threading, scope, and usability factors simultaneously is indeed an interesting challenge. I suspect some level of user control may be needed to balance these requirements. Multithreading in particular. Sometimes threads are used to access the same objects sequentially , or sometimes concurrently ... or sometimes never share objects across threads. Which case to optimize for may be hard to second-guess without intervention (and could change in the lifecycle of an application or on an object basis). Thanks for looking at it.

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045268 Legacy Poster: Michael Kay (mhkay)

Thanks for the ideas. One approach might be for DocumentBuilder to offer a method buildIndex(doc, match, use) where match and use behave like the corresponding attributes of xsl:key; the method would construct an index that lives as long as doc lives, and then XQuery/XSLT before building an index for a document would check to see if a suitable one already exists. I don't think it would be stretching the spec too far to allow the index to be named, so it could be invoked explicitly by calling the key() function.
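Purely as a sketch of that proposal - buildIndex() does not exist in Saxon, and the signature, index name and file name below are invented for discussion - usage might look something like:

    // Hypothetical API only: buildIndex() is the method proposed above.
    // "match" and "use" behave like the corresponding attributes of xsl:key.
    Processor proc = new Processor(true);
    DocumentBuilder builder = proc.newDocumentBuilder();
    XdmNode dict = builder.build(new File("dictionary.xml"));

    // Pre-build a named index that lives as long as the document lives.
    builder.buildIndex(dict, new QName("entry-by-code"), /*match*/ "entry", /*use*/ "@code");

    // A later query or stylesheet could then call key('entry-by-code', $code, $dict)
    // and find the index already attached instead of building its own.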

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045314 Legacy Poster: David Lee (daldei)

I think that would be very useful. In addition, I could imagine a flag on the invocation of XPath/XQuery/XSLT which indicates whether it is "OK" to add dynamically generated indexes into the document itself, passed in as either the input context or external variables (or any XdmValue; imagine a sequence of nodes that could be indexed but not wrapped in a document). Like your suggestion, the client would be responsible for knowing if the documents were thread-safe or not, and of course the default behaviour would be unchanged (treat input as read-only).

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045584 Legacy Poster: Michael Kay (mhkay)

> Like your suggestion, the client would be responsible for knowing if the documents were thread-safe or not, and of course the default behaviour would be unchanged (treat input as read-only)

Actually, Saxon (all editions) already builds document-level indexes to support the id() function and to support //X expressions: in both cases these are built on first use, and retained for the life of the document. So there's certainly a precedent. Thread-safety is handled pragmatically, on the basis that it doesn't matter if two people build the same index at the same time and only one of them succeeds in attaching it to the document node. The difference is that the number of id() and //X indexes is bounded, whereas the potential number of filter-optimization indexes is unbounded, and the probability of them being reused in subsequent queries is probably rather lower.
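Not Saxon's internal code, but a minimal sketch of that pragmatic thread-safety: two threads may each build the same index, only one attachment wins, and the loser's work is simply discarded (the index-name key and map types are placeholders):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class DocumentIndexes<K, V> {
        // One entry per index name; attached lazily, on first use.
        private final ConcurrentHashMap<String, Map<K, V>> indexes = new ConcurrentHashMap<>();

        Map<K, V> getOrBuild(String name, IndexBuilder<K, V> builder) {
            Map<K, V> existing = indexes.get(name);
            if (existing != null) {
                return existing;                          // already attached
            }
            Map<K, V> built = builder.build();            // may be duplicated effort
            Map<K, V> winner = indexes.putIfAbsent(name, built);
            return winner != null ? winner : built;       // the losing copy is just garbage
        }

        interface IndexBuilder<K, V> {
            Map<K, V> build();
        }
    }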

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045592 Legacy Poster: Michael Kay (mhkay)

> and the probability of them being reused in subsequent queries is probably rather lower.

So perhaps this is a case where a true cache might be in order, with indexes being dropped on a least-recently-used basis. The management overhead could be appreciable, though.
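For what it's worth, a minimal sketch of such a least-recently-used cache in Java - just LinkedHashMap in access order, nothing Saxon-specific, with placeholder key and value types:

    import java.util.LinkedHashMap;
    import java.util.Map;

    class LruIndexCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruIndexCache(int maxEntries) {
            super(16, 0.75f, true);                  // accessOrder = true gives LRU ordering
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;              // drop the least-recently-used index
        }
    }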

RE: Saxon EE: Caching indexes across XQ calls - Added by Anonymous about 13 years ago

Legacy ID: #10045742 Legacy Poster: David Lee (daldei)

> and the probability of them being reused in subsequent queries is probably rather lower.

I suggest you cannot reasonably assert this. While a document lives in memory, what are the "probabilities" of certain expressions being re-evaluated? I would not assume it is any higher or lower simply because a different execution script (XQuery/XSLT/XPath) was performed. My gut feeling (again, not something I could assert) is that the opposite is true: a document kept in memory probably holds data which is used in multiple places, hence the low-level queries against it may well be reused as much in one script as another. That's certainly a common use case I have where indexing would be valuable. Imagine a "dictionary"-type document that different queries need random access to; the low-level queries are largely the same across 'scripts' (XQuery/XPath/XSLT programs). In fact the reason it's chosen to be held in memory instead of reparsed is precisely this. So I would suggest that precisely because a user chooses to keep a document in memory is a hint that many (possibly similar) queries against it are expected across multiple script invocations. Otherwise why bother pre-parsing it? But again, I can't prove any of this, just guessing.

So your statement about a 'true cache' is, I think, the right way of thinking about this, if the indexing overhead is truly unbounded. In fact I suggest this is true regardless of where you store the indexes and whether they are preserved across scripts or within one. An application workflow process will likely be doing approximately the same amount of document access over the lifetime of the app regardless of how the actual XQuery/XSLT/XPath scripts are divided up. So the unboundedness and indexing/caching problems are the same at the query or the document level. (But yes, that's supposition, not proof. However, I think it points to the concept of a true cache regardless of where it's stored.)

I've had very good luck using Java weak references to implement this kind of cache. However, to do so in a library instead of an application is somewhat 'egotistical', in that you're making the assumption that it's OK for your piece to take over as much memory as you want. Not so bad as hard references, though. I've had good luck with a simple bounded, weak-reference-backed 'most frequently used' array: that leverages weak references when you have plenty of VM but bounds their use so you don't overdo it. Anyway, thanks for thinking about this. -David
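As a rough sketch of the kind of bounded, weak-reference-backed structure described here (the class name, size and promotion strategy are invented; cached values can be reclaimed by the collector whenever nothing else references them):

    import java.lang.ref.WeakReference;

    class BoundedWeakCache<V> {
        private final Object[] keys;
        private final WeakReference<V>[] values;

        @SuppressWarnings("unchecked")
        BoundedWeakCache(int size) {
            keys = new Object[size];
            values = new WeakReference[size];
        }

        synchronized V get(Object key) {
            for (int i = 0; i < keys.length; i++) {
                if (key.equals(keys[i]) && values[i] != null) {
                    V v = values[i].get();                 // may be null after a GC
                    if (v != null) {
                        promote(i);                        // frequently used entries drift forward
                        return v;
                    }
                }
            }
            return null;
        }

        synchronized void put(Object key, V value) {
            int last = keys.length - 1;                    // new entries start at the back
            keys[last] = key;
            values[last] = new WeakReference<V>(value);
        }

        // Swap an entry with its neighbour so repeated hits move it toward slot 0.
        private void promote(int i) {
            if (i > 0) {
                Object k = keys[i - 1];
                keys[i - 1] = keys[i];
                keys[i] = k;
                WeakReference<V> r = values[i - 1];
                values[i - 1] = values[i];
                values[i] = r;
            }
        }
    }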
