Writing better XQuery scripts

Added by George Sofianos almost 8 years ago

We have some issues maintaining some old XQuery 1.0 scripts and writing new ones that perform fast on Saxon. The scripts are somewhat complex: they might use about 20 modules, for example, each serving a different purpose.

For example, some are used as a framework for generating an HTML result, others retrieve data from external sources such as SPARQL endpoints and RDF/XML files, and others contain functions to validate the data being retrieved. The final scripts can be up to 5,000 lines. We usually use Ant or a similar tool to combine these modules into one file, but only for deployment purposes; this step is not necessary for our workflow.

The problem is that these scripts tend to be really slow and also very memory hungry, so we are looking for ways to improve them substantially. We generate a lot of XML in the scripts, which is then passed as arguments to other functions in the modules, and I suspect this is one of the things that slows down our queries.
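A trimmed sketch of the pattern I mean (the function and element names are invented for this example): intermediate results get wrapped in new elements before being passed on, and as far as I understand each element constructor copies its content into a new tree.

    (: hypothetical sketch - names are made up :)
    declare function local:wrap-results($items as element()*) as element(results) {
      (: the constructor copies every $items subtree into a new tree :)
      <results>{ $items }</results>
    };

    declare function local:render($results as element(results)) as element(html) {
      <html><body>{ $results/* }</body></html>
    };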

I have been reading "Ten reasons why Saxon is fast" at http://sites.computer.org/debull/A08dec/saxonica.pdf, and there is a bullet that says "try to educate users on how to write code that works well on your product, but recognize that you will only reach a small minority". I have kept looking everywhere, but I can't find general rules that will improve performance. We have some scripts that can take hours to run on a 30+ MB file, and we can have files over 300 MB, so at the moment we can't execute our scripts at all. I've tried running with optimisation disabled, but that doesn't change anything. I've also tried evaluating the EE version, but I don't see any improvement on these specific scripts. I think our issue is fairly complex, since it involves a lot of processing and data manipulation, so I'm just asking about things we might have missed that could help us create more efficient scripts from a CPU and RAM perspective. Thanks


Replies (1)

RE: Writing better XQuery scripts - Added by Michael Kay almost 8 years ago

With performance, the devil is always in the detail. So performance analysis is a process of finding the details that matter. Here's what I would do to investigate:

(a) run with -t to get a top-level breakdown between compile time, document build time, and query execution time (example commands for (a) and (b) follow the list below).

(b) run with -TP:profile.html to see if the slowness can be localized to a particular function in your query. (I've just tried this and the output for XQuery is rather messy, but it's usable).

(c) run with different source document sizes and establish what the relationship is between document size and execution time. Is it linear, quadratic, or worse?
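For reference, the command lines for (a) and (b) might look roughly like this (the jar name, query file and source file are placeholders; adjust them for your installation):

    # (a) -t prints a breakdown of compile, document build and execution time
    java -cp saxon9ee.jar net.sf.saxon.Query -t -q:myquery.xq -s:source.xml -o:out.xml

    # (b) -TP writes a per-function profile to profile.html
    java -cp saxon9ee.jar net.sf.saxon.Query -TP:profile.html -q:myquery.xq -s:source.xml -o:out.xml

For (c), rerun the first command against source files of different sizes and compare the reported execution times.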

Next steps depend on what this reveals. If a query is taking hours to process 30 MB of input then I would suspect it is quadratic in data size, and steps (b) and (c) above will usually isolate this very quickly. Very often the Saxon-EE optimizer will reduce a quadratic query to a linear one (by use of hash joins, etc.) but not always.
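As a sketch of the kind of pattern I mean (the element names here are invented), a nested lookup like the one below touches every customer for every order, so execution time grows quadratically with document size unless the optimizer rewrites it as a hash join:

    (: hypothetical: each order scans all customers - quadratic if evaluated naively;
       Saxon-EE can often rewrite this kind of join to use a hash index :)
    for $order in /sales/order
    let $cust := /sales/customer[@id = $order/@custid]
    return
      <line customer="{ $cust/@name }" total="{ $order/@total }"/>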

The main thing that I always teach on performance courses is to focus your efforts on measurement. If you have a theory, devise a measurement to test it; use the measurements to discard wrong theories and develop new ones. Once you understand where the performance is going, fixing the problem is usually (but not always) easy.

A powerful technique is what I call "subtractive measurement". If you suspect something is taking a long time, stop doing it, and see if that makes a difference.
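For example (the function names here are made up), if you suspect a validation step, keep its signature but temporarily replace its body with a pass-through, then compare the run times:

    (: hypothetical stub for subtractive measurement :)
    declare function local:validate($records as element()*) as element()* {
      (: local:check-against-endpoint($records) :)  (: original call, disabled :)
      $records
    };

If the total time barely changes, the problem is elsewhere; if it drops sharply, you have localized it.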
