xml vs json

Added by Vladimir Nesterovsky 3 months ago

I beg your pardon in advance, as this discussion is not exactly Saxon-specific. I was trying to send a message to xsl-list at mulberrytech.com, but I got no responses, so it's hard for me to participate by only reading the archives.

So, my original question was:

Let's assume you're agnostic regarding input and output formats, so you're ready to work with either XML or logically equivalent JSON input and output data.

Then the question is: how will XSLT processing compare for two logically equivalent pipelines, one dealing with XML and the other with JSON?

We have several hypotheses that hint at an advantage for JSON:

  • JSON is lighter than XML to serialize and deserialize;
  • JSON stored as map(), array() and other item() values is lighter than node() at runtime; in particular, a subtree copy has zero cost with JSON;
  • templates with match patterns can, to some extent, be implemented efficiently for maps using function lookups.

To prove anything we need to run an experiment (we're going to use Saxon as the engine).
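To illustrate the zero-cost-copy hypothesis, here is a hypothetical fragment (not from any actual test suite): embedding an XML node in new content deep-copies the subtree, whereas placing an existing map inside a newly constructed map just stores a reference, since maps are immutable.

```xml
<!-- XML: xsl:copy-of deep-copies the matched subtree into the new element. -->
<xsl:template match="record">
  <wrapped>
    <xsl:copy-of select="."/>
  </wrapped>
</xsl:template>

<!-- JSON: the existing map is shared, not copied; maps are immutable,
     so wrapping one in a new map costs O(1). -->
<xsl:template match=".[. instance of map(*)]">
  <xsl:sequence select="map { 'wrapped': . }"/>
</xsl:template>
```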

So, our question to the community: is there an isolated small representative xslt around xml (along with xml files) for us to use as a model to build equivalent xslt around json?

Michael Kay has responded to it with:

All these points are valid, but I would add a couple more:

(a) XSLT 3.0 lacks a convenient way of constructing arrays.

(b) Pattern syntax for matching maps and arrays is rather limited compared with the syntax for matching nodes. This is partly because JSON lacks any concept equivalent to element names in XML: different types of object in JSON are identified by their internal structure, not by name.
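For concreteness, hedged sketches of points (a) and (b) (the key name 'kind' is made up): arrays can only be built with XPath expressions such as `array { ... }`, because XSLT sequence-constructing instructions flatten their results, and a map can only be matched by probing its internal structure.

```xml
<!-- (a) Sequences can be built with XSLT instructions, but arrays cannot;
     array construction has to happen in a single XPath expression: -->
<xsl:variable name="arr" select="array { 1 to 3 }"/>

<!-- (b) No name to match on: a map is identified by what it contains,
     e.g. by testing for a distinguishing key: -->
<xsl:template match=".[. instance of map(*)][?kind = 'course']">
  <!-- process a "course" object -->
</xsl:template>
```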

(c) The data model for JSON lacks a parent/ancestor axis, which means that template rules can't access information from outer containers; instead all the information required has to be passed down using parameters (typically tunnel parameters). A further complication is that parameter values can't be accessed in match patterns, so template rules cannot match content in a context-sensitive way.
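Point (c) in practice might look like the following hedged sketch (the keys 'courses', 'name' and 'title' are invented): since a map has no parent axis, context from the outer container has to be pushed down as a tunnel parameter, and the match pattern of the inner rule cannot itself refer to that parameter.

```xml
<!-- Outer object: pass context down, since children cannot reach it via an axis. -->
<xsl:template match=".[. instance of map(*)][?courses]">
  <xsl:apply-templates select="?courses?*">
    <xsl:with-param name="institution" select="?name" tunnel="yes"/>
  </xsl:apply-templates>
</xsl:template>

<!-- Inner object: $institution is available in the body, but the match
     pattern cannot use it, so matching cannot be context-sensitive. -->
<xsl:template match=".[. instance of map(*)][?title]">
  <xsl:param name="institution" tunnel="yes"/>
  <course institution="{$institution}" title="{?title}"/>
</xsl:template>
```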

As for your specific question, I published a couple of use cases for JSON transformations at XML Prague 2016 (https://www.saxonica.com/papers/xmlprague-2016mhk.pdf) and you may find these helpful. I can probably dig out the actual files I used if you are interested.

Thanks, I'll read your article carefully, and I'm definitely interested in the actual files and samples.

Ultimately, I would like to measure the performance of two independent transformations that solve the same task: one using XML, the other using JSON.

(c) The data model for JSON lacks a parent/ancestor axis, which means that template rules can't access information from outer containers;

My goal is to see whether this can be turned into an advantage, as it should allow efficient sharing of structures.


Replies (12)


RE: xml vs json - Added by Michael Kay 3 months ago

It seems I saved the files at https://github.com/Saxonica/Prague2016

Good luck with the project and please share the result. The main problem with this kind of exercise is that the results are likely to be very dependent on the precise workload you decide to analyse. For example, my experiments with the XX compiler showed that (for that workload) performance is very sensitive to the number of in-scope namespaces; it would be very easy to miss that this is one of the variables determining the outcome, because of course there are many others.

RE: xml vs json - Added by Vladimir Nesterovsky 3 months ago

Thank you!

I've cloned that repository and created a folder xml-vs-json.

I started my tests by preparing XML and JSON data for both courses and prices.
So, there are stylesheets that generate datasets for 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000 records. See input.

In addition, I've added create-input.bat to run those stylesheets.

convert-courses-via-maps.xsl is left unchanged, while convert-courses-via-xml.xsl is changed to deal purely with XML.
The same goes for convert-prices-via-maps.xsl vs convert-prices-via-xml.xsl.

I have added courses-map.bat, courses-xml.bat, prices-map.bat, prices-xml.bat. Those batch files run transformations for all different inputs.

I've run all transformations using saxon-he-11.3 for all inputs, and found that:

  • convert-prices-via-maps.xsl runs into stack overflow for big inputs;
  • initial impression is that json is still slower than xml.

I'm going to review and optimize stylesheets to avoid stack overflow and to try to make them faster, and then I'm going to compose final timing report.

RE: xml vs json - Added by Vladimir Nesterovsky 3 months ago

I have committed some fixes and optimizations, along with an Excel file with comparison charts. I attach it here as well.

RE: xml vs json - Added by Michael Kay 3 months ago

Interesting.

I'm not very surprised that the courses inversion takes longer with JSON, given that it has to make a flattened copy of the data in its initial pass. I wonder if there's a way of avoiding that?

It's a little surprising that the XML version increases so little between 50K and 100K records. I wonder how much of the cost is parsing? I think if you supply the document as the value of a global parameter then it will only be parsed once and excluded from the measurement. But that's probably harder to achieve with the JSON version. Another way to get the parsing cost is to parse the document twice, but caching makes it a bit tricky.
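A hedged sketch of that idea (the parameter name is made up): accept the already-parsed document as a stylesheet parameter, so that parsing happens once, outside the measured transformation.

```xml
<!-- Sketch: the input arrives pre-parsed, so parsing is excluded from timing. -->
<xsl:param name="input" as="document-node()"/>

<xsl:template name="xsl:initial-template">
  <xsl:apply-templates select="$input/*"/>
</xsl:template>
```

For the JSON version, the parameter could presumably be declared as map(*) and populated once via parse-json(), though, as noted, that is harder to arrange from the command line.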

In the convert-prices stylesheet, what proportion of the records are being updated? This one really should be quite efficient for both JSON and XML and that appears to be the case. A significant difference is that in the XML case, it doesn't build the result document in memory (the results are piped directly into the serializer as they are constructed), whereas in the JSON case, we're going to build the result structure in memory and then pass it to the serializer.

RE: xml vs json - Added by Michael Kay 3 months ago

An observation: we have two internal implementations of maps, the Dictionary implementation which is optimised for read-only maps with string-valued keys, and the HashTrie implementation which allows any kind of key and supports update. Maps created using json-doc() or parse-json() are always Dictionary maps, which means each map that gets updated is going to be copied to a HashTrie map first. Probably not a big deal, but it means we're probably not choosing the optimum implementation for the workload. I don't think there's currently any switch to configure this.

RE: xml vs json - Added by Vladimir Nesterovsky 3 months ago

In the convert-prices stylesheet, what proportion of the records are being updated?

I defined 11 tags, only one of which is updated. Generated documents use up to half of the tags. fn:random-number-generator() is used to produce tag permutations.

If I'm not mistaken, this means close to 10% of the records are updated.
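For reference, a sketch of how fn:random-number-generator() can produce such permutations (the tag names here are invented, not the ones from the actual test data):

```xml
<xsl:variable name="tags" select="('new', 'sale', 'promo', 'clearance')"/>
<!-- ?permute returns a random permutation of the argument sequence -->
<xsl:variable name="shuffled" select="random-number-generator()?permute($tags)"/>
<!-- take a prefix of the shuffled tags, here up to half of them -->
<xsl:variable name="chosen" select="subsequence($shuffled, 1, 2)"/>
```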

RE: xml vs json - Added by Vladimir Nesterovsky 3 months ago

It's a little surprising that the XML version increases so little between 50K and 100K records. I wonder how much of the cost is parsing?

I checked this and found typos from when I refactored convert-courses-via-xml.xsl. The primary one was that it used <xsl:output method="text"/>, as in the original stylesheet.

When I changed it to method="xml", the results changed.

Attached is the updated timing.

RE: xml vs json - Added by Michael Kay 3 months ago

So, remarkably close results!

RE: xml vs json - Added by Vladimir Nesterovsky 3 months ago

Yes. This typo also clearly shows that as the size of the data increases, IO dominates over the implementation technique.

RE: xml vs json - Added by Vladimir Nesterovsky about 2 months ago

Thanks.
I've come to conclusions similar to those you make in your article,
in particular that the format (XML or JSON) is not the deciding factor for performance.
The difference won't be dramatic.

In contrast, IO should be scrutinized a lot.

E.g. we got a big speed-up once we implemented Saxon resolvers that know how to read and write directly into ZIP archives.
See zipfilesystemprovider.html

RE: xml vs json - Added by Michael Kay about 2 months ago

There are certainly plenty of workloads where parsing and serialization costs are an order of magnitude higher than the transformation cost. We tend to neglect this because there's not much we can do to improve them.
