Bug #3443

closed

Performance degradation between 9.2 and 9.8

Added by Michael Kay over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2017-09-13
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
9.8
Fix Committed on Branch:
9.8
Fixed in Maintenance Release:
Platforms:

Description

A user has submitted an XSLT transform that takes 102s under 9.2, increasing to 241s under 9.8.

Summary: the main problem was that with hotspot bytecode generation enabled (the default setting for Saxon-EE), a recursive function that built a tree by making recursive calls to copy a source tree was being executed in pull mode rather than push mode, resulting in an excessive amount of tree-to-tree copying.

Actions #2

Updated by Michael Kay over 6 years ago

I've taken Java profiles under both 9.2 and 9.8 and they are very different, but no dramatic hotspots stand out. What's fairly noticeable in both cases is that tree construction features quite a lot.

9.8 seems to be making a lot of calls on Configuration.getBooleanProperty, which aren't cheap because they involve a hash table lookup. I've often wondered whether this could be a performance hotspot but I've never seen it appear in a profile before. The properties being repeatedly read are:

getBooleanProperty http://saxon.sf.net/feature/markDefaultedAttributes

getBooleanProperty http://saxon.sf.net/feature/use-typed-value-cache

getBooleanProperty http://saxon.sf.net/feature/eagerEvaluation

Many of these calls are also associated with tree construction.

Adding some counters, 9.8 creates 50.2M instances of TinyTree, while 9.2 creates only 15K instances. So this is probably going to be the critical factor.

Actions #3

Updated by Michael Kay over 6 years ago

A typical call to create a TinyTree is coming from int:clone at sharedFunctions.xsl#61, itself called from the xsl:for-each-group at createPtcarBlocks.xsl#287.

int:clone() creates many tiny parentless elements.

At one stage (certainly in 9.2) we attempted to reuse the TinyTree data structure when creating sequences of element nodes, to avoid the overhead of creating a new TinyTree structure for each new element. This proved complex: it was error-prone, it became a performance bottleneck when we started introducing multi-threading, and measurements indicated that the benefit was very modest, so we dropped it. But I suspect it's what's causing the difference here.

I patched 9.2 to stop it reusing the TinyTree structure; it now creates 49M trees and takes 220s, so I think this is probably the key difference.

Actions #4

Updated by Michael Kay over 6 years ago

I tried running 9.8 with -tree:linked and it froze under memory pressure.

Actions #5

Updated by Michael Kay over 6 years ago

I tried rewriting the int:clone function to use push processing:

<xsl:function name="int:clone" as="element()*">
        <xsl:param name="pItem" as="element()*"/>
        <xsl:param name="pValidFromDate" as="xs:date"/>
        <xsl:param name="pValidToDate" as="xs:date"/>
        <xsl:apply-templates select="$pItem" mode="clone">
            <xsl:with-param name="pValidFromDate" select="$pValidFromDate" as="xs:date" tunnel="yes"/>
            <xsl:with-param name="pValidToDate" select="$pValidToDate" as="xs:date" tunnel="yes"/>
        </xsl:apply-templates>
    </xsl:function>
    <xsl:template match="*" mode="clone">
        <xsl:param name="pValidFromDate" as="xs:date" tunnel="yes"/>
        <xsl:param name="pValidToDate" as="xs:date" tunnel="yes"/>
        <xsl:element name="{name()}">
            <xsl:sequence select="@* except (@validFromDate, @validToDate)"/>
            <xsl:if test="exists(@validFromDate) and exists($pValidFromDate)">
                <xsl:attribute name="validFromDate" select="int:max(($pValidFromDate, @validFromDate))"/>
            </xsl:if>
            <xsl:if test="exists(@validToDate) and exists($pValidToDate)">
                <xsl:attribute name="validToDate" select="int:min(($pValidToDate, @validToDate))"/>
            </xsl:if>
            <xsl:apply-templates select="child::*" mode="clone"/>
        </xsl:element>
    </xsl:template>

This brings the elapsed time down to 144s. Interestingly, it still creates 1.5M TinyTree instances. We can probably bring it down further by replacing critical calls on int:clone by apply-templates processing.

The reason push processing is faster is that it avoids copy operations. xsl:element in push mode writes a start tag to the result, then invokes its body in push mode, then writes an end tag to the result; unlike the pull-mode function call, the child nodes are written incrementally to the final output rather than being created as intermediate results and then copied repeatedly.
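The cost difference can be sketched abstractly. In this toy model (not Saxon code), building a chain of n nested elements in pull mode re-copies each intermediate subtree into its parent, giving quadratic node-copy work, while push mode streams every event to a single builder and never copies:

```java
// Toy model (not Saxon code) contrasting the two evaluation strategies for
// building a chain of n nested elements.
public class PushVsPull {

    /** Pull mode: each recursive call materializes its own subtree, which the
     *  caller must then copy node-by-node into the tree it is building.
     *  Returns the total number of node-copy operations: 1 + 2 + ... + (n-1). */
    public static int pullCopies(int n) {
        if (n <= 1) return 0;
        return pullCopies(n - 1) + (n - 1);  // child's (n-1)-node tree is copied in
    }

    /** Push mode: every call writes start tag / body / end tag straight to the
     *  shared builder, so no intermediate tree is ever copied. */
    public static int pushCopies(int n) {
        return 0;
    }
}
```

For a chain of 50 elements pull mode already performs 1225 node copies; the real stylesheet multiplies this effect across millions of function calls.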

In principle function calls can also be evaluated in push mode. We should look at whether this is achievable for this case.

Actions #6

Updated by Michael Kay over 6 years ago

Changing the call on int:clone() within function int:getRouteEdgeForPtcar to use apply-templates gives no further improvement.

Actions #7

Updated by Michael Kay over 6 years ago

I now tried running with -TP:profile.html and got an exception:

java.lang.IllegalStateException: *** Internal Saxon error: local variable encountered whose binding has been deleted

Dealing with this separately as bug #3444.

Actions #8

Updated by Michael Kay over 6 years ago

I now have the profile.html output. It shows the newly-introduced template (with mode="clone") as the dominant component, with 49,658,326 invocations, accounting for about 70% of total execution time (147s). Next is int:transformToExportPtcar with only 1851 invocations but a net execution time of 36s.

This function has a similar "pull" style to it, calling further functions to construct the children and then copying them. It looks to me as if some of these function calls would benefit from inlining.

Actions #9

Updated by Michael Kay over 6 years ago

I shall turn now to the question of why the critical function is being evaluated in "pull" mode (using its iterate() method), rather than in push mode (using its process() method). A recursive function constructing a tree in this way would certainly benefit from using push mode.

Generally execution tries to avoid changes in polarity, because a change in polarity (e.g. from pull to push) typically involves allocating memory to store intermediate results. In this case the function is being evaluated in the course of evaluating the variable at line 97:

<xsl:variable name="nodesByPtcar"
      select="int:getNodeByPtcarForPtcarVersion(@id, @validFromDate, @validToDate)[@validToDate >= $purgeDate]"/>

The variable is evaluated lazily, which means that in the first instance a Closure is placed on the stack; and when a Closure is eventually evaluated, it always evaluates the underlying expression in pull mode (that is, by calling iterate()). It then continues in pull mode until reaching the UserFunctionCall, which invokes the call() method on the UserFunction. The UserFunction then makes its own decision about whether to use pull or push evaluation. The decision is made by calling ExpressionTool.lazyEvaluationMode(), which never chooses push mode. What this means in effect is that a function is only evaluated in push mode if the function call itself is evaluated in push mode.

This logic has not changed much since 9.2, and 9.2 also evaluates the function in pull mode.

I have tweaked UserFunction.computeEvaluationMode() so that it always chooses push evaluation if the body of the function consists of node-construction instructions (such as xsl:element). (Needs further work to get the rules right, but it's working for this case.) This brings the 9.8 execution time down to 166s, which is a substantial improvement but still short of the 9.2 figure.
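The tweak can be caricatured as follows; the method and instruction names mirror the prose, but Saxon's real computeEvaluationMode() rules are richer than this:

```java
// Caricature of the tweak described above: prefer push (process()) evaluation
// when the function body is a node-construction instruction. Illustrative
// only -- Saxon's real decision logic needs further refinement, as noted.
import java.util.Set;

public class EvaluationModeChooser {

    public enum Mode { PULL_ITERATE, PUSH_PROCESS }

    // stand-in for an instanceof test against Saxon's instruction classes
    private static final Set<String> NODE_CONSTRUCTORS =
            Set.of("xsl:element", "xsl:copy", "xsl:attribute", "xsl:document");

    public static Mode computeEvaluationMode(String bodyInstruction) {
        return NODE_CONSTRUCTORS.contains(bodyInstruction)
                ? Mode.PUSH_PROCESS
                : Mode.PULL_ITERATE;
    }
}
```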

We're still creating 1.5M instances of TinyTree which is an improvement on the original 50M but well short of the 15K instances created under 9.2.

To reduce this further we would need to evaluate a lot more of the functions in push mode. For example, int:clone is called from the function at line 270:

<xsl:function name="int:getPtcarOrganizationForPtcarVersion" as="element()*">

    <xsl:param name="pPtcarId" as="xs:integer"/>

    <xsl:param name="pValidFromDate" as="xs:date"/>

    <xsl:param name="pValidToDate" as="xs:date"/>

    <xsl:sequence

            select="int:clone($ptcarOrganizationXML/key('by-ptcarId', string($pPtcarId))[$pValidToDate ge xs:date(@validFromDate) and xs:date(@validToDate) ge $pValidFromDate], $pValidFromDate, $pValidToDate)"/>

</xsl:function>

which is evaluated in pull mode, meaning it creates a new TinyTree to hold the results of int:clone on each call. If we follow the principle that this should be in push mode because it constructs new nodes, we have to do cross-function analysis which of course runs into problems when the function calls are recursive.

Actions #10

Updated by Michael Kay over 6 years ago

I'm thinking now about a completely different way of doing this. Probably too radical to contemplate for a maintenance release, but we'll see.

The reason pull-mode tree construction is expensive is that copying nodes is expensive. When a node is created as a TinyTree, attaching it as a child to another node means physically rebuilding the whole data structure. What if we could use a model where copying is cheap, or perhaps even free?

Suppose we call a function, and it returns a LinkedTree instance, and we want to use that in the content of a larger tree: then in principle we can just graft it in by changing a couple of pointers. Actually it's not quite as easy as that, because the LinkedTree gives every node a sequence number, used for fast sorting of nodes into document order. So it might need to be something slightly different from the LinkedTree.

The object we use to represent this tree fragment does not in fact need to support the full NodeInfo interface. It only needs to support two operations: (a) the ability to add it as a child to a larger tree, and (b) the ability to turn itself into a fully-fledged NodeInfo eventually, for example when it is saved as a variable. The nodes don't need to have unique identifiers, and they don't need to have parent pointers, and they don't need to support XPath navigation. The operation of turning it into a full NodeInfo can involve a physical rebuild into a TinyTree (or any other representation).
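A minimal sketch of that two-operation contract, with illustrative names (this is not Saxon's NodeInfo API):

```java
// Illustrative sketch (not Saxon's NodeInfo API) of a tree fragment that
// supports only the two operations described above: grafting and
// materialization. No node identity, no parent pointers, no XPath axes.
import java.util.ArrayList;
import java.util.List;

public class Fragment {
    private final String name;
    private final List<Fragment> children = new ArrayList<>();

    public Fragment(String name) { this.name = name; }

    /** (a) Graft this fragment into a larger tree: a pointer change, no copy. */
    public void attach(Fragment child) { children.add(child); }

    /** (b) Materialize a full representation on demand, e.g. when the value
     *  is saved in a variable; the one physical rebuild happens here. */
    public String materialize() {
        StringBuilder sb = new StringBuilder("<").append(name).append(">");
        for (Fragment c : children) sb.append(c.materialize());
        return sb.append("</").append(name).append(">").toString();
    }
}
```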

For this use case, there are still a few glitches where full copying will happen. For example at line 64 we have:

         <xsl:for-each select="$pPtcar">
            <xsl:variable name="locations"
                          select="int:getLocationForPtcarVersion(@id, @validFromDate, @validToDate)[@validToDate >= $purgeDate]"/>
            <xsl:copy>
                <xsl:sequence select="@*, node()"/>
                <xsl:if test="exists($locations)">
                    <xsl:element name="int:locations">
                        <xsl:sequence select="$locations"/>
                    </xsl:element>
                </xsl:if>
            </xsl:copy>
        </xsl:for-each>

Here the variable $locations can't be inlined, because there are two references. So this would involve physical construction of a temporary tree. If we're smart about it, though, the optimizer could turn this into:

         <xsl:for-each select="$pPtcar">
            <xsl:copy>
                <xsl:sequence select="@*, node()"/>
                <xsl:where-populated>
                    <xsl:element name="int:locations">
                        <xsl:sequence select="int:getLocationForPtcarVersion(@id, @validFromDate, @validToDate)[@validToDate >= $purgeDate]"/>
                    </xsl:element>
                </xsl:where-populated>
            </xsl:copy>
        </xsl:for-each>

which eliminates the need for the copy operation. (Caveat: this only works because we have enough static type information to know that getLocationForPtcarVersion cannot return attribute nodes.)

Until 9.7 we had an option called "lazy tree construction" that was a little similar in its approach. It fell into disuse and was dropped in 9.8 (it was only used if explicitly requested, and was not well documented). This was slightly different in that xsl:element created an "Unconstructed Element" object which, like a Closure, was a promise to create the element on demand, with a reference to the xsl:element instruction and a copy of the dynamic context to be used. One of the problems with this is the cost of maintaining the dynamic context information.

Actions #11

Updated by Michael Kay over 6 years ago

Note: I have just seen that the problem that led to us stopping the use of a single TinyTree to hold sequences of parentless elements is bug #2220: a thread safety problem.

I'm looking now at places where ElementCreator.evaluateItem() is called: this essentially forces an element node to be physically constructed, meaning there is the possibility that it will immediately need to be copied.

In most cases we are immediately putting the constructed element in a variable, so there's not much we can do to avoid constructing it.

Here's an example where we could do better:

<xsl:function name="int:addValidities" as="element()*">
        <xsl:param name="pPtcar" as="element()*"/>
        <xsl:for-each select="$pPtcar">
            <xsl:variable name="validities"
                          select="int:extractValidities((nm:ptcarOrganizations/nm:ptcarOrganization, int:locations/int:location, int:subLocations/int:subLocation, int:nodes/*, int:trackEdges/int:trackEdge, int:routeEdges/int:routeEdge), @validFromDate, @validToDate)[@validToDate >= $purgeDate]"/>
            <xsl:copy>
                <xsl:sequence select="@*, node()"/>
                <xsl:element name="int:validities">
                    <xsl:sequence select="$validities"/>
                </xsl:element>
            </xsl:copy>
        </xsl:for-each>
    </xsl:function>

With the changes made earlier, we now mark this function to be evaluated in push mode. Moreover, the variable $validities has been eliminated, so we could potentially call int:extractValidities() in push mode. However, we do need to materialize the result, because the filter expression examines attributes on the constructed node before deciding whether to retain it.

It's interesting to note that int:extractValidities is returning a set of date ranges, in the form of a sequence of parentless elements with the start/end dates represented as attributes. Using a sequence of maps here would be much better! But really, the old (9.2) mechanism of using a single TinyTree structure to hold a sequence of parentless elements worked very well for this kind of scenario (except for the thread safety issues!). Perhaps we could reinstate something similar, especially for the case where we can statically tell that we are creating a sequence of parentless element nodes.

Actions #12

Updated by Michael Kay over 6 years ago

I'm now wondering whether we could make the copy operation faster, where one TinyTree has to be copied to create child nodes in another. Could we recognize the situation and do some kind of bulk append, modifying offsets by an appropriate amount?

At present, given code like that at line 65:

             <xsl:variable name="locations"
                          select="int:getLocationForPtcarVersion(@id, @validFromDate, @validToDate)[@validToDate >= $purgeDate]"/>
            <xsl:copy>
                <xsl:sequence select="@*, node()"/>
                <xsl:if test="exists($locations)">
                    <xsl:element name="int:locations">
                        <xsl:sequence select="$locations"/>
                    </xsl:element>
                </xsl:if>
            </xsl:copy>

what is happening is this:

The xsl:sequence results in a call of VariableReference.process(), which iterates over the contents of the variable (evaluating it incrementally in a MemoClosure, as it happens). For each item, we call item.append(out,...) where out is a SequenceOutputter. TinyElementImpl.copy() calls ComplexContentOutputter.startElement(), which buffers all the information. It then sends namespaces (in bulk) and attributes (10 of them, individually), which the CCO also buffers. Then it calls CCO.startContent(), which unbuffers all the data, does some checking, and sends it on as a sequence of calls to the NamespaceReducer. The NamespaceReducer does more checking and then passes the calls on to the TinyBuilder for the target tree.

Potentially, the TinyElement.copy() method could ask the PipelineConfiguration whether there is a TinyBuilder at the end of the pipeline, and if there is, it could call a bulkCopyElement() method on the TinyBuilder which would simply copy the relevant parts of the TinyTree vectors to the new destination:

nodeKind - unchanged

depth - fixed delta added to each value

next - fixed delta added to each value

alpha - varies by node kind; adds a delta that varies with the node kind

beta - ditto

nameCode - fingerprint remains unchanged, prefix code needs to be recalculated using the prefixPool of the target tree

prior - not needed since the target tree is still under construction

typeArray - unchanged assuming validation=preserve

typedValueArray - ditto

idRefElements, idRefAttributes, nilledElements, defaultedAttributes, idTable, entityTable - not sure, check the rules

attParent - apply a fixed delta

attCode - as for nameCode

attValue - copy unchanged

attTypedValue - copy unchanged assuming validation=preserve

attType - copy unchanged if validation = preserve

namespaceParent - apply fixed delta

namespaceBinding - copy unchanged assuming copyNamespaces="all". The root element being copied may need some namespace undeclarations added.

lineNumbers, columnNumbers, systemIdMap - dropped during copy

This sounds feasible, and there is no doubt it would be vastly faster than the current process, which is highly incremental, does much more validation than is needed, and which we can see accounts for a significant part of the cost.
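The core of the idea is simple array arithmetic over the TinyTree's parallel vectors. Here is a sketch of the delta-adjusted append for one index-valued vector (illustrative only: the real bulkCopyElement() would also have to handle the per-node-kind alpha/beta rules, prefix recoding, and the id/idref tables):

```java
// Sketch of the bulk append over one of the TinyTree's parallel vectors:
// copy a slice of the source array onto the end of the destination, adding a
// fixed delta to each index-valued entry (as for depth, next, attParent...).
import java.util.Arrays;

public class BulkCopy {

    /** Append src[start .. start+len) to dest (of which `used` entries are
     *  occupied), adding `delta` to every copied value. */
    public static int[] appendWithDelta(int[] dest, int used,
                                        int[] src, int start, int len, int delta) {
        int[] out = Arrays.copyOf(dest, used + len);
        for (int i = 0; i < len; i++) {
            out[used + i] = src[start + i] + delta;
        }
        return out;
    }
}
```

The attraction is that this replaces millions of per-event pipeline calls with a handful of System.arraycopy-style loops per subtree.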

Actions #13

Updated by Michael Kay over 6 years ago

I have done a trial implementation of the "bulk copy" fast path for copying a subtree from one TinyTree to another; the saving is worthwhile (from 154s to 141s) but not as significant as I would have hoped.

Actions #14

Updated by Michael Kay over 6 years ago

I am now looking at the effect of bytecode generation. With the limit on the number of items processed set to 500, I am seeing execution time with bytecode off of 49s, but with bytecode either on (everywhere) or defaulted (to hotspot bytecode generation), the time is around 90s.

The new -TB option produces a report on the effectiveness of bytecode generation. The results are difficult to interpret: perhaps some of the bytecode fragments are so small that the measurement distorts the numbers. But there are many instances throughout this stylesheet of predicates such as

$pValidToDate ge xs:date(@validFromDate) and xs:date(@validToDate) ge $pValidFromDate

or

$vv:v7 >= @validFromDate and @validToDate >= $vv:v8

Some of these show a good speed-up (e.g. a ratio of 0.342 or in one case 0.052), and some of them show bytecode being worse (ratios such as 1.7 or 2.1). It would be interesting to know whether these numbers, and the variations between them, have any significance.

Most of the cases where the ratio is bad (the worst is 3.3) show a low execution count, so perhaps things have not had a chance to settle down (the JVM's hotspot compilation to optimize the bytecode that Saxon generates will only kick in some time after Saxon generates it). A curiosity is that for a surprising number of the poorly-performing bytecode fragments, the number of executions is precisely 347.

Note that the stylesheet writes this predicate in essentially two different ways: in some cases the attributes are converted to xs:date() values and compared as dates; in other cases they are compared as strings. Also, the operator is sometimes written ">=" and sometimes "ge". Since the date values all have 4-digit years with no timezone, either technique will give the same result. One would expect that converting the attributes to xs:date values is more expensive (a) because of the extra validation performed, and (b) because string comparison can exit as soon as two characters are found to be different. The actual string-to-date conversion will not benefit from byte-code generation, since the byte-code simply calls a library routine in the Saxon run-time.
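The claim that the two spellings agree for this data can be checked directly. Assuming dates written as YYYY-MM-DD with 4-digit years and no timezone, lexicographic string order coincides with date order:

```java
// Direct check of the equivalence noted above: for dates written as
// YYYY-MM-DD with 4-digit years and no timezone, cheap lexicographic string
// comparison orders the values the same way as full date comparison.
import java.time.LocalDate;

public class DateCompare {

    /** String comparison: can exit at the first differing character. */
    public static boolean leAsString(String a, String b) {
        return a.compareTo(b) <= 0;
    }

    /** Date comparison: pays for parsing and validation on every call. */
    public static boolean leAsDate(String a, String b) {
        return !LocalDate.parse(a).isAfter(LocalDate.parse(b));
    }
}
```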

This stylesheet is generally very good at declaring the types of variables and parameters but there are a few exceptions. For example at line 99 we have the predicate [@validToDate >= $purgeDate] where $purgeDate is a global variable with undeclared type. It is actually a singleton (untyped) attribute node but the code has to assume that it is a sequence of attributes.

So far this isn't shedding any light on why the bottom-line number should be worse with bytecode. I decided to repeat the experiment with the full dataset (limit=9999). Bytecode off: 121s. Bytecode defaulted: 208s. Not good!

I noticed something interesting in the bytecode report. Around line 470 there are a number of functions similar to this one:

<xsl:function name="int:transformToExportRouteEdge" as="element()*">
        <xsl:param name="pRouteEdge" as="element()*"/>
        <xsl:for-each select="$pRouteEdge">
            <xsl:sort select="@id"/>
            <xsl:copy>
                <xsl:sequence select="@* except (@validFromDate, @validToDate), node()"/>
            </xsl:copy>
        </xsl:for-each>
    </xsl:function>

The body of this function, which appears in the report as sort($pRouteEdge)!xsl:copy is being executed 99 times in push mode, and 1448 times in pull mode. Because the first 99 executions are interpreted, this tells us that we are using push mode during interpreted evaluation but not during compiled evaluation.

Actions #15

Updated by Michael Kay over 6 years ago

I've been focusing on calls of the expression sort($pSubLocation)!xsl:copy at line 483 (the body of function int:transformToExportSubLocation), using --generateByteCode:10 to force hotspot bytecode generation to happen earlier (but not immediately).

This is called from line 447:

                <xsl:for-each select="int:subLocations">
                    <xsl:copy>
                        <xsl:sequence select="int:transformToExportSubLocation(*)"/>
                    </xsl:copy>
                </xsl:for-each>

In the first few calls, we have ForEach.processLeavingTail calling ElementCreator.processLeavingTail calling UserFunctionCall.process, calling UserFunction.process, calling ByteCodeCandidate.process - push all the way.

Then (after various fragments have been compiled to bytecode), the next call is from a Closure:

at com.saxonica.ee.bytecode.ByteCodeCandidate.targetExpression(ByteCodeCandidate.java:162)

at com.saxonica.ee.bytecode.ByteCodeCandidate.iterate(ByteCodeCandidate.java:275)

at net.sf.saxon.value.Closure.iterate(Closure.java:156)

at gen_SlashContextMappingFunction_820.map(file:/Users/mike/bugs/2017/pardon/perf/test-2/xsl/createPtcarBlocks.xsl:464)

I suspect that the issue is that UserFunctionCallCompiler never generates a push-mode function call. It does not have a compileToPush() method, so it inherits the generic method, which compiles the function call in pull mode, and for each item returned, pushes this to the destination.

I have added code to compile function calls in push mode. Sadly it makes very little difference: with limit=500, we are seeing execution time of 50s with bytecode off, 86s with bytecode=100 (hotspot).

Reinstating the TinyTree counters, we are getting 565K trees created with bytecode off, 19M trees created with bytecode=100.

With bytecode off: time 50s, trees created 565K, push function calls 18.6M, pull function calls 1.2M.

With bytecode=100 (hotspot): time 86s, trees created 19.2M, push function calls 18.6M, pull function calls 1.3M.

So we've succeeded in getting it to use push function calls more extensively, but it hasn't helped in reducing the number of trees created.

Actions #16

Updated by Michael Kay over 6 years ago

I measured the number of Closures created: 863K with bytecode off, 384K with bytecode on. Not what I expected!

The figures for bytecode enabled suggest that in most cases, we are creating a new tree on every function call, even though the function call is in push mode.

It seems from some sampling in the debugger that although we have compiled the function body in "push" mode, we have generated code with no "process" method, which is therefore inheriting a process method that actually pulls.

But I've monitored the compilation process, and the only cases where we are compiling an expression for push execution where the expression has a fallback process method are cases where the interpreted code also falls back to pull execution (AxisExpression, VennExpression, FilterExpression).

Further tracing shows that the recursive int:clone method is definitely being called in push mode, but it is definitely creating a new tree on each invocation.

Actions #17

Updated by Vladimir Nesterovsky over 6 years ago

Please allow me to comment on the issue, and ignore if it's not relevant.

From what I understand the problem is related to the cost of copying of trees.

What if you refactored node storage so that a node stores only its children and attributes, with the parent held only in a node pointer?

Something like this:

Node { ... }

NodeRef { NodeRef parent; Node ref; }

This would allow subtrees to be shared at zero cost.

Though it's a radical step, it might increase performance and even save memory.

Actions #18

Updated by Michael Kay over 6 years ago

Vladimir, you have captured my current thinking entirely! In fact I already have a prototype implementation of this concept produced a few months ago. It was inspired by the talk I gave last year in Prague about transformation of JSON structures and the realisation that JSON tree models typically have no parent pointers, but by remembering how a node was reached, we can still navigate back upwards in the tree. We've also been working with similar ideas in the Saxon-JS environment, where we have no push-processing model internally: here we are using DOM, so the tree model does have parent pointers, but we try to avoid copying by marking the root of a tree as "under construction" until such time as it is committed to a particular parent.
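The Node/NodeRef scheme can be sketched as follows (illustrative names, not Saxon's actual classes): the stored tree has only downward links, and the parent is carried by a lightweight reference created during navigation, so one subtree can appear under many parents without copying:

```java
// Illustrative sketch of the Node/NodeRef idea discussed above: nodes store
// only children; the parent lives in the navigation handle, so subtrees can
// be shared under many parents at zero copy cost.
import java.util.List;

public class SharedTree {

    /** Immutable node: children (and, in a full model, attributes) only;
     *  no parent pointer, so the node can be shared freely. */
    public static class Node {
        public final String name;
        public final List<Node> children;
        public Node(String name, List<Node> children) {
            this.name = name;
            this.children = children;
        }
    }

    /** Navigation handle: remembers how the node was reached, which is what
     *  lets us walk back up a tree whose nodes store no parent. */
    public static class NodeRef {
        public final NodeRef parent;  // null at the root
        public final Node ref;
        public NodeRef(NodeRef parent, Node ref) {
            this.parent = parent;
            this.ref = ref;
        }
        public NodeRef child(int i) {
            return new NodeRef(this, ref.children.get(i));
        }
    }
}
```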

Actions #19

Updated by Michael Kay over 6 years ago

Coming back to this hopefully with a clearer head.

If I run with limit=5 (limiting the number of ptcar elements processed), and monitoring the number of TinyTree instances created, then the number of trees depends on the generateByteCode setting as follows:

bytecode:on 5423 trees (time 13.96s)

bytecode:off 5423 trees (time 14.46s)

bytecode defaulted (i.e. hotspot bytecode) 329754 trees (time 16.14s)

Execution times reflect this, though the variation is not as dramatic. Interestingly the performance ratio gets worse as the limit value is increased: with limit=5 we see an increase from 14s to 16s, for limit=50 the increase is from 18s to 31s, for limit=500 it is from 63s to 190s.

Looking at the situation with bytecode defaulted (that is, hotspot bytecode generation), I've done a sample of printStackTrace() calls to see where the trees are being created. A typical trace looks like this:

	at net.sf.saxon.tree.tiny.TinyTree.<init>(TinyTree.java:185)
	at net.sf.saxon.tree.tiny.TinyBuilder.open(TinyBuilder.java:123)
	at net.sf.saxon.event.ProxyReceiver.open(ProxyReceiver.java:86)
	at net.sf.saxon.event.ComplexContentOutputter.open(ComplexContentOutputter.java:209)
	at net.sf.saxon.event.SequenceWriter.createTree(SequenceWriter.java:125)
	at net.sf.saxon.event.SequenceWriter.startElement(SequenceWriter.java:179)
	at gen_SlashContextMappingFunction_21.map(file:/Users/mike/bugs/2017/pardon/perf/test-2/xsl/sharedFunctions.xsl:75)
	at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:61)
	at com.saxonica.ee.bytecode.GeneratedCode.process(GeneratedCode.java:96)
	at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:84)
	at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:134)
	at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:650)
	at gen_SlashContextMappingFunction_21.map(file:/Users/mike/bugs/2017/pardon/perf/test-2/xsl/sharedFunctions.xsl:75)
	at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:61)
	at com.saxonica.ee.bytecode.GeneratedCode.process(GeneratedCode.java:96)
	at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:84)
	at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:134)
	at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:650)
	at gen_SlashContextMappingFunction_21.map(file:/Users/mike/bugs/2017/pardon/perf/test-2/xsl/sharedFunctions.xsl:75)
	at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:61)
	at com.saxonica.ee.bytecode.GeneratedCode.process(GeneratedCode.java:96)
	at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:84)
	at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:134)
	at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:650)
	at gen_SlashContextMappingFunction_21.map(file:/Users/mike/bugs/2017/pardon/perf/test-2/xsl/sharedFunctions.xsl:75)
	at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:61)
	at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:62)

The stacktrace entries for generated bytecode helpfully give us the source location in the original XSLT code, pointing the blame clearly at the recursive calls on int:clone(). The tree creation is happening despite the fact that the function, and its compiled body, are being called in push mode.

A SequenceWriter creates a new tree if startElement() is called and there is no existing tree to write to. So the implication is that we are creating a new SequenceWriter when we should be using an existing one. The SequenceWriter is created as a consequence of the generated bytecode calling SeqiemceOutputter.allocateSequenceOutputter(). This call appears to be generated by ToPushCompiler.compileToIterator(), implying that we are compiling something that constructs nodes in pull mode. A debugger breakpoint in the bytecode generation suggests this is happening when we compile the xsl:for-each instruction that is the body of the int:clone() function. We are compiling this xsl:for-each instruction in pull mode because the evaluation that triggers the bytecode generation is an iterate() evaluation, and the reason it is an iterate() evaluation is that we are evaluating a Closure (a lazily-evaluated variable)

With eager bytecode generation, we are generating both a process() and an iterate() method for the ForEach instruction, so that at run-time, we can use whichever is more appropriate (XSLFunction line 487). With just-in-time generation, we are choosing one of these methods to generate, and in this case we are choosing the wrong one.

The remedy would appear to be that when generating a ByteCodeCandidate, we should be able to specify what evaluation modes should be supported.

Actions #20

Updated by Michael Kay over 6 years ago

I have made the change so the body of a function is always compiled with iterate() and process() methods. (An evaluateItem() or effectiveBooleanValue() method will be generated if the function is called in that mode).

The number of trees created, and the execution time, are now aligned with the bytecode:on and bytecode:off figures.

The elapsed time for the full run is now 132s. This is still a little slower than the 9.2 figure, no doubt because we can't easily reproduce the 9.2 design where multiple parentless elements were written to the same TinyTree structure (recall that we scrapped that design because of thread safety issues). But it's now hopefully within acceptable bounds.

This was run with the optimized "bulk copy" of nodes from one tinyTree to another. Without the bulk copy, the run-time is 145s. Introducing the bulk copy in a maintenance release feels a little risky - it's definitely an intricate bit of code that could have bugs in it, and indeed it isn't yet completely finished (for example it isn't copying over properties such as id, idref, and nilled). I will commit the code with the invocation of bulk copy commented out, and leave the bug entry open while I review what to do about this.

Actions #21

Updated by Michael Kay over 6 years ago

  • Description updated (diff)
Actions #22

Updated by Michael Kay over 6 years ago

  • Applies to branch 9.8 added
  • Fix Committed on Branch 9.8 added
Actions #23

Updated by Guy Pardon over 6 years ago

Maybe you could introduce the bulk copy as an optional feature that can be activated via a flag? That way, you would not break the stability of the maintenance branch.

Also, do you have an approximate estimate for a possible fix release? Just so we know; for now we are using the fallback to 9.2.

Thanks!

Actions #24

Updated by Michael Kay over 6 years ago

We're going to be putting together a maintenance release next week. It usually takes a couple of days, but it depends on what problems we find in testing. Yes, I'll consider putting out the bulk copy under a configuration flag.

Actions #25

Updated by Michael Kay over 6 years ago

  • Status changed from In Progress to Resolved

Closing this now.

The new "fast copy" code for TinyTree is committed, but disabled by default: it can be enabled by setting the static boolean variable net.sf.saxon.tree.tiny.TinyTree.useFastCopy to true.

For the next major release, I am working on a tree model that offers zero-cost copying, by avoiding holding node identity and parent pointers in the actual tree structure, but only allocating them in a virtual node materialized when the underlying node is reached by XPath navigation. For the kind of design pattern used extensively in this stylesheet, where performance is dominated by copying of subtrees, I think this offers the potential to get performance down from 2 minutes to 3 or 4 seconds.

Actions #26

Updated by O'Neil Delpratt over 6 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.8.0.5 added

Bug fix applied in the Saxon 9.8.0.5 maintenance release.
