Bug #4663
closedPerformance regression (neural network application)
Added by Michael Kay over 4 years ago. Updated about 4 years ago.
100%
Description
Roger Costello raised an enquiry on xsl-list about performance of a stylesheet that runs a neural network, This issue is opened to log progress of the investigation.
Files
function-timings.ods (23.3 KB) function-timings.ods | Open Office spreadsheet giving function timings (nanosecs) for 9.9 and 10.1 | Michael Kay, 2020-08-04 00:09 |
Updated by Michael Kay over 4 years ago
First attempt to run this took
Execution time: 37m 18.325426704s (2238325.426704ms) Memory used: 2807Mb
which is an order of magnitude longer than the "whopping" 376,589.147ms reported by Roger.
From the Mac activity monitor, I suspect there is memory pressure here.
With a test case that takes this long, my first step in investigation is usually to attempt to cut down the data size, to achieve 2 things: (a) to make measurement more manageable, and (b) to get a feel for the scaleability - how does execution time depend on data size.
But before that, I'll try the same run with increased memory allocated.
Updated by Michael Kay over 4 years ago
The extra memory made no difference:
Execution time: 38m 4.343215796s (2284343.215796ms) Memory used: 1764Mb
Updated by Michael Kay over 4 years ago
Tried running it on 9.9; it's vastly faster
Execution time: 29.804892527s (29804.892527ms) Memory used: 1,180,616,528
Rather weirdly, this is 10 times faster than Roger's result, and 100 times faster than my 10.0 figure.
When there's a big difference between Saxon releases, the two things I generally check next are (a) where in Saxon is it spending its time, and (b) how do the optimizer traces (and optimized code trees) compare. So that will be the next step.
Updated by Michael Kay over 4 years ago
Running on 9.9 with -TP increased the run time to 2m 2.428310414s -- presumabiy the difference is the measurement overhead (but it's also possible that injecting the trace code inhibits some optimizations).
The profile for 9.9 is dramatically different from that supplied by Roger. Although the total execution time is lower, many of the counts for function execution are much higher: for example Roger reports 99 executions of function f:cell, but this profile shows 40,851,500 executions. The difference in counts could be because functions are being inlined in 10.1 more than they were in 9.9. But why should this have an adverse effect on performance?
It's also possible that the difference between the 9.9 and 10.1 execution counts is merely due to the way things are measured: perhaps 9.9 is measuring executions of inlined functions as if they had not been inlined? (For example, because the inliining happens AFTER insertion of the measurement instructions).
NOTE: Roger's 10.1 data shows 99 executions of f:train-the-neural-network whereas the 9.9 profile shows 100 executions. This is consistent with the first invocation (from template main) being inlined and therefore not counted. I'm mildly surprised that a recursive function should be inlined: but it's tail-recursive, so perhaps that's OK.
I'm running 10.1 with -opt:-f, that is, with function inlining suppressed. It's been running for 10 minutes, so it looks as if this isn't making much difference.
A general observation: when the discrepancy between two releases is very large, then it's usually much easier to identifiy the cause than when the difference is very small. However, investigating anything that runs for over 30m is always difficult because of the sheer time needed to perform experiments.
Perhaps we should measure both 9.9 and 10.1 with a reduced workload - presumably by reducing the size of the training data set. This should also yield data on the relationship of executiion time to data set size.
The run with -opt:-f is still running after about 40 minutes, so it's not going to give any new information, so I shall kill it.
Updated by Michael Kay over 4 years ago
- Tracker changed from Support to Bug
- Priority changed from Low to Normal
Reclassifying this as a bug since a performance regression of this magnitude always needs investigation, even if it only occurs under very rare circumstances.
Updated by Michael Kay over 4 years ago
I added an xsl:message with saxon:time="yes" to the main training loop. For 10.1 we get:
2020-08-03T07:19:46.955 - Done processing record: 99 to go
2020-08-03T07:20:11.321 - Done processing record: 98 to go
2020-08-03T07:20:33.321 - Done processing record: 97 to go
2020-08-03T07:20:55.321 - Done processing record: 96 to go
2020-08-03T07:21:18.512 - Done processing record: 95 to go
2020-08-03T07:21:40.702 - Done processing record: 94 to go
that is, a consistent time of about 22 seconds per training record.
For 9.9 we get
2020-08-03T07:24:00.212 - Done processing record: 99 to go
2020-08-03T07:24:01.224 - Done processing record: 98 to go
2020-08-03T07:24:01.699 - Done processing record: 97 to go
2020-08-03T07:24:02.012 - Done processing record: 96 to go
2020-08-03T07:24:02.402 - Done processing record: 95 to go
2020-08-03T07:24:02.667 - Done processing record: 94 to go
2020-08-03T07:24:02.950 - Done processing record: 93 to go
2020-08-03T07:24:03.270 - Done processing record: 92 to go
that is, around 300ms per training run.
So (a) the performance appears to be linear wrt the size of he training data set, and (b) we can conduct further investigations with a much smaller data set.
Updated by Michael Kay over 4 years ago
With a training data set of 2 records, 9.9 completes in 2.499s, while 10.1 takes 49.35s
Updated by Michael Kay over 4 years ago
I ran both cases with the -T option, with rather surprising results.
With 9.9, execution time rises to 2m 10.92s, and the trace file occupies 797Mb
With 10.1, execution time increases only to 1m 18.92s, and the trace file occupies 231Mb
There could be differences in the way tracing works that account for some of this, but the difference is difficult to explain. It does suggest that the execution paths in the two releases are very different, and it also suggests that wherever the inefficiency is in 10.1, it is at the "sub-instruction" level: ie. we are not executing more instructions, they are just taking longer.
Updated by Michael Kay over 4 years ago
The Java profile for 10.1 starts like this:
1 49.99% 49.99% 4329 300052 java.net.SocketInputStream.socketRead0
2 16.78% 66.77% 1453 300553 EE_CheckCardinality_18841812768.evaluateItem
3 8.91% 75.68% 772 300563 net.sf.saxon.value.SequenceExtent.makeSequenceExtent
4 7.58% 83.26% 656 300562 java.util.ArrayList.<init>
5 7.38% 90.64% 639 300564 java.util.ArrayList.<init>
6 3.71% 94.34% 321 300537 com.saxonica.ee.bytecode.ByteCodeCandidate.evaluateItem
7 0.90% 95.24% 78 300552 com.saxonica.ee.bytecode.iter.CompiledFilterIterator.next
8 0.40% 95.65% 35 300581 net.sf.saxon.expr.ItemMappingIterator.next
9 0.28% 95.92% 24 300533 java.util.Arrays.copyOf
10 0.28% 96.20% 24 300032 java.lang.ClassLoader.defineClass1
11 0.28% 96.48% 24 300566 java.util.Arrays$ArrayList.toArray
12 0.25% 96.73% 22 300565 java.util.ArrayList.<init>
13 0.18% 96.92% 16 300561 java.util.ArrayList.<init>
14 0.15% 97.07% 13 300597 net.sf.saxon.tree.iter.ListIterator.next
15 0.14% 97.21% 12 300570 net.sf.saxon.expr.ItemMappingIterator.next
16 0.13% 97.33% 11 300578 gen_ForItemMappingAction_line155_103.mapItem
17 0.12% 97.45% 10 300572 EE_arithmetic_101255465236.evaluateItem
18 0.12% 97.56% 10 300594 net.sf.saxon.tree.iter.ListIterator.next
19 0.09% 97.66% 8 300530 com.saxonica.ee.bytecode.ByteCodeCandidate.effectiveBooleanValue
So it's spending a fair bit of time in generated bytecode. Nothing wrong with that except that it makes analysis more difficult.
I've realised that I've been running 10.1 with -opt:-f which may cast doubt on some of the measurements recorded above; however, it doesn't appear to have a big effect on the bottom line.
Switching bytecode off with -opt:-c has very little effect on the bottom line (exec time 53.08s) and gives us this profile:
1 49.98% 49.98% 4422 300057 java.net.SocketInputStream.socketRead0
2 22.82% 72.80% 2019 300550 java.util.ArrayList.<init>
3 9.05% 81.86% 801 300560 net.sf.saxon.expr.Expression.iterate
4 7.93% 89.79% 702 300559 net.sf.saxon.expr.Expression.iterate
5 1.94% 91.74% 172 300580 net.sf.saxon.expr.FunctionCall.iterate
6 0.92% 92.65% 81 300578 net.sf.saxon.expr.FunctionCall.iterate
7 0.80% 93.46% 71 300540 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
8 0.71% 94.17% 63 300579 net.sf.saxon.om.FocusTrackingIterator.next
9 0.64% 94.81% 57 300537 net.sf.saxon.expr.VariableReference.evaluateItem
10 0.51% 95.32% 45 300558 java.util.ArrayList.<init>
11 0.50% 95.82% 44 300534 net.sf.saxon.expr.VariableReference.evaluateItem
12 0.38% 96.20% 34 300536 net.sf.saxon.value.NumericValue.<init>
13 0.36% 96.56% 32 300535 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
14 0.31% 96.87% 27 300032 java.lang.ClassLoader.defineClass1
15 0.20% 97.07% 18 300557 java.util.Arrays$ArrayList.toArray
16 0.19% 97.26% 17 300569 java.util.Arrays.copyOf
No obvious anomalies here.
Updated by Michael Kay over 4 years ago
I have added xsl:message/@saxon:time
instructions between the successive phases of computation of f:train()
, also adding code to each of the messages designed to force eager evaluation of the previous variable. The results show the time dominated by the cost of computing $updated-wih
. Note also that the input and output to this phase both have size 78403, which is considerably larger than any f the other matrices involved.
So we need to look more carefully at how the function matrix:addition() is being evaluated. The body of this function is
(for $i in 1 to $M[1], $j in 1 to $M[2] return f:cell($M, $i, $j) + f:cell($N, $i, $j))
Some further tracing tells us $M[1]
= 100, $M[2]
= 784.
Updated by Michael Kay over 4 years ago
Now studying the generated code for the call to matrix:addition (obtained using -explain). The optimizer trace contains too much information to digest easily, but the print of the final expression tree shows us that both matrix:addition() and its two calls to f:cell() have been inlined, resulting in the code:
<?xml version="1.0" encoding="UTF-8"?>
<let line="184" var="Q{}updated-wih" slot="107" eval="3">
<let baseUri="file:/Users/mike/bugs/2020/4663-Costello/Neural/matrix-operations.xsl"
ns="activation=activation-functions f=local-function math=~ matrix=matrix-functions xs=~ xsl=~ xml=~"
var="Q{}M"
slot="108"
eval="4">
<ifCall baseUri="file:/Users/mike/bugs/2020/4663-Costello/Neural/main.xsl"
ns="activation=activation-functions f=function map=~ math=~ matrix=matrix-functions random=random-number-generator-with-normal-distribution saxon=~ xs=~ xsl=~ xml=~"
name="Q{http://www.w3.org/2005/xpath-functions/map}get"
type="*">
<varRef name="Q{}self" slot="0"/>
<str val="wih"/>
</ifCall>
<let var="Q{}N" slot="109" eval="6">
<varRef baseUri="file:/Users/mike/bugs/2020/4663-Costello/Neural/main.xsl"
ns="activation=activation-functions f=function map=~ math=~ matrix=matrix-functions random=random-number-generator-with-normal-distribution saxon=~ xs=~ xsl=~ xml=~"
name="Q{}learning-rate-multiplied-by-weight-changes"
slot="106"/>
<let module="file:/Users/mike/bugs/2020/4663-Costello/Neural/matrix-operations.xsl"
line="155"
var="Q{http://saxon.sf.net/generated-variable}v0"
as="*ADI"
slot="110"
eval="4">
<to>
<int val="1"/>
<treat as="ADI" diag="1|1||to">
<check card="?" diag="1|1||to">
<cvUntyped to="ADI">
<data diag="1|1||to">
<subscript>
<varRef name="Q{}M" slot="108"/>
<int val="2"/>
</subscript>
</data>
</cvUntyped>
</check>
</treat>
</to>
<sequence line="154">
<first line="155">
<varRef name="Q{}M" slot="108"/>
</first>
<subscript line="155">
<varRef name="Q{}M" slot="108"/>
<int val="2"/>
</subscript>
<str val="wih"/>
<for line="155" var="Q{}i" as="ADI" slot="111">
<to role="in">
<int val="1"/>
<treat as="ADI" diag="1|1||to">
<check card="?" diag="1|1||to">
<cvUntyped to="ADI">
<data diag="1|1||to">
<first>
<varRef name="Q{}M" slot="108"/>
</first>
</data>
</cvUntyped>
</check>
</treat>
</to>
<for role="return" var="Q{}j" as="ADI" slot="112">
<varRef role="in"
name="Q{http://saxon.sf.net/generated-variable}v0"
slot="110"/>
<arith role="return" op="+" calc="d+d">
<let var="Q{}M" slot="113" eval="6">
<varRef name="Q{}M" slot="108"/>
<let var="Q{}row" as="1ADI" slot="114" eval="6">
<varRef name="Q{}i" slot="111"/>
<let var="Q{}col" as="1ADI" slot="115" eval="6">
<varRef name="Q{}j" slot="112"/>
<check line="21" card="1" diag="5|0|XTTE0780|f:cell#3">
<convert from="A" to="AO" flags="p" diag="5|0|XTTE0780|f:cell#3">
<cvUntyped to="AO" diag="5|0|XTTE0780|f:cell#3">
<data diag="5|0|XTTE0780|f:cell#3">
<subscript>
<varRef name="Q{}M" slot="113"/>
<arith op="+" calc="a+a">
<arith op="+" calc="a+a">
<arith op="*" calc="a*a">
<arith op="-" calc="i-i">
<varRef name="Q{}row" slot="114"/>
<int val="1"/>
</arith>
<cvUntyped to="AO">
<atomSing card="?" diag="1|1||*">
<subscript>
<varRef name="Q{}M" slot="113"/>
<int val="2"/>
</subscript>
</atomSing>
</cvUntyped>
</arith>
<varRef name="Q{}col" slot="115"/>
</arith>
<int val="3"/>
</arith>
</subscript>
</data>
</cvUntyped>
</convert>
</check>
</let>
</let>
</let>
<let var="Q{}M" slot="116" eval="6">
<varRef name="Q{}N" slot="109"/>
<let var="Q{}row" as="1ADI" slot="117" eval="6">
<varRef name="Q{}i" slot="111"/>
<let var="Q{}col" as="1ADI" slot="118" eval="6">
<varRef name="Q{}j" slot="112"/>
<check line="21" card="1" diag="5|0|XTTE0780|f:cell#3">
<convert from="A" to="AO" flags="p" diag="5|0|XTTE0780|f:cell#3">
<cvUntyped to="AO" diag="5|0|XTTE0780|f:cell#3">
<data diag="5|0|XTTE0780|f:cell#3">
<subscript>
<varRef name="Q{}M" slot="116"/>
<arith op="+" calc="a+a">
<arith op="+" calc="a+a">
<arith op="*" calc="a*a">
<arith op="-" calc="i-i">
<varRef name="Q{}row" slot="117"/>
<int val="1"/>
</arith>
<cvUntyped to="AO">
<atomSing card="?" diag="1|1||*">
<subscript>
<varRef name="Q{}M" slot="116"/>
<int val="2"/>
</subscript>
</atomSing>
</cvUntyped>
</arith>
<varRef name="Q{}col" slot="118"/>
</arith>
<int val="3"/>
</arith>
</subscript>
</data>
</cvUntyped>
</convert>
</check>
</let>
</let>
</let>
</arith>
</for>
</for>
</sequence>
</let>
</let>
</let>
...
</let>
Nothing obviously amiss here: we might need to actually execute it in the debugger to see what's going on internally.
Before we do that, back to the 9.9 code, which shows
(a) the same matrix:addition operation takes almost no time at all (50ms)
(b) the call to matrix:addition() is not inlined in this case
(c) the code generated for matrix:addition() looks similar:
<function name='Q{matrix-functions}addition' line='150' module='file:/Users/mike/bugs/2020/4663-Costello/Neural/matrix-operations.xsl' eval='4' flags='pU' as='item()*' slots='12'>
<arg name='Q{}M' as='item()*'/>
<arg name='Q{}N' as='item()*'/>
<arg name='Q{}name-of-result-matrix' as='xs:string'/>
<let role='body' baseUri='file:/Users/mike/bugs/2020/4663-Costello/Neural/matrix-operations.xsl' ns='xsl=~ xs=~ math=~ activation=activation-functions matrix=matrix-functions f=local-function' line='155' var='Q{http://saxon.sf.net/generated-variable}v0' as='xs:integer*' slot='3' eval='4'>
<to>
<int val='1'/>
<treat as='xs:integer' diag='1|1||to'>
<check card='?' diag='1|1||to'>
<cvUntyped to='xs:integer'>
<data>
<subscript>
<varRef name='Q{}M' slot='0'/>
<int val='2'/>
</subscript>
</data>
</cvUntyped>
</check>
</treat>
</to>
<sequence line='154'>
<first line='155'>
<varRef name='Q{}M' slot='0'/>
</first>
<subscript line='155'>
<varRef name='Q{}M' slot='0'/>
<int val='2'/>
</subscript>
<varRef line='155' name='Q{}name-of-result-matrix' slot='2'/>
<for line='155' var='Q{}i' as='xs:integer' slot='4'>
<to role='in'>
<int val='1'/>
<treat as='xs:integer' diag='1|1||to'>
<check card='?' diag='1|1||to'>
<cvUntyped to='xs:integer'>
<data>
<first>
<varRef name='Q{}M' slot='0'/>
</first>
</data>
</cvUntyped>
</check>
</treat>
</to>
<for role='return' var='Q{}j' as='xs:integer' slot='5'>
<varRef role='in' name='Q{http://saxon.sf.net/generated-variable}v0' slot='3'/>
<arith role='return' op='+' calc='d+d'>
<let var='Q{}M' as='item()*' slot='6' eval='6'>
<varRef name='Q{}M' slot='0'/>
<let var='Q{}row' as='xs:integer' slot='7' eval='6'>
<varRef name='Q{}i' slot='4'/>
<let var='Q{}col' as='xs:integer' slot='8' eval='6'>
<varRef name='Q{}j' slot='5'/>
<check line='21' card='1' diag='5|0|XTTE0780|f:cell#3'>
<convert from='xs:anyAtomicType' to='xs:double' flags='p' diag='5|0|XTTE0780|f:cell#3'>
<cvUntyped to='xs:double' diag='5|0|XTTE0780|f:cell#3'>
<data>
<subscript>
<varRef name='Q{}M' slot='6'/>
<arith op='+' calc='a+a'>
<arith op='+' calc='a+a'>
<arith op='*' calc='a*a'>
<arith op='-' calc='i-i'>
<varRef name='Q{}row' slot='7'/>
<int val='1'/>
</arith>
<cvUntyped to='xs:double'>
<atomSing card='?' diag='1|1||*'>
<subscript>
<varRef name='Q{}M' slot='6'/>
<int val='2'/>
</subscript>
</atomSing>
</cvUntyped>
</arith>
<varRef name='Q{}col' slot='8'/>
</arith>
<int val='3'/>
</arith>
</subscript>
</data>
</cvUntyped>
</convert>
</check>
</let>
</let>
</let>
<let var='Q{}M' as='item()*' slot='9' eval='6'>
<varRef name='Q{}N' slot='1'/>
<let var='Q{}row' as='xs:integer' slot='10' eval='6'>
<varRef name='Q{}i' slot='4'/>
<let var='Q{}col' as='xs:integer' slot='11' eval='6'>
<varRef name='Q{}j' slot='5'/>
<check line='21' card='1' diag='5|0|XTTE0780|f:cell#3'>
<convert from='xs:anyAtomicType' to='xs:double' flags='p' diag='5|0|XTTE0780|f:cell#3'>
<cvUntyped to='xs:double' diag='5|0|XTTE0780|f:cell#3'>
<data>
<subscript>
<varRef name='Q{}M' slot='9'/>
<arith op='+' calc='a+a'>
<arith op='+' calc='a+a'>
<arith op='*' calc='a*a'>
<arith op='-' calc='i-i'>
<varRef name='Q{}row' slot='10'/>
<int val='1'/>
</arith>
<cvUntyped to='xs:double'>
<atomSing card='?' diag='1|1||*'>
<subscript>
<varRef name='Q{}M' slot='9'/>
<int val='2'/>
</subscript>
</atomSing>
</cvUntyped>
</arith>
<varRef name='Q{}col' slot='11'/>
</arith>
<int val='3'/>
</arith>
</subscript>
</data>
</cvUntyped>
</convert>
</check>
</let>
</let>
</let>
</arith>
</for>
</for>
</sequence>
</let>
</function>
Updated by Michael Kay over 4 years ago
Comparison of the SEF output for the matrix addition function in 9.9 and 10.1 shows they are identical (excluding known differences in SEF details).
Updated by Michael Kay over 4 years ago
I'm not sure what I've changed, but I'm now seeing an internal error while processing the second training record:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4
at net.sf.saxon.expr.instruct.BlockIterator.next(BlockIterator.java:50)
at net.sf.saxon.om.MemoSequence.itemAt(MemoSequence.java:143)
at net.sf.saxon.expr.SubscriptExpression.evaluateItem(SubscriptExpression.java:168)
at net.sf.saxon.expr.Expression.iterate(Expression.java:872)
This occurs while evaluating $N[2]
on line 101 of matrix-operations.xsl
The XSLT code here is probably a little unusual in that it is trying to read the second item of a lazily-evaluated variable before evaluating the first. The code is designed to handle this, of course, but it may be a path that is not executed very often.
After investigation, the change that led to this crash was to add the line
<xsl:comment expand-text="yes">matrix-addition {$M[1]} x {$M[2]}</xsl:comment>
into the matrix:addition function, immediately after the param declarations. The intent of this was to trigger diagnostics. The crash must be some kind of bug, but it's unrelated to this performance investigation, so I'm going to put it to one side for now. (I think that the comment triggered a type error, which was ignored because errors when evaluating xsl:message are not fatal; but the error recovery left the BlockIterator in some kind of inconsistent state.)
Updated by Michael Kay over 4 years ago
I have been running side-by-side debugging using 9.9 and 10.1 to try and detect any difference in the internal paths taken. Without success. Except that I did notice that there was a slight hiccup on the 10.1 side while evaluating $unpdated_wih, it seems to take longer than 9.9 to populate the reservoir of the MemoSequence with the 78403 items returned by the matrix:addition
function. I need to check this more carefully. Note that this time is not spent within the function, it is spent reading off the results, since the results are returned as a lazy iterator.
Updated by Michael Kay over 4 years ago
I added internal instrumentation to count the numbers of calls to user functions (to get away from any side-effects of injecting trace calls). Function inlining and bytecode generation are disabled for both versions. The number of function calls is exactly the same:
COUNTERS
scalar-multiplication = 4
normal = 79400
col = 160524
gauss = 79400
max = 10
Hadamard_product = 8
query = 10
dot = 160320
neuralNetwork = 1
cell = 1596228
sequence = 2
test-the-neural-network = 11
sigmoid = 14
create = 16
dot-product = 30
difference = 2
train-the-neural-network = 3
row = 160320
transpose = 6
scalar-difference = 4
train = 2
addition = 4
Yet 9.9 runs in 11.76s, 10.1 in 56.55s.
Updated by Michael Kay over 4 years ago
More counters that are invariant between 9.9 and 10.1:
COUNTERS
ProgressiveIterator.next() = 3134420
MemoClosure = 321122
MemoSequence = 320986
FirstItemExpr.evaluateItem() = 160730
SubscriptExpr.evaluateItem() = 5740466
For this one we get a very slight variation:
9.9: ArithmeticExpr.evaluateItem() = 9913336
10.1: ArithmeticExpr.evaluateItem() = 9914848
It's a small difference but is it significant? We're clutching at straws...
Drilliing down to the locations of the arithmetic expressions executed, the differences seem to be in random:gauss() and are explained by the fact that the number of arithmetic operations is actually slightly random.
Updated by Michael Kay over 4 years ago
Let's go back to to the output of Java profiling.
9.9 has as its top 20:
1 49.94% 49.94% 1254 300052 java.net.SocketInputStream.socketRead0
2 16.13% 66.07% 405 300583 net.sf.saxon.value.AtomicValue.<init>
3 4.06% 70.13% 102 300584 net.sf.saxon.tree.iter.ListIterator.next
4 3.31% 73.44% 83 300587 net.sf.saxon.value.Closure.head
5 3.23% 76.66% 81 300585 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
6 2.11% 78.77% 53 300588 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
7 1.79% 80.57% 45 300513 java.util.HashMap.put
8 1.59% 82.16% 40 300597 net.sf.saxon.om.FocusTrackingIterator.next
9 1.47% 83.63% 37 300505 java.lang.Integer.getChars
10 1.43% 85.07% 36 300594 net.sf.saxon.value.Closure.head
11 1.31% 86.38% 33 300593 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
12 1.19% 87.57% 30 300605 net.sf.saxon.tree.iter.ListIterator.next
13 0.96% 88.53% 24 300037 java.lang.ClassLoader.defineClass1
14 0.72% 89.25% 18 300598 net.sf.saxon.expr.VariableReference.evaluateItem
15 0.64% 89.88% 16 300589 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
16 0.40% 90.28% 10 300596 net.sf.saxon.expr.VariableReference.evaluateItem
17 0.40% 90.68% 10 300062 java.io.FileInputStream.readBytes
18 0.40% 91.08% 10 300508 net.sf.saxon.om.MemoSequence$ProgressiveIterator.materialize
19 0.36% 91.44% 9 300600 net.sf.saxon.expr.ArithmeticExpression.evaluateItem
20 0.36% 91.80% 9 300599 net.sf.saxon.expr.UserFunctionCall.evaluateArguments
while 10.1 has
1 49.98% 49.98% 4918 300056 java.net.SocketInputStream.socketRead0
2 13.72% 63.71% 1350 300559 java.util.ArrayList.<init>
3 11.83% 75.54% 1164 300549 net.sf.saxon.expr.Expression.iterate
4 11.73% 87.26% 1154 300541 net.sf.saxon.expr.Expression.iterate
5 3.59% 90.85% 353 300530 net.sf.saxon.value.NumericValue.<init>
6 1.05% 91.90% 103 300534 net.sf.saxon.expr.VariableReference.evaluateItem
7 1.00% 92.90% 98 300540 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
8 0.95% 93.84% 93 300526 net.sf.saxon.om.FocusTrackingIterator.next
9 0.53% 94.37% 52 300464 java.lang.Integer.getChars
10 0.50% 94.87% 49 300531 net.sf.saxon.expr.VariableReference.evaluateItem
11 0.43% 95.29% 42 300532 net.sf.saxon.expr.ValueComparison.effectiveBooleanValue
12 0.36% 95.65% 35 300037 java.lang.ClassLoader.defineClass1
13 0.30% 95.95% 30 300571 java.util.Arrays$ArrayList.toArray
14 0.30% 96.26% 30 300572 java.util.ArrayList.<init>
15 0.29% 96.55% 29 300467 java.util.HashMap.put
16 0.19% 96.75% 19 300551 net.sf.saxon.tree.iter.ListIterator.next
17 0.16% 96.91% 16 300463 net.sf.saxon.om.MemoSequence$ProgressiveIterator.materialize
18 0.15% 97.06% 15 300543 net.sf.saxon.expr.ArithmeticExpression.evaluateItem
19 0.11% 97.17% 11 300582 net.sf.saxon.expr.ArithmeticExpression.evaluateItem
20 0.11% 97.29% 11 300580 net.sf.saxon.expr.ArithmeticExpression.evaluateItem
Let's use Saxon internal counters to target some of the methods that appear frequently in 10.1 but not in 9.9.
No joy: the numbers again are almost identical:
Int64Value.init() = 171744191
Expression.iterate() = 3964277
ListIterator.next() = 111232137
Closure.head() = 166187760
VarRef.evaluateItem = 173268795
FocusTracker.next() = 110820684
This is really proving very frustrating. We know from the xsl:message timings that there's a massive difference between messages U and W. 9.9 has
2020-08-03T22:06:22.116 - U 78403
2020-08-03T22:06:22.151 - V 78403
2020-08-03T22:06:22.231 - W 78403
when processing the first training record, while 10.1 has
2020-08-03T22:09:53.490 - U 78403
2020-08-03T22:10:02.559 - V 78403
2020-08-03T22:10:15.417 - W 78403
That's 10 seconds between U and V, and another 10 between V and W; and the same when processing the second training record, giving a total difference of 40 seconds which accounts for the entire difference for the runs. The code in question is:
<xsl:message saxon:time="yes">U {count($weight-changes)}</xsl:message>
<xsl:variable name="learning-rate-multiplied-by-weight-changes" select="matrix:scalar-multiplication(map:get($self, 'lr'), $weight-changes, 'learning-rate-times-weight-changes')" as="item()*" />
<xsl:message saxon:time="yes">V {count($learning-rate-multiplied-by-weight-changes)}</xsl:message>
<xsl:variable name="updated-wih" select="matrix:addition(map:get($self, 'wih'), $learning-rate-multiplied-by-weight-changes, 'wih')" as="item()*" />
<xsl:message saxon:time="yes">W {count($updated-wih)}</xsl:message>
It really must be possible to find out where those 10 seconds are being spent!
Updated by Michael Kay over 4 years ago
I implemented some internal instrumentation to measure the average elapsed time in each user-defined function; then I compared the 10.1 and 9.9 timings. For most functions the ratio of 10.1 average execution time to 9.9 average execution time is between 0.5 and 1.5. The outliers are matrix:addition (ratio 87.86) and matrix:scalar-multiplication (ratio 106.73).
Note that this instrumentation prevented functions returning "lazy" results, and this actually led to a substantial speed-up for both 10.1 and 9.9 of the order of 30%.
The outliers confirm what we already knew from the xsl:message measurements, but we now have independent verification, and we have discounted the effect of lazy evaluation.
The two problem functions have very similar code:
<xsl:sequence select="$M[1], $M[2], $name-of-result-matrix,
(for $i in 1 to $M[1], $j in 1 to $M[2] return $scalar * f:cell($M, $i, $j))" />
Perhaps what we need to do now is to execute this code in a standalone environment, separated from its calling stylesheet. We could also try subtractive measurement: removing things from the code to see how the performance changes.
Updated by Michael Kay over 4 years ago
- File function-timings.ods function-timings.ods added
Updated by Michael Kay over 4 years ago
Not so extreme, but the functions f:cell() and f:dot also take about 3 times the 9.9 execution time. This may be a contributory factor to the regression in the more complex matrix functions. But since the number of calls is the same, it's hard to see how this accounts for a factor of 100 increase in the caller.
I tried changing the code of matrix:addition to bind variables to $M[1] and $M[2]. It made almost no difference.
Replacing $M[1]
and $M[2]
by literals (100, 784) doesn't make much difference either.
I tried manually inlining the two calls on f:cell(). This halved the execution time of the function, but it's still a long way short of the 9.9 timing.
Updated by Michael Kay over 4 years ago
The number of RangeIterators is the same for both versions:
COUNTERS
RangeIterator1-1 = 321188
RangeIterator1-1000 = 1
RangeIterator1-100 = 348
RangeIterator1-78400 = 1
RangeIterator0-9 = 13
RangeIterator1-10 = 638
RangeIterator1-784 = 3008
Updated by Michael Kay over 4 years ago
I've finally found an interesting difference between the two releases. If we count how many times SequenceExtent.itemAt() is called, grouped by the Java List
implementation class, 9.9 gives us
java.util.RandomAccessSubList = 2202225
java.util.Arrays$ArrayList = 1181297
java.util.ArrayList = 2206143
while 10.1 has
java.util.RandomAccessSubList = 1
java.util.ArrayList = 5589664
The total number of calls is the same, but the Java implementation is different.
The figures reflect how the kind of List supplied when the SequenceExtent was created:
9.9
SequenceExtent.fromArray = 727
SequenceExtent.fromList java.util.Arrays$ArrayList = 1181234
SequenceExtent.fromList java.util.ArrayList = 2045826
SequenceExtent.fromList java.util.RandomAccessSubList = 2195242
10.1
SequenceType.fromList java.util.ArrayList = 5422290
SequenceType.fromArray = 727
SequenceType.fromList java.util.RandomAccessSubList = 12
(It's a little surprising that the number of itemAt()
calls is almost the same as the number of SequenceExtent
's created.)
The difference appears to arise from SequenceExtent.makeSequenceExtent(List)
, where 9.9 simply wraps the supplied List, but 10.1 makes a copy - which is safer but expensive.
Changing this method to avoid making the copy solves the performance regression: execution time is down to 12s. The question is, is this safe?
(Note, although the difference in List implementation class drew my attention to the issue, it's not the fact that a different implementation is used that directly causes the performance regression; rather this difference revealed the true issue which is that a supplied List was being unnecessarily copied.)
Updated by Michael Kay over 4 years ago
The change to makeSequenceExtent()
to copy the supplied List
seems to have been made in commit #9634; this was part of de-generification and the commit message does not explain the rationale for this specific change. It doesn't appear to originate from a bug report, otherwise it would have been made on the 9.9 branch as well. It was probably motivated by observing that the existing code was unsafe. We should probably make the copy only on paths explicitly available to user code, e.g. XdmValue.makeSequence()
. A copy is certainly NOT needed when we are creating the list from the ProgressiveIterator
of a MemoClosure
, which is the case here.
Updated by Michael Kay over 4 years ago
- Subject changed from Neural Network Performance to Performance regression (neural network application)
- Status changed from New to Resolved
- Applies to branch 10, trunk added
- Fix Committed on Branch 10, trunk added
Updated by O'Neil Delpratt about 4 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 10.2 added
Bug fix applied in the Saxon 10.2 maintenance release.
Updated by O'Neil Delpratt about 4 years ago
- Status changed from Resolved to Closed
Please register to edit this issue