Bug #4502
closedStackOverflowError on Saxon 10.0
100%
Description
The attached ast-to-ebnf.xq is a piece of code from REx grammar tools. It is used when converting a grammar from XML representation back to textual EBNF for different contexts.
Since Saxon 10.0, this code results in a StackOverflowError as shown in attachment stacktrace.log. As far as I can see, this is not related to problems with tail call optimization. The -explain output shows that function b:break-lines has been recognized as tail-recursive. However the stack trace suggests that subList chains may be causing the problem.
It can be reproduced using this command line (not always, but most of the time):
java net.sf.saxon.Query ast-to-ebnf.xq -s:rendered-items.xml 2>stacktrace.log
Files
Updated by Michael Kay over 4 years ago
A tough one to debug because the stack trace isn't telling us where we are in the Saxon code (or in the user query for that matter).
I got a somewhat different stack trace running with -T option, the relevant part reads
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at net.sf.saxon.tree.iter.ListIterator.next(ListIterator.java:51)
at EE_trace_22054798982.process(file:/Users/mike/bugs/2020/4502-Rademacher/ast-to-ebnf.xq:58)
at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:85)
at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:146)
at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:666)
at EE_trace_22054798982.process(file:/Users/mike/bugs/2020/4502-Rademacher/ast-to-ebnf.xq:58)
at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:85)
at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:146)
at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:666)
at EE_trace_22054798982.process(file:/Users/mike/bugs/2020/4502-Rademacher/ast-to-ebnf.xq:58)
at com.saxonica.ee.bytecode.CompiledExpression.process(CompiledExpression.java:85)
at com.saxonica.ee.bytecode.ByteCodeCandidate.process(ByteCodeCandidate.java:146)
at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:666)
which seems to have deep recursion both within the Java SubList,get() and within the user query itself -- specifically the b:break-lines() function, which certainly doesn't look tail-recursive from this evidence. But that may be because -T suppresses tail call optimization.
With bytecode disabled (but still with -T on) the middle of the stack trace looks like this:
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at java.util.SubList.get(AbstractList.java:641)
at net.sf.saxon.tree.iter.ListIterator.next(ListIterator.java:51)
at net.sf.saxon.expr.FirstItemExpression.evaluateItem(FirstItemExpression.java:113)
at net.sf.saxon.expr.parser.Evaluator$6.evaluate(Evaluator.java:173)
at net.sf.saxon.expr.LetExpression.eval(LetExpression.java:537)
at net.sf.saxon.expr.LetExpression.processLeavingTail(LetExpression.java:720)
at net.sf.saxon.expr.instruct.Choose.processLeavingTail(Choose.java:908)
at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:136)
at net.sf.saxon.expr.instruct.ComponentTracer.processLeavingTail(ComponentTracer.java:150)
at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:136)
at net.sf.saxon.expr.instruct.UserFunction.process(UserFunction.java:666)
It appears to be evaluating $todo[1]
at line 67. The value of $todo is a SequenceExtent
wrapping a Java RandomAccessSubList
, containing around 4400 element and text nodes. In the debugger we can see a sequence of calls of ListIterator.next(), with the length of list reducing by one on each call.
So I guess we need to look at how let $todo := $todo[position() > 1]
is working. It gets compiled into a TailExpression, and at runtime it's invoking
public SequenceExtent(SequenceExtent ext, int start, int length) {
value = ext.value.subList(start, start+length);
}
which, it appears is generating a SubList of a SubList of a Sublist etc, 4400 levels deep.
None of the code here is obviously new or changed in 10.0. Presumably there's some change that's causing this particular use cae to take this path when it didn't before.
Updated by Michael Kay over 4 years ago
I can see why it's failing; it's creating a deeply nested set of SubLists, and searching for the first item involves recursing down all these Sublists to find the first item in the one that's most deeply nested.
What I can't see is why it's not failing in exactly the same way in 9.9, The code seems to be following exactly the same paths, and there seem to be no signification changes in this area.
The strategy of creating a SubList to implement tail expressions seems flawed.
We only use this mechanism when using the LAZY_TAIL evaluator, which is only used when a TailExpression is applied to a local variable (suggesting likely use of head-tail recursion). There's a comment in the code to the effect that we're doing this "to avoid creating a deeply-nested tree of Closures" - so we're just creating a deeply-nested tree of SubLists instead.
Back in Saxon 9.5, we didn't use SubList in this way: rather, SequenceExtent kept a start/end position and applied these as offsets into the underlying list. This seems a better mechanism: though I would be inclined to use a variant of SequenceExtent rather than adding the complexity back into the main class.
Perhaps we should also be looking at other implementations of grounded sequences, such as Chain, which exists mainly to handle incremental addition of new items to the end of a sequence (as opposed to the current case, which is concerned with incremental removal from the start). Perhaps both cases could be handled using the mechanism we use for implementing arrays, namely an immutable list implemented as a binary tree, which provides reasonable efficiency both for sequential traversal and direct access, while also allowing efficient immutable append, insert, and remove operations? However, I'm concerned this would slow down the most common operations on sequences for the benefit of less common operations. Furthermore, it would use a lot more memory than the current SequenceExtent
.
Updated by Michael Kay over 4 years ago
- Category set to Performance
- Status changed from New to In Progress
- Assignee set to Michael Kay
- Priority changed from Low to Normal
I've been looking at other places where we use List.subList()
to see if they might have similar problems.
There are a few areas that are certainly worth looking at. These include:
- SimpleArrayItem.subArray()
- ImmList.fromList()
- AtomicArray.subsequence()
- Chain.subsequence()
- MemoSequence - makeExtent(), getResidue()
- ZeroOrMore.subsequence()
- ArrayIterator - materialize(), getResidue()
- ListIterator.getResidue()
Updated by Michael Kay over 4 years ago
I have introduced a new implementation of grounded sequences called SequenceSlice
that maps to a slice of an underlying List. SequenceExtent.subsequence()
and SequenceSlice.subsequence()
both now return a SequenceSlice
.
Since this should give a significant performance boost for applications following a similar head/tail recursion pattern, and since the danger of a stack overflow seems to be as much present in 9.9 as in 10.0, I shall retrofit this change to 9.9 as well.
Updated by Michael Kay over 4 years ago
The change has been retrofitted to 9.9 and appears to be working well. However, the bytecode generation needs to make the same change. We need to replace the call on new SequenceExtent(sequence, start, length)
with a call on sequence.subsequence(start, length)
.
Updated by Michael Kay over 4 years ago
- Status changed from In Progress to Resolved
- Applies to branch 9.9 added
- Fix Committed on Branch 10, 9.9 added
Bytecode generation now updated; marking as resolved.
Updated by Gunther Rademacher over 4 years ago
Thank you for fixing this!
I would like to test the fix on Saxon 10.0, but I can't seem to find it. I was expecting the fix to appear in https://dev.saxonica.com/repos/archive/opensource/latest10 (latest revision is 3460), but it doesn't. Bug #4508 is also marked as "Fix committed on branch 10", but I don't see that fix either.
The 9.9 branch however shows a fix for #4502, revision 3461 in https://dev.saxonica.com/repos/archive/opensource/latest9.9
Updated by Michael Kay over 4 years ago
Thanks for pointing that out. Seems I'm committing to our internal development repository and not to the public one, and the changes need to be migrated across,
Updated by O'Neil Delpratt over 4 years ago
- % Done changed from 0 to 100
- Fixed in Maintenance Release 10.1 added
Bug fix committed in the Saxon 10.1 maintenance release.
Updated by O'Neil Delpratt about 4 years ago
- Status changed from Resolved to Closed
- Fixed in Maintenance Release 10.2, 9.9.1.8 added
- Fixed in Maintenance Release deleted (
10.1)
Bug fix applied on the Saxon 9.9.1.8 maintenance release.
Please register to edit this issue