Project

Profile

Help

Bug #6326

closed

Test stream-210: intermittent failure, possibly a multithreading problem

Added by Michael Kay 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Multithreading
Sprint/Milestone:
-
Start date:
2024-01-24
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
12, trunk
Fix Committed on Branch:
12, trunk
Fixed in Maintenance Release:
Platforms:
.NET, Java

Description

We have seen XSLT4.0 test case stream-210 fail as follows:

-s:source-document -t:stream-210
java.lang.NullPointerException
	at net.sf.saxon.expr.VariableReference.evaluateItem(VariableReference.java:610)
	at com.saxonica.ee.stream.StreamInstr.push(StreamInstr.java:272)
	at com.saxonica.ee.stream.StreamInstr$StreamInstrElaborator.lambda$elaborateForPush$0(StreamInstr.java:348)
	at net.sf.saxon.expr.instruct.FixedElement$FixedElementElaborator.lambda$elaborateForPush$0(FixedElement.java:640)
	at net.sf.saxon.expr.instruct.ResultDocument.processInstruction(ResultDocument.java:403)
	at com.saxonica.config.EnterpriseConfiguration.lambda$processResultDocument$1(EnterpriseConfiguration.java:1926)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

I haven't been able to reproduce the problem, but from the stack trace it looks suspiciously like a multithreading issue.

Actions #1

Updated by Michael Kay 4 months ago

Let's start with what the test is doing:

    <xsl:param name="extract-products" as="xs:string*" select="'product-1', 'product-3'"/>
    
    <xsl:param name="input-uri" as="xs:string" select="'stream-210.xml'"/>
    
    <xsl:template name="xsl:initial-template">
        <xsl:iterate select="$extract-products">
            <xsl:variable name="product-type" as="xs:string" select="."/>
            <xsl:result-document href="{$product-type}-result.xml">
                <root>
                    <xsl:source-document streamable="true" href="{$input-uri}">
                        <xsl:for-each select="*/product[@type = $product-type]">
                            <xsl:copy-of select="."/>
                        </xsl:for-each>
                    </xsl:source-document>
                </root>
            </xsl:result-document>
        </xsl:iterate>
    </xsl:template>

So for each of two selected product codes, it is creating a result document, doing a streaming pass over the input to select records with that product type, and outputting information connected with that product. The input XML contains 5 products, including two of type product-1 and one of type product-3.

In the SEF file, the "let" instruction for the variable product-type has been marked with flags="e" indicating that early evaluation is required. I think this is probably because the variable is accessed within the xsl:result-document instruction, which is evaluated asynchronously; evaluating the variable eagerly ensures that its value is available to all threads (in fact, the multiple threads use separate copies of the stack-frame).

Lines 610-612 in VariableReference read:

            Sequence actual = evaluateVariable(c);
            assert actual != null;
            return actual.head();

There are two earlier #ifdef lines that the preprocessor will remove, so I think that we can assume the NPE occurs on the last of these lines, caused by actual being null. The Java assert statement will have no effect unless running with assertions enabled, which is not the default.

StreamInst#272 is

AtomicValue hrefVal = (AtomicValue) getHref().evaluateItem(context);

which is the xsl;source-document instruction evaluating its @href attribute.

I'm slightly surprised that there is no evidence on the stack of the concat() instruction generated by the @href AVT. -- No, I was confusing the two href attributes

Also worth noting is that the StreamInstr code is not fully "elaborated". The push code is calling evaluateItem() to evaluate the @href attribute, without having first elaborated the expression for repeated evaluation. That should work, with a tiny impact on performance, so it's not itself evidence of a problem.

Actions #2

Updated by Michael Kay 3 months ago

So, it's the xsl:source-document/@href attribute, which is a global variable reference, that is coming back null.

This is actually an xsl:param for which no value has been supplied externally, so we will be evaluating the default (which is a string literal).

So I'm looking at GlobalParam.evaluateVariable().

Let's assume that both threads find that there's no value for the parameter in the Bindery. So GroundedValue val = b.getGlobalVariableValue(this); on line 64 returns null. Then getConvertedParameter returns null because no value was externally supplied. So we call actuallyEvaluate() on line 90. At this point global parameters take the same path as global variables.

We now have some multithreading code at GlobalVariable#737. We aren't actually synchronizing; we allow two threads both to evaluate the variable if necessary. But if we know another thread is evaluating the variable, we use Thread.sleep to wait until it has finished. Is this code safe? Instinct says no: it's updating the shared array busy[], and it's very hard to demonstrate convincingly that unsynchronised code that updates shared variables is safe. However, the code is written on the basis that it should work whether setExecuting() returns true or false.

The danger path seems to be where setExecuting() returns false. Can we be absolutely sure that when this happens, the bindery slot will contain a value for the global variable? If it doesn't we''ll see exactly the symptoms on this stack trace.

I don't think we can be sure of this. The thread that sets the "busy" flag does this "before" putting the value in the bindery slot but the Java concurrency model doesn't guarantee that updates to variables happen in the order they are initiated.

Actions #3

Updated by Michael Kay 3 months ago

I think we should ensure by synchronization that in any stylesheet (or query) execution, global variables are only evaluated once. (Aside from any other issues, the current design which allows for a slight possibility of variables being evaluated more than once creates a potential problem with node identity: if the variable is initialized to a new element, then a subsequent reference to the same variable might encounter a "different" (by node-id) element in the variable.)

This is easier said than done, because there's no immediately obvious object to synchronize on. In effect we're synchronizing on the bindery slot, but that isn't actually an object. On solution would be to replace the busy[] array (which has one entry per slot) with an array of objects that are used for no other purpose than synchronization.

But we also need to consider the possibility of deadlock. Circular dependencies among global variable declarations cannot always be detected statically; as well as weak synchronization, the busy array serves the purpose of circularity detection. If different parallel threads evaluate the same group of mutually-dependent variables in different orders, then synchronizing at the level of each variable would deadlock.

Perhaps we can ignore this risk. If the stylesheet has a circular dependency among global variables that's an error that needs to be fixed, and if there's a 99.999% probability we report a clean error message, and a 0.001% probability that we deadlock, that might be OK...

A single lock held when ANY global variable is being evaluated would not be acceptable, because "pipeline-style" stylesheets might spend 95% of their time evaluating global variables and would effectively be single-threaded.

Note that two independent evaluations of the same stylesheet never share global variables (they sometimes could, but they don't). Any synchronization here should only between threads within a single stylesheet evaluation, it should not cause two independent evaluations to lock on each other.

Actions #4

Updated by Michael Kay 3 months ago

I'm inclined to say:

  • Scrap the busy[] array in the Bindery. It's not needed for circularity detection because we do that by examining the context stack (see Controller.registerGlobalVariableDependency).

  • Synchronise GlobalVariable lines 743-747 on the Bindery:

            GroundedValue value = getSelectValue(context, target);
            if (_indexed) {
                value = controller.getConfiguration().obtainOptimizer().makeIndexedValue(value.iterate());
            }
            return b.saveGlobalVariableValue(this, value);
Actions #5

Updated by Michael Kay 3 months ago

The effect of this change is that XSLT3 tests error-0640k and error-0640l fail with a StackOverflow. These tests are designed to test for circular references among global variables, in a situation where the circularity cannot be detected statically (except by some complex data flow analysis).

Specifically, the check when GlobalVariable.actuallyEvaluate on line 734 calls setDependencies() does not appear to be effective.

Relying on the stack of XPathContext objects to detect the circularity is a little bit fragile but I'm having trouble seeing why it isn't working.

Actions #6

Updated by Michael Kay 3 months ago

I've reworked the evaluation of global variables to keep a stack of variables being evaluated for the purpose of detecting cycles.

Now test accumulator-031 appears to be hanging or looping. This test has a circular dependency between an accumulator and a global variable.

On investigation, it's a deadlock. We're evaluating the global variable in one thread, and while still doing so, another thread attempts to evaluate the same variable. This should result in a circularity error, but the deadlock comes first.

An observation: we could have got the deadlock even if there was no circularity, because any attempt to evaluate a global variable synchronises on the Bindery, which means that if evaluation of variable V in thread T triggers evaluation of variable W in thread U, thread U will wait until V has finished, which will never happen.

Perhaps we need some more sophisticated synchronisation approach, such as a java.util.concurrent.Semaphore? But first, let's think carefully about what we are trying to achieve.

Global variables are evaluated when first referenced. Two separate threads may each attempt the first reference. We want to ensure that one of those attempts completes first, and the other waits until evaluation is complete. But we also want to detect circular references, that is, an attempt to evaluate a variable while it is already being evaluated. And as the accumulator example demonstrates, the circularity may occur across threads rather than within a single thread. So how can we distinguish the case where two threads legitimately make "the first reference", and the case where there is a circularity?

I think it comes back to using the context. When a new thread is spawned, it inherits the context of its originator. The context should show that one attempt to evaluate the variable is occurring within the context of another attempt to evaluate the variable, distinguishing this case from the one where the two attempts occur in "peer" threads where one should wait for the other.

I'm reluctant to carry yet more information in the context (which needs to be copied across whenever a new context is created) but I'm starting to feel that's the only way... Alternatively, perhaps we should look more carefully at why simply looking down the "caller" chain in the context object wasn't working.

Actions #7

Updated by Michael Kay 3 months ago

Current state:

  • (a) The bindery holds a stack of variables currently being evaluated
  • (b) Evaluation of the variable isn't synchronised
  • (c) Writing back of the variable is synchronised (the value is discarded if another thread evaluates it first)

This is passing all relevant tests, but it's not quite right yet. It could fail, reporting a spurious circularity, if two threads evaluate the variable at the same time. And the stack doesn't feel right: the push/pop operations in different threads can be interleaved, meaning one thread can pop a value that was pushed by a different thread, which means that circularity detection can give wrong results.

Actions #8

Updated by Michael Kay 3 months ago

OK, I've gone back to the circularity check done by examining the context stack, and have established that the reason it wasn't working is that dynamic function calls were creating a new context without linking it back to the old. With that fixed, the mechanism appears now to be working.

Actions #9

Updated by Michael Kay 3 months ago

  • Status changed from New to In Progress
  • Platforms .NET added

I'm content this is now working on the 12.x branch - need to migrate the changes to main branch. A summary:

  • (a) We no longer use the "busy" flag in the Bindery for circularity checking: this was a belt-and-braces mechanism for when the check on the context stack didn't work.
  • (b) We've fixed the one case we know was broken that caused the context stack not to be maintained properly, namely dynamic function calls.
  • c) If we're evaluating a global variable reference and no value is available, we go ahead and evaluate it. On completion we check (under synchronization) that there is still no value available (it might have been computed in another thread) and if necessary, discard the value we have just computed and use the other one. This means we never hold a long-duration lock while a variable is being evaluated; we do occasionally waste effort by evaluating a variable unnecessarily in multiple threads; but all threads end up using the same value.
Actions #10

Updated by Michael Kay 3 months ago

  • Status changed from In Progress to Resolved
  • Applies to branch trunk added
  • Fix Committed on Branch 12, trunk added

Hopefully this is now fixed on the 12.x and main branches.

I say "hopefully" because the original problem was never reliably reproduced, and the fact that all tests pass does not prove there is no remaining issue.

Please register to edit this issue

Also available in: Atom PDF