Project

Profile

Help

Bug #5488

closed

Streaming with the union, intersect, and except operators

Added by Michael Kay almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Streaming
Sprint/Milestone:
-
Start date:
2022-05-12
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
10, 11, trunk
Fix Committed on Branch:
11, trunk
Fixed in Maintenance Release:
Platforms:
.NET, Java

Description

The vast majority of the tests for streaming with union, intersect, and except operators provide one streamed and one non-streamed operand. This case isn't very interesting; in particular, with intersect and except the result will always be empty, and the only interest is in whether and when we detect this.

The specification, however, describes some much more challenging cases. We should be able to process a union, intersect, or except operator with two consuming operands, for example //book | //magazine or //book[@price gt 10] intersect subsequence(//book, 1, 10). The assumption, explained in a note in the spec, is that the two operands can be evaluated "in parallel" during a single pass of the input, and the results combined on the fly. We make no attempt to implement this, as far as I am aware.

Cases involving climbing operands are also interesting: the spec gives the example parent::A | */ancestor::B

Actions #1

Updated by Michael Kay almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by Michael Kay almost 2 years ago

It doesn't actually look very difficult to do this. We handle streaming expressions like //book and //magazine by treating them as patterns and matching them against nodes in the streamed document; we already have the capability to construct a VennPattern representing the union, intersection, or difference between two patterns.

Actions #3

Updated by Michael Kay almost 2 years ago

I've added tests sx-union-300 to -335 that test unions between (for the most part) two consuming operands, and a good number of them work as written. However, some are failing and need further examination to see what the spec says.

For example test -317 does

<xsl:value-of select="(/BOOKLIST/BOOKS/ITEM[1]/* union /BOOKLIST/BOOKS/ITEM[2]/*) ! contains(name(), 'E')"/>

and is failing

SXST0061  Navigation using child axis is not supported from a streamed input node

This failure seems spurious. The failure typically occurs because static analysis has concluded that the expression is streamable, but the run-time code fails to provide a streaming implementation. In particular, the union expression cannot be converted to a pattern because the filtered operands cannot be converted to patterns, which is ultimately because SubscriptExpressionAdjunct.toStreamingPattern() returns null (as does FirstItemExpressionAdjunct).

We implement streaming of an expression like /BOOKLIST/BOOKS/ITEM[1]/* not by generating a pattern that matches these nodes and nothing else, but by matching /BOOKLIST/BOOKS/ITEM nodes and counting them as they pass by. Union expressions seem to be streamable only if the operands can both be treated as patterns.

To make this test work, we need to be smarter than this. Essentially we need to treat it like an xsl:fork, where the two operands are evaluated independently (as separate Watches in the WatchManager) and the two feeds are then merged into one. This isn't a completely new concept, we do much the same thing for merging entries in a streamed map constructor, but it's certainly new code.

Actions #4

Updated by Michael Kay almost 2 years ago

I've made good progress in allowing a union of striding sequences like (A/B/C union A/B/D), by having a watch for each operand feeding into a CombiningUnionFeed that forms the union of the two sequences. This is straightforward enough because the nodes arrive in the right order, and if the same node arrives from both inputs, they will be adjacent which makes it easy to detect the duplicate.

I'm having more difficulty with sequences that use the ancestor axis, for example

ITEM[1]/ancestor::* | ITEM[2]/ancestor::*

where the nodes arrive in the CombiningUnionFeed in the order BOOKLIST BOOKS ITEM[1] BOOKLIST BOOKS ITEM[2], and the BOOKLIST and BOOKS elements need to be de-duplicated.

The answer may be for the CombiningUnionFeed to accumulate nodes so long as they derive from the same WatchManager event (we could detect that by maintaining an event counter in the WatchManager).

Actions #5

Updated by Michael Kay almost 2 years ago

Investigating sx-union-331 which does

select="(/BOOKLIST/BOOKS/ITEM[1]/PRICE/ancestor::* union /BOOKLIST/BOOKS/ITEM[2]/PRICE/ancestor::*) ! name()

there seems to be a problem with the streamed evaluation of the two operands. An expression of the form child::*/ancestor::* is handled by a DocumentSorterAdjunct which is supposed to sort the ancestors of a node before the node itself; but in fact it seems to be sorting the entire node sequence, meaning that it's not really streamed at all.

Considering the first expression on its own,

/BOOKLIST/BOOKS/ITEM[1]/PRICE/ancestor::*

it seems that when ITEM[1] is processed, a watch is created to catch PRICE, which fires and computes PRICE/ancestor::*, which sits around in the DocumentSorter until another event comes along. When ITEM[2] is processed, the watch correctly recognises that it doesn't need to do anything; it also recognises that there will be no further matches on future ITEMS; but it doesn't send any close() event up the pipeline until the end of the sequence of ITEMs is notified.

So in combining the two branches of the union, all the events for BOOKLIST BOOKS ITEM[1] BOOKLIST BOOKS ITEM[2] actually arrive when the endDocument() is processed.

Actions #6

Updated by Michael Kay almost 2 years ago

With some careful tweaking to ensure that the combined stream of items is only closed once, I've now got the correct elements in the result: BOOKLIST BOOKS ITEM ITEM. The only problem is, the two items are in the wrong order. This is because the CombiningUnionFeed is designed to eliminate duplicates, but not to do a full sort into document order.

Items arrive at the CombiningUnionFeed in the order BOOKLIST BOOKS ITEM[2] BOOKLIST BOOKS ITEM[1] which implies we need to do a full sort into document order, which obviously isn't streamable.

Perhaps, though, we only need to do a full sort for items associated with the same WatchManager event (in this case all the items are associated with the endDocument() event, but that won't always be true. It's only true here because all the items in the incoming streams were buffered.)

This works for this test case...

Actions #7

Updated by Michael Kay almost 2 years ago

Now looking at sx-union-334 which does

/BOOKLIST/BOOKS/ITEM/PRICE/ancestor-or-self::*/@CAT union /BOOKLIST/BOOKS/ITEM/QUANTITY/ancestor-or-self::*/@CAT

and -335 which does

//PRICE/ancestor-or-self::*/@* union //QUANTITY/ancestor-or-self::*/@*

The problem is eliminating duplicates. Looking at the second case, we're going to get a sequence of attributes by looking upwards from a PRICE element, followed by another sequence looking upwards from a QUANTITY. How can we eliminate duplicates with buffering the entire sequence? The technique of flushing the buffer if the WatchManager position has moved doesn't work - it has moved from PRICE to QUANTITY but still we can get duplicates.

Surely it can't be that difficult? We have two sequences of nodes coming from separate sources, each of which is in document order. We can flush a node N1 from sequence S1 as soon as we see a node from S2 that is beyond N1 in document order. The problem about this rule is that we currently don't know which operand expression the incoming nodes come from, because they both arrive via the same append() call. So we need to find a way of distinguishing the two input sequences.

Actions #8

Updated by Michael Kay almost 2 years ago

  • Status changed from New to In Progress
  • Priority changed from Low to Normal
  • Applies to branch 10, 11, trunk added
  • Fix Committed on Branch 11, trunk added
  • Platforms .NET, Java added

I now have all the new streaming union tests working, and have committed/pushed the required changes on the 11.x and 12 branches (I'm not proposing to retrofit to 10.x).

The same design is applicable to intersect and except, and I'm now moving on to those.

Actions #9

Updated by Michael Kay almost 2 years ago

Tests for intersect with two streaming operands have been added, and are now working.

Actions #10

Updated by Michael Kay almost 2 years ago

The "except" case is giving me a little trouble. Given a case like (sx-except-316):

<xsl:copy-of select="(WEIGHT except *[@UNIT = 'oz'])"/>

we receive first a startSelectedParentNode() call for the first operand, immediately followed by a startSelectedParentNode() call for the second (assuming the same node satisfies both conditions). We can't register a receiver for the CopyOf action when the first event occurs until we know that the second event isn't going to occur. We need to start the CopyOf process while handling the start() event, but we can't respond to the absence of an event. (This is different from the intersect case where we only need to take action if there are two events).

Can we perhaps arrange things so the event for the second operand is guaranteed to come first?

Or can we rewrite the "except" as an "intersect", so we are notified of all events that are NOT selected by the second operand?

(Actually, this test -016 works, because the except expression can be translated to an equivalent motionless pattern. It's more complex conditions like test -017 where the problem arises.)

Actions #11

Updated by Michael Kay almost 2 years ago

Putting the watch for the second operand first on the watchlist seems to work for most of the tests, but not for sx-except-312, where the event for the first watch is still being reported before the second. The reason for this is that the second operand first adds a watch for BOOKS elements, and only when this is matched does it add a watch for ITEM elements, and this watch is queued up behind the watch for the first operand.

Perhaps the startSelectedParentNode() action for the LHS operand could add a watch to the WatchManager that matches the current element; this will automatically be added to the end of the list, and will fire for this node after all the other startElement events; it can then see whether there have been any events for the second operand, and immediately remove the watch.

Actions #12

Updated by Michael Kay almost 2 years ago

  • Status changed from In Progress to Resolved

The new tests (sx-UnionExpr-3xx, sx-IntersectExpr-3xx, sx-ExceptExpr-3xx) are all now working.

Changes committed on the 11.x and 12.x branches. Although this bug represents a W3C spec non-conformance, it has been present in the product for many years without anyone complaining, so I don't propose to retrofit the changes to the 10.x branch, which we are now managing for high stability.

Actions #13

Updated by Debbie Lockett over 1 year ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 11.4 added

Bug fix applied in the Saxon 11.4 maintenance release.

Please register to edit this issue

Also available in: Atom PDF