Project

Profile

Help

Bug #2092

closed

Multiple Definitions for the Same Key performs badly if the key contains duplicates

Added by Ronald Ikes over 10 years ago. Updated over 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2014-06-20
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Maybe I got something completely wrong with XSLT key definitions, but for me both key definitions in the XSL are semantically the same resulting in finding nodes by either given a value of 'id17' or '#id17'

When executing the attached stylesheet on a large (6MB) XML (also attached) it spends a long time in generating the @slowKey@.

This definition is fast

This definition is damn slow

<xsl:key name="SlowKey" match="*" use="@id"/>

Complete Stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
    <xsl:strip-space elements="*"/>

    <!-- This definition is fast -->
    <xsl:key name="FastKey" match="*" use="@id,concat('#', @id)"/>

    <!-- This definition is damn slow -->
    <xsl:key name="SlowKey" match="*" use="@id"/>
    <xsl:key name="SlowKey" match="*" use="concat('#', @id)"/>

    <!-- start with the /PLMXML export root -->
    <xsl:template match="/PLMXML">
        <REPORT>
            <xsl:message select="'DIRECT SEARCH:    ',ProductRevision[@id eq 'id2']/@id"/>
            <xsl:message select="'DIRECT SEARCH:    ',ProductRevision[concat('#',@id) eq '#id2']/@id"/>
            <xsl:message select="'KEY SEARCH no#:   ',key('FastKey', 'id2')/@id"/>
            <xsl:message select="'KEY SEARCH with#: ',key('FastKey', '#id2')/@id"/>
            <!-- before the next line is printed it takes a long long time -->
            <xsl:message select="'KEY SEARCH no#:   ',key('SlowKey', 'id2')/@id"/>
            <xsl:message select="'KEY SEARCH with#: ',key('SlowKey', '#id2')/@id"/>
        </REPORT>
    </xsl:template>

</xsl:stylesheet>

Sample lines from source XML

<Header id="id1"/>
<ProductRevision id="id2">
    <ApplicationRef/>
    <UserData id="id7490">
        <UserValue/>
        ...
        <UserValue>
            <UserList id="id7486">
                <Item/>
                <Item/>
                <Item/>
                ...
                <Item/>
            </UserList>
            </UserValue>
         <UserValue/>
         <UserValue>
            <UserList id="id7487"/>
         </UserValue>
         <UserValue>
            <UserList id="id7488"/>
         </UserValue>
         <UserValue>
            <UserList id="id7489"/>
         </UserValue>
         <UserValue/>
	</UserData>
    <AssociatedForm id="id39388"/>
</ProductRevision>
<ProductRevision id="id31">
    <ApplicationRef/>
    <UserData id="id33">
        <UserValue/>
        <UserValue/>
        <UserValue/>
        <UserValue/>
        <UserValue/>
        <UserValue/>
        <UserValue/>
        <UserValue/>
    </UserData>
</ProductRevision>
<ProductRevision id="id36">
    <ApplicationRef/>
    ...

Files

keybug.zip (224 KB) keybug.zip sample input file contained (6MB) Ronald Ikes, 2014-06-20 10:29
keybug.xsl (1.31 KB) keybug.xsl Ronald Ikes, 2014-06-20 10:50
Actions #1

Updated by Michael Kay over 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Michael Kay

Thanks for reporting it. I've reproduced the results and I can't see any obvious reason for them. I would expect that the composite key might take twice as long to construct because the source document is scanned twice, but I wouldn't expect a bigger difference than that. The slow processing phase is the second phase, where it builds the index for

<xsl:key name="SlowKey" match="*" use="concat('#', @id)"/>

and at first glance I can't see anything obviously wrong here: the code path in the debugger is what I would expect, as is the Java execution profile.

Actions #2

Updated by Michael Kay over 10 years ago

OK, I've identified the cause.

In your data, not all elements have an @id attribute, and therefore (with both indexes) there is an index entry for the key "#" which contains a long list of nodes (all those with no @id attribute). While building a single-component index, or in the first pass of building a multi-component index, Saxon knows that if two elements are encountered with the same key value, the two elements will be encountered in document order, so the second element can be added to the end of the list - the list of nodes with a particular key value is maintained in document order, to reflect the semantics of the key() function. But if a duplicate key is found during a second or subsequent pass, the element needs to be inserted into the list at the correct position, and this is done by comparing the node's document order with all other nodes in the list. Given that in your scenario, this list contains most of the elements in the document, this becomes an O(n^2) process.

I think we have to ask whether this is such an unusual thing to be doing that it is worth fixing.

One fix would be to process all the key definitions in a single pass of the source document, but that's a significant upheaval of the current code.

Another fix would be to append new nodes to the end of the list, and then do a postprocess of the index to sort all node-list entries into document order.

Actions #3

Updated by Michael Kay over 10 years ago

  • Status changed from In Progress to Resolved
  • Found in version changed from 9.1 and 9.5he to 9.5

I have patched the code on the 9.5 and 9.6 branches so the insertion logic starts looking at the end of the existing node list instead of the start. This makes the performance much better in (hopefully) the common case where the new node follows all existing nodes in document order.

Note, the problem also affects all known earlier releases.

Actions #4

Updated by Michael Kay over 10 years ago

  • Subject changed from Multiple Definitions for the Same Key performs very bad if also used with Composite Keys to Multiple Definitions for the Same Key performs badly if the key contains duplicates
Actions #5

Updated by Ronald Ikes over 10 years ago

As you said, "hopefully" ;-). Your fix still seems to handle sequence key definitions and two separate definitions for the same key different. So uses may not reliably use the second method and achieve the same performance on the same thing. From your writing I guess, simple reordering the XMl may have negative effects on the performance.

For me this leads to not using multiple-key definitions but key sequences.

I would prefer to have a solution that processes the definitions in the same way checking all the defined keys in one pass. Looks like you do this for a sequence of keys. Although I don't know your code for me it does not seem so much change to first collect all keys with the same name and process their definitions.

Actions #6

Updated by Michael Kay over 10 years ago

In the case where the multiple key definitions use the same pattern (here match="*") merging them and indexing on a single pass would make sense. Where they use different match patterns, e.g. match="chapter" and match="appendix", this would not necessarily be the case.

Adding complexity to the code for optimization purposes can only be justified if either the situation occurs quite often, or if it occurs occasionally and the gains are dramatic. I'm not convinced that doing more work in this area, beyond the change just implemented, would be justified.

Actions #7

Updated by Ronald Ikes over 10 years ago

Totally agreed.

I haven't though about the different match-attributes. I think common use is to define multiple keys for the same match as a sequence and for different matches as multiple keys.

Actions #8

Updated by O'Neil Delpratt over 10 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in version set to 9.5.1.6

Bug fix applied in Saxon maintenance release 9.5.1.6

Please register to edit this issue

Also available in: Atom PDF