Project

Profile

Help

Bug #4557

saxon:parse-html() called twice

Added by Michael Kay about 1 year ago. Updated 8 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2020-05-18
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
10, trunk
Fix Committed on Branch:
10, trunk
Fixed in Maintenance Release:

Description

When called with -t, the tracing reveals that in executing the following query, the parse-html() function is called twice:

declare namespace saxon = "http://saxon.sf.net/";

declare default element namespace "http://www.w3.org/1999/xhtml";

saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class=zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()

Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.

As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.

History

#1 Updated by Michael Kay about 1 year ago

The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key() call.

Evaluation of fn:key() (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the saxon:parse-html() expression.

It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to

saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()

then the problem still arises.

#2 Updated by Michael Kay about 1 year ago

Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options).

When I change the query to use a local variable for the URI:

for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html')
return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()

I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.

Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.

If I change parse-xml(unparsed-text(XXX)) to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on doc().

#3 Updated by Michael Kay about 1 year ago

The difference between the doc() case and parse-xml(unparsed-text()) appears to be in SlashExpression#378, where expressions starting with a call on the doc() function are handled specially. This affects streaming so I don't think it's wise to change it.

#4 Updated by Michael Kay 9 months ago

  • Category set to Performance
  • Status changed from New to In Progress

Coming back to this, I'm using the query

parse-xml('<a><b>2</b><b>3</b></a>')/a/b[.='2']

The tree is being built twice. The -explain output shows the rewritten query as

<query>
  <key name="Q{http://saxon.sf.net/}kk101" line="0" flags="u">
    <p.nodeSet test="NE nQ{}b">
      <slash baseUri="file:/Users/mike/team/xmark/"
             ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
             line="1">
        <slash simple="1">
          <fn name="parse-xml">
            <str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
          </fn>
          <axis name="child" nodeTest="NE nQ{}a"/>
        </slash>
        <axis name="child" nodeTest="NE nQ{}b"/>
      </slash>
    </p.nodeSet>
    <cast baseUri="file:/Users/mike/team/xmark/"
          ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
          line="1"
          flags="a"
          as="1AS">
      <data diag="1|0||=">
        <dot type="1NE nQ{}b"/>
      </data>
    </cast>
  </key>
  <globalVariables/>
  <body>
    <docOrder baseUri="file:/Users/mike/team/xmark/"
              ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
              line="1"
              intra="1">
      <for var="Q{http://saxon.sf.net/generated-variable}dd1161082381"
           as="ND"
           slot="0">
        <fn role="in" name="parse-xml">
          <str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
        </fn>
        <fn role="return" name="key">
          <str val="Q{http://saxon.sf.net/}kk101"/>
          <str val="2"/>
          <varRef name="Q{http://saxon.sf.net/generated-variable}dd1161082381" slot="0"/>
        </fn>
      </for>
    </docOrder>
  </body>
</query>

Stepping through the code, OptimizerEE.convertFilterExpressionToKey() has logic that avoids rewriting the filter expression to use a key if the filter expression creates a new document. This logic isn't being activated because the function metadata for parse-xml() doesn't have the NEW property. If we add this property, the rewrite doesn't take place, and the document is only built once.

Adding this property to parse-xml(), parse-xml-fragment(), saxon:parse(), saxon:parse-html(), saxon:new-NNNN().

#5 Updated by Michael Kay 9 months ago

  • Status changed from In Progress to Resolved
  • Priority changed from Low to Normal
  • Applies to branch 10, trunk added
  • Fix Committed on Branch 10, trunk added

#6 Updated by O'Neil Delpratt 8 months ago

Bug fix applied in the Saxon 10.3 maintenance release

#7 Updated by O'Neil Delpratt 8 months ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 10.3 added

Please register to edit this issue

Also available in: Atom PDF