Bug #4557: saxon:parse-html() called twice - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #4557

closed

saxon:parse-html() called twice

Added by Michael Kay over 4 years ago. Updated over 4 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2020-05-18

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

10, trunk

Fix Committed on Branch:

10, trunk

Fixed in Maintenance Release:

10.3

Platforms:

Description

When called with -t, the tracing reveals that in executing the following query, the parse-html() function is called twice:

declare namespace saxon = "http://saxon.sf.net/";

declare default element namespace "http://www.w3.org/1999/xhtml";

saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class=zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()

Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.

As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.

Actions

Copy link

Updated by Michael Kay over 4 years ago

The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key() call.

Evaluation of fn:key() (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the saxon:parse-html() expression.

It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to

saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()

then the problem still arises.

Actions

Copy link

Updated by Michael Kay over 4 years ago

Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options).

When I change the query to use a local variable for the URI:

for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html')
return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()

I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.

Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.

If I change parse-xml(unparsed-text(XXX)) to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on doc().

Actions

Copy link

Updated by Michael Kay over 4 years ago

The difference between the doc() case and parse-xml(unparsed-text()) appears to be in SlashExpression#378, where expressions starting with a call on the doc() function are handled specially. This affects streaming so I don't think it's wise to change it.

Actions

Copy link

Updated by Michael Kay over 4 years ago

Category set to Performance
Status changed from New to In Progress

Coming back to this, I'm using the query

parse-xml('<a><b>2</b><b>3</b></a>')/a/b[.='2']

The tree is being built twice. The -explain output shows the rewritten query as

<query>
  <key name="Q{http://saxon.sf.net/}kk101" line="0" flags="u">
    <p.nodeSet test="NE nQ{}b">
      <slash baseUri="file:/Users/mike/team/xmark/"
             ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
             line="1">
        <slash simple="1">
          <fn name="parse-xml">
            <str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
          </fn>
          <axis name="child" nodeTest="NE nQ{}a"/>
        </slash>
        <axis name="child" nodeTest="NE nQ{}b"/>
      </slash>
    </p.nodeSet>
    <cast baseUri="file:/Users/mike/team/xmark/"
          ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
          line="1"
          flags="a"
          as="1AS">
      <data diag="1|0||=">
        <dot type="1NE nQ{}b"/>
      </data>
    </cast>
  </key>
  <globalVariables/>
  <body>
    <docOrder baseUri="file:/Users/mike/team/xmark/"
              ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
              line="1"
              intra="1">
      <for var="Q{http://saxon.sf.net/generated-variable}dd1161082381"
           as="ND"
           slot="0">
        <fn role="in" name="parse-xml">
          <str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
        </fn>
        <fn role="return" name="key">
          <str val="Q{http://saxon.sf.net/}kk101"/>
          <str val="2"/>
          <varRef name="Q{http://saxon.sf.net/generated-variable}dd1161082381" slot="0"/>
        </fn>
      </for>
    </docOrder>
  </body>
</query>

Stepping through the code, OptimizerEE.convertFilterExpressionToKey() has logic that avoids rewriting the filter expression to use a key if the filter expression creates a new document. This logic isn't being activated because the function metadata for parse-xml() doesn't have the NEW property. If we add this property, the rewrite doesn't take place, and the document is only built once.

Adding this property to parse-xml(), parse-xml-fragment(), saxon:parse(), saxon:parse-html(), saxon:new-NNNN().

Actions

Copy link