Bug #4557
closed
saxon:parse-html() called twice
100%
Description
When Saxon is run with the -t option, the tracing reveals that in executing the following query, the parse-html() function is called twice:
declare namespace saxon = "http://saxon.sf.net/";
declare default element namespace "http://www.w3.org/1999/xhtml";
saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class='zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()
Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.
As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.
Updated by Michael Kay over 4 years ago
The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key() call.
Evaluation of fn:key() (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the saxon:parse-html() expression.
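To make the double evaluation concrete, here is a minimal Java sketch. It is not Saxon code: the class and its methods parse(), buildIndex() and run() are hypothetical stand-ins, assuming only what is described above, namely that the source expression is evaluated once for the argument of fn:key() and once more when the index is constructed from a pattern that embeds the same expression.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class DoubleEvaluationSketch {
    // Counts how often the "document" is built
    static int parseCalls = 0;

    // Hypothetical stand-in for saxon:parse-html(): every call builds a fresh tree
    static List<String> parse() {
        parseCalls++;
        return new ArrayList<>(List.of("mergedtoprow", "other", "mergedtoprow"));
    }

    // Hypothetical stand-in for index construction: the pattern that selects
    // the nodes to index embeds the original expression, so it is evaluated again
    static Map<String, List<Integer>> buildIndex(Supplier<List<String>> pattern) {
        List<String> nodes = pattern.get();
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            index.computeIfAbsent(nodes.get(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    static int run() {
        parseCalls = 0;
        // First call: evaluating the third argument of the key() call
        List<String> doc = parse();
        // Second call: building the index re-evaluates the same expression
        Map<String, List<Integer>> index = buildIndex(DoubleEvaluationSketch::parse);
        index.getOrDefault("mergedtoprow", List.of());
        return parseCalls;
    }

    public static void main(String[] args) {
        System.out.println("parse calls: " + run()); // prints "parse calls: 2"
    }
}
```

In this sketch the second tree is a different object from the first, which is why caching the result of a single evaluation would not by itself explain the behaviour seen with doc().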
It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to
saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()
then the problem still arises.
Updated by Michael Kay over 4 years ago
Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options).
When I change the query to use a local variable for the URI:
for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html')
return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()
I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.
Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.
If I change parse-xml(unparsed-text(XXX)) to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on doc().
Updated by Michael Kay over 4 years ago
The difference between the doc() case and parse-xml(unparsed-text()) appears to be in SlashExpression#378, where expressions starting with a call on the doc() function are handled specially. This affects streaming so I don't think it's wise to change it.
Updated by Michael Kay about 4 years ago
- Category set to Performance
- Status changed from New to In Progress
Coming back to this, I'm using the query
parse-xml('<a><b>2</b><b>3</b></a>')/a/b[.='2']
The tree is being built twice. The -explain output shows the rewritten query as
<query>
<key name="Q{http://saxon.sf.net/}kk101" line="0" flags="u">
<p.nodeSet test="NE nQ{}b">
<slash baseUri="file:/Users/mike/team/xmark/"
ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
line="1">
<slash simple="1">
<fn name="parse-xml">
<str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
</fn>
<axis name="child" nodeTest="NE nQ{}a"/>
</slash>
<axis name="child" nodeTest="NE nQ{}b"/>
</slash>
</p.nodeSet>
<cast baseUri="file:/Users/mike/team/xmark/"
ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
line="1"
flags="a"
as="1AS">
<data diag="1|0||=">
<dot type="1NE nQ{}b"/>
</data>
</cast>
</key>
<globalVariables/>
<body>
<docOrder baseUri="file:/Users/mike/team/xmark/"
ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
line="1"
intra="1">
<for var="Q{http://saxon.sf.net/generated-variable}dd1161082381"
as="ND"
slot="0">
<fn role="in" name="parse-xml">
<str val="&lt;a&gt;&lt;b&gt;2&lt;/b&gt;&lt;b&gt;3&lt;/b&gt;&lt;/a&gt;"/>
</fn>
<fn role="return" name="key">
<str val="Q{http://saxon.sf.net/}kk101"/>
<str val="2"/>
<varRef name="Q{http://saxon.sf.net/generated-variable}dd1161082381" slot="0"/>
</fn>
</for>
</docOrder>
</body>
</query>
Stepping through the code, OptimizerEE.convertFilterExpressionToKey() has logic that avoids rewriting the filter expression to use a key if the filter expression creates a new document. This logic isn't being activated because the function metadata for parse-xml() doesn't have the NEW property. If we add this property, the rewrite doesn't take place, and the document is only built once.
Adding this property to parse-xml(), parse-xml-fragment(), saxon:parse(), saxon:parse-html(), saxon:new-NNNN().
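The effect of the NEW property on the rewrite decision can be sketched as follows. This is an illustration only, not the code in OptimizerEE: the names Property, FunctionInfo and mayConvertFilterToKey() are hypothetical, and the only fact taken from the analysis above is that the key rewrite is skipped when the base expression is flagged as creating a new document.

```java
import java.util.EnumSet;
import java.util.Set;

public class NewPropertyGuardSketch {
    // Hypothetical property flags; only NEW is mentioned in this issue
    enum Property { NEW, OTHER }

    // Hypothetical stand-in for a function's metadata entry
    static final class FunctionInfo {
        final String name;
        final Set<Property> properties;

        FunctionInfo(String name, Set<Property> properties) {
            this.name = name;
            this.properties = properties;
        }
    }

    // The filter-to-key rewrite is only allowed when the base expression
    // does not create a new document on each evaluation
    static boolean mayConvertFilterToKey(FunctionInfo base) {
        return !base.properties.contains(Property.NEW);
    }

    public static void main(String[] args) {
        // Before the fix: parse-xml() lacks NEW, so the rewrite goes ahead
        FunctionInfo before = new FunctionInfo("parse-xml", EnumSet.noneOf(Property.class));
        // After the fix: parse-xml() carries NEW, so the rewrite is suppressed
        FunctionInfo after = new FunctionInfo("parse-xml", EnumSet.of(Property.NEW));

        System.out.println(mayConvertFilterToKey(before)); // prints "true"
        System.out.println(mayConvertFilterToKey(after));  // prints "false"
    }
}
```

With the flag set, the filter expression is left as an ordinary filter, so the document-creating function is evaluated only once.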
Updated by Michael Kay about 4 years ago
- Status changed from In Progress to Resolved
- Priority changed from Low to Normal
- Applies to branch 10, trunk added
- Fix Committed on Branch 10, trunk added
Updated by O'Neil Delpratt about 4 years ago
Bug fix applied in the Saxon 10.3 maintenance release
Updated by O'Neil Delpratt about 4 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 10.3 added