Bug #4557


saxon:parse-html() called twice
Description
When the following query is run with -t, the trace output shows that the parse-html() function is called twice:
declare namespace saxon = "http://saxon.sf.net/";
declare default element namespace "http://www.w3.org/1999/xhtml";
saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class='zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()
Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.
As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.
History
#1
Updated by Michael Kay 8 months ago
The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key() call.
Evaluation of fn:key() (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the saxon:parse-html() expression.
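The sequence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names (DoubleEvaluationSketch, parse, buildIndex, lookup) are hypothetical and are not Saxon classes. A counter stands in for the "Building tree" trace messages, and the index builder re-evaluates the source expression instead of reusing the document already supplied as the argument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model only: none of these names are Saxon classes or methods. */
class DoubleEvaluationSketch {

    /** Counts how often the stand-in for parse-html() runs. */
    static int parseCalls = 0;

    /** Stand-in for saxon:parse-html(): builds a new "document" on every call. */
    static List<String> parse() {
        parseCalls++;
        return new ArrayList<>(List.of("mergedtoprow", "other", "mergedtoprow"));
    }

    /** Stand-in for index construction from a pattern that embeds the source expression. */
    static Map<String, List<Integer>> buildIndex(Supplier<List<String>> patternSource) {
        // The embedded expression is evaluated again here, rather than
        // reusing the document that was passed as the lookup argument.
        List<String> rows = patternSource.get();
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            index.computeIfAbsent(rows.get(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    /** Stand-in for the key lookup: the argument is evaluated first, then the index is built. */
    static List<Integer> lookup(String value) {
        List<String> doc = parse();                       // first call: evaluating the argument
        Map<String, List<Integer>> index =
                buildIndex(DoubleEvaluationSketch::parse); // second call: building the index
        return index.getOrDefault(value, List.of());
    }

    public static void main(String[] args) {
        System.out.println(lookup("mergedtoprow") + " after " + parseCalls + " parse calls");
    }
}
```

Running main prints "[0, 2] after 2 parse calls": a single lookup costs two parses, which mirrors the two tree builds seen in the trace.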
It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to
saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()
then the problem still arises.
#2
Updated by Michael Kay 8 months ago
Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options).
When I change the query to use a local variable for the URI:
for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html')
return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()
I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.
Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.
If I change parse-xml(unparsed-text(XXX)) to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on doc().
#3
Updated by Michael Kay 8 months ago
The difference between the doc() case and parse-xml(unparsed-text()) appears to be in SlashExpression#378, where expressions starting with a call on the doc() function are handled specially. This affects streaming so I don't think it's wise to change it.
#4
Updated by Michael Kay 4 months ago
- Category set to Performance
- Status changed from New to In Progress
Coming back to this, I'm using the query
parse-xml('<a><b>2</b><b>3</b></a>')/a/b[.='2']
The tree is being built twice. The -explain output shows the rewritten query as
<query>
  <key name="Q{http://saxon.sf.net/}kk101" line="0" flags="u">
    <p.nodeSet test="NE nQ{}b">
      <slash baseUri="file:/Users/mike/team/xmark/"
             ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
             line="1">
        <slash simple="1">
          <fn name="parse-xml">
            <str val="&lt;a>&lt;b>2&lt;/b>&lt;b>3&lt;/b>&lt;/a>"/>
          </fn>
          <axis name="child" nodeTest="NE nQ{}a"/>
        </slash>
        <axis name="child" nodeTest="NE nQ{}b"/>
      </slash>
    </p.nodeSet>
    <cast baseUri="file:/Users/mike/team/xmark/"
          ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
          line="1"
          flags="a"
          as="1AS">
      <data diag="1|0||=">
        <dot type="1NE nQ{}b"/>
      </data>
    </cast>
  </key>
  <globalVariables/>
  <body>
    <docOrder baseUri="file:/Users/mike/team/xmark/"
              ns="err=~ fn=~ local=http://www.w3.org/2005/xquery-local-functions saxon=~ xs=~ xsi=~ xml=~"
              line="1"
              intra="1">
      <for var="Q{http://saxon.sf.net/generated-variable}dd1161082381"
           as="ND"
           slot="0">
        <fn role="in" name="parse-xml">
          <str val="&lt;a>&lt;b>2&lt;/b>&lt;b>3&lt;/b>&lt;/a>"/>
        </fn>
        <fn role="return" name="key">
          <str val="Q{http://saxon.sf.net/}kk101"/>
          <str val="2"/>
          <varRef name="Q{http://saxon.sf.net/generated-variable}dd1161082381" slot="0"/>
        </fn>
      </for>
    </docOrder>
  </body>
</query>
Stepping through the code, OptimizerEE.convertFilterExpressionToKey() has logic that avoids rewriting the filter expression to use a key if the filter expression creates a new document. This logic isn't being activated because the function metadata for parse-xml() doesn't have the NEW property. If we add this property, the rewrite doesn't take place, and the document is only built once.
Adding this property to parse-xml(), parse-xml-fragment(), saxon:parse(), saxon:parse-html(), saxon:new-NNNN().
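As a rough illustration of the guard described above, here is a minimal sketch of an optimizer decision that declines the key rewrite when the source function is flagged as creating new nodes. All names in it (NewNodeGuardSketch, Property, Plan, FunctionInfo, planFilter) are hypothetical and chosen for this example; it does not reproduce the actual code in OptimizerEE.convertFilterExpressionToKey().

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model only: every type and method name here is hypothetical, not Saxon's API. */
class NewNodeGuardSketch {

    /** Stand-in for a function property flag. */
    enum Property { NEW }

    /** The two plans the toy optimizer can choose between. */
    enum Plan { KEY_LOOKUP, PLAIN_FILTER }

    /** Minimal function metadata: a name plus its declared properties. */
    static final class FunctionInfo {
        final String name;
        final Set<Property> properties;

        FunctionInfo(String name, Set<Property> properties) {
            this.name = name;
            this.properties = properties;
        }

        boolean createsNewNodes() {
            return properties.contains(Property.NEW);
        }
    }

    /** Skip the key rewrite when the filtered expression builds a new document. */
    static Plan planFilter(FunctionInfo source) {
        return source.createsNewNodes() ? Plan.PLAIN_FILTER : Plan.KEY_LOOKUP;
    }

    public static void main(String[] args) {
        FunctionInfo before = new FunctionInfo("parse-xml", EnumSet.noneOf(Property.class));
        FunctionInfo after = new FunctionInfo("parse-xml", EnumSet.of(Property.NEW));
        System.out.println("without NEW: " + planFilter(before));
        System.out.println("with NEW:    " + planFilter(after));
    }
}
```

In this model the guard itself was always present; only the missing flag in the metadata let the rewrite through, which is why adding the property is enough to change the plan.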
#5
Updated by Michael Kay 4 months ago
- Status changed from In Progress to Resolved
- Priority changed from Low to Normal
- Applies to branch 10, trunk added
- Fix Committed on Branch 10, trunk added
#6
Updated by O'Neil Delpratt 3 months ago
Bug fix applied in the Saxon 10.3 maintenance release
#7
Updated by O'Neil Delpratt 3 months ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 10.3 added