saxon:parse-html() called twice
When Saxon is run with the -t option, the trace output shows that executing the following query calls the parse-html() function twice:
declare namespace saxon = "http://saxon.sf.net/"; declare default element namespace "http://www.w3.org/1999/xhtml"; saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class = 'zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()
Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.
As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.
#1 Updated by Michael Kay 3 months ago
The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key(), which (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the
It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to
saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()
then the problem still arises.
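The sequence described in this comment can be modelled with a small toy program. This is only a sketch under stated assumptions: none of the names below are Saxon classes, parseHtml() is a hypothetical stand-in for saxon:parse-html(), and the "index" is a plain map of class-attribute values to counts. It shows why an index builder that re-runs the source expression triggers a second parse, while one that receives the already-evaluated nodes does not.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model only: none of these names are Saxon classes. */
class DoubleEvaluationSketch {

    /** Counts how often the stand-in for saxon:parse-html() runs. */
    static int parseCalls = 0;

    /** Hypothetical stand-in for parse-html(): every call "builds" a new tree. */
    static List<String> parseHtml() {
        parseCalls++;
        return List.of("mergedtoprow", "other", "mergedtoprow");
    }

    /** Builds the index from nodes that have already been evaluated. */
    static Map<String, Integer> buildIndexFrom(List<String> nodes) {
        Map<String, Integer> index = new HashMap<>();
        for (String cls : nodes) {
            index.merge(cls, 1, Integer::sum);
        }
        return index;
    }

    /** Builds the index by re-running the source expression (assumed behaviour). */
    static Map<String, Integer> buildIndexByReevaluating(Supplier<List<String>> source) {
        return buildIndexFrom(source.get());
    }

    /** Models the reported behaviour; returns the number of parse calls. */
    static int lookupWithReevaluation(String key) {
        parseCalls = 0;
        parseHtml(); // first call: evaluating the argument of the key lookup
        Map<String, Integer> index =
                buildIndexByReevaluating(DoubleEvaluationSketch::parseHtml); // second call
        index.getOrDefault(key, 0);
        return parseCalls;
    }

    /** Models a variant where the evaluated argument is reused for indexing. */
    static int lookupWithSharedResult(String key) {
        parseCalls = 0;
        List<String> nodes = parseHtml(); // the only call
        Map<String, Integer> index = buildIndexFrom(nodes);
        index.getOrDefault(key, 0);
        return parseCalls;
    }

    public static void main(String[] args) {
        System.out.println("re-evaluating: " + lookupWithReevaluation("mergedtoprow"));
        System.out.println("shared result: " + lookupWithSharedResult("mergedtoprow"));
    }
}
```

Whether reusing the evaluated argument is the right fix inside Saxon is a separate question; the sketch only illustrates where the second call comes from.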
#2 Updated by Michael Kay 3 months ago
Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options.)
When I change the query to use a local variable for the URI:
for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html') return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()
I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.
Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.
If I change the call to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on