Bug #4557

saxon:parse-html() called twice

Added by Michael Kay 3 months ago. Updated 3 months ago.

Status: New
Priority: Low
Assignee:
Category: -
Sprint/Milestone: -
Start date: 2020-05-18
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:

Description

When the query below is run with -t, the tracing reveals that the saxon:parse-html() function is called twice:

declare namespace saxon = "http://saxon.sf.net/";

declare default element namespace "http://www.w3.org/1999/xhtml";

saxon:parse-html(unparsed-text('file:///xxxx/yyyy/profile.html'))//table[@class='zzzz']//tr[@class = 'mergedtoprow'][th = 'Country']/td//a//text()

Preliminary investigation shows the expression is translated into a call on the key() function. One call on parse-html() occurs while the index is being built, the other occurs during the key lookup.

As a completely separate question, there's no point in building an index if it is only going to be used once, and I'm surprised this isn't recognized as being the case.
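The cost argument can be illustrated with a toy model in Java. This is only a sketch under stated assumptions: the class `IndexCostSketch`, its methods, and the "node visits" cost measure are all hypothetical and are not Saxon code. It counts how many nodes are visited when a predicate is evaluated by a plain scan and when an index is built first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not Saxon code: compares node visits for scan vs. index.
class IndexCostSketch {
    // Number of node visits, the only cost measure in this toy.
    static long visits = 0;

    // Direct filter: visits every node once per lookup.
    static List<Integer> scan(List<String> nodes, String key) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            visits++;
            if (nodes.get(i).equals(key)) hits.add(i);
        }
        return hits;
    }

    // Index construction: also visits every node once (plus map overhead, not counted).
    static Map<String, List<Integer>> buildIndex(List<String> nodes) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            visits++;
            index.computeIfAbsent(nodes.get(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    static long visitsForScan(List<String> nodes, int lookups) {
        visits = 0;
        for (int n = 0; n < lookups; n++) scan(nodes, "mergedtoprow");
        return visits;
    }

    static long visitsForIndex(List<String> nodes, int lookups) {
        visits = 0;
        Map<String, List<Integer>> index = buildIndex(nodes);
        // Lookups after construction visit no further nodes.
        for (int n = 0; n < lookups; n++) index.getOrDefault("mergedtoprow", List.of());
        return visits;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("mergedtoprow", "other", "mergedtoprow", "other");
        System.out.println("1 lookup:  scan=" + visitsForScan(rows, 1)
                + " index=" + visitsForIndex(rows, 1));
        System.out.println("5 lookups: scan=" + visitsForScan(rows, 5)
                + " index=" + visitsForIndex(rows, 5));
    }
}
```

In this model a single lookup visits the same number of nodes either way, so the index only adds allocation overhead; it pays off only when the lookup is repeated.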

History

#1 Updated by Michael Kay 3 months ago

The expression translates into a call on fn:key() in which the call on parse-html() appears in the third argument. The first call on saxon:parse-html() happens while evaluating the arguments of this fn:key() call.

Evaluation of fn:key() (for the first time) then leads to a call on KeyIndex.constructIndex(). This calls NodeSetPattern.selectNodes() to select the nodes that need to be indexed; and this involves re-evaluation of the saxon:parse-html() expression.
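The sequence described above can be mimicked with a small Java model. It is a sketch only: `DoubleEvaluationSketch`, `ToyKeyIndex`, and `parseHtml()` are hypothetical stand-ins and do not reproduce Saxon's KeyIndex or NodeSetPattern classes. The point is just that a lazily built index which re-runs the selecting expression causes a second evaluation on top of the one made for the function arguments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical toy model, not Saxon code.
class DoubleEvaluationSketch {
    // Counts evaluations of the stand-in for saxon:parse-html().
    static int parseCalls = 0;

    // Stand-in for parsing: returns the class attribute of each "row".
    static List<String> parseHtml() {
        parseCalls++;
        return List.of("mergedtoprow", "other", "mergedtoprow");
    }

    // Toy key index that is constructed on first use.
    static class ToyKeyIndex {
        private Map<String, List<Integer>> index; // null until first lookup

        List<Integer> lookup(String key, Supplier<List<String>> selectNodes) {
            if (index == null) {
                index = new HashMap<>();
                // Building the index re-evaluates the selecting expression.
                List<String> nodes = selectNodes.get();
                for (int i = 0; i < nodes.size(); i++) {
                    index.computeIfAbsent(nodes.get(i), k -> new ArrayList<>()).add(i);
                }
            }
            return index.getOrDefault(key, List.of());
        }
    }

    static List<Integer> run() {
        parseCalls = 0;
        ToyKeyIndex idx = new ToyKeyIndex();
        // First evaluation: computing the arguments of the key() call.
        // The toy does not use the value further.
        List<String> thirdArgument = parseHtml();
        // Second evaluation: happens inside index construction.
        return idx.lookup("mergedtoprow", DoubleEvaluationSketch::parseHtml);
    }

    public static void main(String[] args) {
        System.out.println("matches=" + run() + " parseCalls=" + parseCalls);
    }
}
```

In the toy, passing the already evaluated value into the index builder instead of the supplier would avoid the second call; whether an equivalent change is feasible in the real code is not something this sketch can show.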

It's complicated by the fact that there are two predicates in this query that are both indexable, which results in creation of two separate indexes. If I simplify the query to

saxon:parse-html(unparsed-text('file://xxx/profile.html'))//table//tr [@class = 'mergedtoprow'] /td//text()

then the problem still arises.

#2 Updated by Michael Kay 3 months ago

Out of frustration with the poor messages being output, I have implemented a second optional argument for parse-html() and parse() to supply the base URI. (Perhaps for extensibility it should be a full-blown options map, with base-uri as one of the options.)

When I change the query to use a local variable for the URI:

for $u in ('file:///xxx/profile.html', 'file:///xxx/books.html')
return saxon:parse-html(unparsed-text($u), $u)/table/tr [@class = 'mergedtoprow']/td/text()

I'm only getting one call on parse-html() for each document. But it's no longer using a key index, so we've simply bypassed the problem.

Now trying with parse-xml() in place of saxon:parse-html(). It's still using an index. But I'm puzzled that we now don't get any "Building tree... " messages. It seems that saxon:parse() and saxon:parse-html() construct the Builder using Controller.makeBuilder(), whereas fn:parse-xml() uses TreeModel.TINY_TREE.makeBuilder. I can't see any obvious reason for this difference. Changed parse-xml() to use controller.makeBuilder(), and I now get the messages: indeed, I get two messages, as with the parse-html() case.

If I change parse-xml(unparsed-text(XXX)) to doc(XXX), I still get "optimization" to use a key, but this time the document is only built once. I thought that might be because the result of doc() is cached, but this isn't the explanation: internally, there is only one call on doc().

#3 Updated by Michael Kay 3 months ago

The difference between the doc() case and parse-xml(unparsed-text()) appears to be in SlashExpression#378, where expressions starting with a call on the doc() function are handled specially. This affects streaming so I don't think it's wise to change it.
