XQJ: html page screen scrap

Added by Anonymous about 13 years ago

Legacy ID: #9388554 Legacy Poster: Chris (chr1sbau)

I have been trying to use the XQJ API to screen-scrape an HTML page that has been loaded into a DOM via HTML Tidy. The query I am trying to run is:

[code]{ for $x in //div[@id='ctl09_RacePanel'] return {$x} }[/code]

When I run the query, the result is just empty tags. However, if I change for $x in .//div[@id='ctl09_RacePanel'] to for $x in //*[@id='ctl09_RacePanel'], the query returns the HTML inside the div tag, which is what I want. I need to filter the results down further but cannot, as even //*[@id='ctl09_RacePanel']/div/h3 returns empty tags.

I have checked my original query using the XQuisitor GUI tool and it works, just not when I try to implement it using XQJ. The XQJ code I am using is:

[code]
Document doc = fetchPage(); // fetches the page, runs HTML Tidy and returns a W3C DOM document
SaxonXQDataSource ds = new SaxonXQDataSource(config);
XQConnection con = ds.getConnection();
XQItem item = con.createItemFromNode(doc.getChildNodes().item(1), con.createNodeType());
XQPreparedExpression xpres = con.prepareExpression(queryabove);
xpres.bindItem(XQConstants.CONTEXT_ITEM, item);
XQResultSequence seq = xpres.executeQuery();
[/code]

I'm new to XQuery and XQJ, so I'm not sure whether the problem is with my XQJ code or with the XQuery I'm trying to run.


Replies (1)

RE: XQJ: html page screen scrap - Added by Anonymous about 13 years ago

Legacy ID: #9389790 Legacy Poster: Michael Kay (mhkay)

If you look more closely at your source XML you will almost certainly find that the elements are in a namespace, probably http://www.w3.org/1999/xhtml. So if you want to select elements from this namespace, you will need to start your query with

[code]declare default element namespace "http://www.w3.org/1999/xhtml";[/code]

Unfortunately this will have the side-effect of putting your output elements (run and race) in this namespace as well, which is probably not what you want. The workaround is to bind a specific prefix

[code]declare namespace h = "http://www.w3.org/1999/xhtml";[/code]

then write

[code]for $x in //h:div[...] return ...[/code]

This is a weakness in the design of the XQuery language.

Please note that this forum isn't really intended for general XQuery coding help that's independent of the Saxon product. You should try the talk @ x-query.com mailing list, or stackoverflow.com.
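The name-matching rule behind this can be reproduced with nothing but the JDK. The sketch below is an illustration under stated assumptions, not the poster's setup: it uses the JDK's built-in XPath 1.0 engine instead of Saxon's XQJ, and the class name XhtmlNamespaceDemo, the SAMPLE markup and the count helper are all hypothetical stand-ins for the Tidy output and the real query. It only shows that an unprefixed div does not match an element in the XHTML namespace, while a wildcard or a bound prefix does.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the Tidy output: every element is in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
      + "<div id=\"ctl09_RacePanel\"><div><h3>Race 1</h3></div></div>"
      + "</body></html>";

    // Binds the prefix "h" to the XHTML namespace, like "declare namespace h = ..." in XQuery.
    static final NamespaceContext H_PREFIX = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "h" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            String p = getPrefix(uri);
            return p == null ? Collections.<String>emptyIterator()
                             : Collections.singletonList(p).iterator();
        }
    };

    // Parses xml into a namespace-aware DOM and returns how many nodes the path selects.
    static int count(String xml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the DOM carries no namespace information
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            xp.setNamespaceContext(H_PREFIX);
            NodeList hits = (NodeList) xp.evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            "//div[@id='ctl09_RacePanel']",              // 0: unprefixed name means "no namespace"
            "//*[@id='ctl09_RacePanel']",                // 1: wildcard ignores the namespace
            "//*[@id='ctl09_RacePanel']/div/h3",         // 0: unprefixed child steps fail again
            "//h:div[@id='ctl09_RacePanel']/h:div/h:h3"  // 1: every step uses the bound prefix
        };
        for (String p : paths) {
            System.out.println(count(SAMPLE, p) + "  " + p);
        }
    }
}
```

The four counts mirror the behaviour described in the question: the wildcard form finds the panel, but any further unprefixed step selects nothing until each step carries the h: prefix. In the XQJ case the same effect should come from putting the declare namespace h line at the top of the query text passed to prepareExpression.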
