Neophyte question concerning XQuery results.
Added by Anonymous over 16 years ago
Legacy ID: #5145542 Legacy Poster: Jake Gage (dispader)
Hello there,

I'm always loath to bother already busy people, but I've been banging my head against a wall for a while, trying to understand what is going wrong in one of my first uses of XQuery ever. I was wondering if someone might be willing to lend a hand.

I'm attempting to write an application that culls (screen scrapes) some specific information from Wikipedia pages, and I just can't understand the results that I'm seeing. But I'll start with a specific question. I've got the following code:

<code>
package org.ghennom;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Properties;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.stream.StreamResult;

import net.sf.saxon.Configuration;
import net.sf.saxon.dom.DocumentWrapper;
import net.sf.saxon.query.DynamicQueryContext;
import net.sf.saxon.query.StaticQueryContext;
import net.sf.saxon.query.XQueryExpression;
import net.sf.saxon.trans.XPathException;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

/**
 * A screen scraping utility which takes game data from Wikipedia.
 *
 * @author Jake Gage <jake.gage@gmail.com>
 */
public class GameDataUtility {

    /**
     * Main test method for extracting Wikipedia data via screen scraping.
     *
     * @param args
     */
    public static void main(String[] args) {

        // Set up test variables.
        //
        String extractURL = "http://en.wikipedia.org/wiki/Defender_(arcade_game)";
        String firstQueryString =
              "<results> { for $dataTable in //* \n"
            + "    where contains($dataTable/@class, \"infobox vevent\") \n"
            + "    return $dataTable } </results> \n";
        String secondQueryString =
              "<results> { for $dataTable in //table \n"
            + "    where contains($dataTable/@class, \"infobox vevent\") \n"
            + "    return $dataTable } </results> \n";

        // Run the tests.
        //
        System.out.println("\n :: testing first query :: \n");
        executeSimpleXPathQuery(extractURL, firstQueryString);
        System.out.println("\n :: testing second query :: \n");
        executeSimpleXPathQuery(extractURL, secondQueryString);
    }

    /**
     * Attempts to retrieve an HTTP response from the given URL request
     * location, and apply the given XPath query, printing out the results to
     * {@code stdout}.
     */
    private static void executeSimpleXPathQuery(String urlString, String xQueryString) {

        // Open a stream on the URL.
        //
        URL wikipediaGameURL = null;
        try {
            wikipediaGameURL = new URL(urlString);
        } catch (MalformedURLException e) {
            System.err.println("source URL not well-formed: " + urlString);
            System.exit(1);
        }
        InputStream wikipediaGamePageInputStream = null;
        try {
            HttpURLConnection wikipediaGameURLConnection =
                (HttpURLConnection) wikipediaGameURL.openConnection();
            wikipediaGamePageInputStream = wikipediaGameURLConnection.getInputStream();
        } catch (IOException ioException) {
            System.err.println("error opening URL connection: " + ioException.getMessage());
            System.exit(1);
        }

        // Create a Tidy object.
        //
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document tidyDom = tidy.parseDOM(wikipediaGamePageInputStream, null);
        System.out.println("successfully fetched and tidied URL: " + urlString);

        // Set up some XQuery expression.
        //

        // Now that all that's done, attempt to apply an XQuery expression.
        //
        final Configuration configuration = new Configuration();
        final StaticQueryContext sqc = new StaticQueryContext(configuration);
        XQueryExpression xQueryExpression = null;
        try {
            xQueryExpression = sqc.compileQuery(xQueryString);
        } catch (XPathException e) {
            System.err.println("XQuery did not compile correctly: " + xQueryString);
        }
        final DynamicQueryContext dynamicContext = new DynamicQueryContext(configuration);
        DocumentWrapper documentWrapper =
            new DocumentWrapper(tidyDom, wikipediaGameURL.toString(), configuration);
        dynamicContext.setContextItem(documentWrapper);
        final Properties props = new Properties();
        props.setProperty(OutputKeys.METHOD, "xml");
        props.setProperty(OutputKeys.INDENT, "yes");
        System.out.println("executing XQuery: \n" + xQueryString);
        try {
            xQueryExpression.run(dynamicContext, new StreamResult(System.out), props);
        } catch (XPathException xpe) {
            System.err.println("error executing XQuery: " + xpe.getMessage());
        }
        System.out.println();
    }
}
</code>

This runs two small XQuery expressions, culled down for simplicity (but wrapping the results so the simple pretty-print is less error prone). I've got a few questions, but the first is this: I'm searching for a table element that has a "class" attribute defined. If you run the code, you should see that the first query finds it and the second doesn't. Would it be possible for someone to help me understand why?

executing XQuery:
<results> { for $dataTable in //*
    where contains($dataTable/@class, "infobox vevent")
    return $dataTable } </results>

[ results in the searched table structure ]

executing XQuery:
<results> { for $dataTable in //table
    where contains($dataTable/@class, "infobox vevent")
    return $dataTable } </results>

[ results in nothing ]

The only difference to my eyes is that the first case searches for any element ("//*") for which the "where" conditions are met, and the second searches for any table element ("//table") under the same "where" conditions.

Again, really sorry to bother, but I've been searching resources for about a week, and I still fail to follow. Any help would be much appreciated, and thanks so much for your time!

Jake
Replies (3)
RE: Neophyte question concerning XQuery results. - Added by Anonymous over 16 years ago
Legacy ID: #5145571 Legacy Poster: David Lee (daldei)
My guess is that the HTML you're looking at uses <TABLE>, not <table>. HTML is case-insensitive; XML is case-sensitive.
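For readers who want to see the case-sensitivity point in isolation, here is a minimal sketch. It does not use Saxon or JTidy; it assumes the JDK's built-in XPath 1.0 engine as a stand-in, and the sample markup and the `CaseDemo` class name are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CaseDemo {
    // Hypothetical upper-case markup, not taken from the Wikipedia page.
    static final String SAMPLE =
        "<HTML><BODY><TABLE class=\"infobox vevent\"/></BODY></HTML>";

    // Parses the XML string and counts the nodes selected by the XPath expression.
    static int countMatches(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches(SAMPLE, "//table")); // prints 0
        System.out.println(countMatches(SAMPLE, "//TABLE")); // prints 1
    }
}
```

Element name tests compare names exactly, so a lower-case name test never matches an upper-case element once the document is treated as XML.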
RE: Neophyte question concerning XQuery resul - Added by Anonymous over 16 years ago
Legacy ID: #5145585 Legacy Poster: Michael Kay (mhkay)
Firstly, can I suggest you use the Saxon forum only for questions that are specific to Saxon? General XQuery coding questions are better asked on the talk@x-query.com list.

A quick look at your source document shows that it is XHTML in the namespace xmlns="http://www.w3.org/1999/xhtml". Therefore unqualified element names such as <table> won't match; it has to be x:table where you have declared

declare namespace x="http://www.w3.org/1999/xhtml";

I suspect that in fact you could simplify the query to

declare namespace x="http://www.w3.org/1999/xhtml";
<results> { //x:table[@class = "infobox vevent"] } </results>

Michael Kay
http://www.saxonica.com/
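The namespace behaviour can be reproduced without Saxon or a network connection. The sketch below is not the poster's setup: it assumes the JDK's built-in XPath 1.0 engine in place of Saxon's XQuery, binds the prefix through a `NamespaceContext` instead of `declare namespace`, and uses a small made-up XHTML string in place of the tidied Wikipedia page. The `NamespaceDemo` class and its `countMatches` helper are hypothetical names.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the tidied page: every element is in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<table class=\"infobox vevent\"><tr><td>Defender</td></tr></table>"
        + "</body></html>";

    // Parses the XML with namespace awareness and counts the nodes the expression selects.
    static int countMatches(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "x" to the XHTML namespace, mirroring "declare namespace x=...".
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String namespaceURI) { return null; }
            public Iterator<String> getPrefixes(String namespaceURI) { return null; }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unqualified name: looks for "table" in no namespace, so nothing matches.
        System.out.println(countMatches(SAMPLE, "//table[@class = 'infobox vevent']"));   // prints 0
        // Wildcard: matches elements in any namespace, which is why the first query worked.
        System.out.println(countMatches(SAMPLE, "//*[@class = 'infobox vevent']"));       // prints 1
        // Qualified name: matches the XHTML table.
        System.out.println(countMatches(SAMPLE, "//x:table[@class = 'infobox vevent']")); // prints 1
    }
}
```

The wildcard query succeeds because `*` ignores the namespace, while a bare `table` only matches elements in no namespace.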
RE: Neophyte question concerning XQuery results. - Added by Anonymous over 16 years ago
Legacy ID: #5145647 Legacy Poster: Jake Gage (dispader)
Thank you so much, Michael! It was, in fact, completely a namespace issue. I had also suspected that it was a misuse of XQuery rather than a Saxon issue; but thank you for clearing this up, and so promptly, no less. Jake