Quick processing with DocumentBuilder
Added by Anonymous over 13 years ago
Legacy ID: #10122561 Legacy Poster: AlecF (yvx9959)
I'm using xpath support in Saxon 9.3HE to extract text from blog pages posted by friends and colleagues. The process of creating an XdmNode with s9api DocumentBuilder from an xhtml file always requires 100+ seconds and I am struggling to determine how to build an XdmNode from the original web pages quickly. In contrast, when I experiment with a well formed XML file the build method completes sub-second.

Document validation is not required for my purpose so I explicitly disable it. Also, I have read that Saxon prefers StreamSource and SAXSource. In my experience, though, calling build with either as a Source does not reduce the time requirement. A snippet of my code follows for illustration.

[code]
DocumentBuilder builder = processor.newDocumentBuilder();
builder.setSchemaValidator(null); // disable validation
builder.setDTDValidation(false);

// build from previously fetched xhtml file (10 - 30KB)
// HtmlTidy used to clean up page first and temporarily save to disk
// 100+ seconds required to build an XdmNode
XdmNode doc = builder.build(new File(fileName));

// xpath compile quick and successful
XPathCompiler xPathCompiler = processor.newXPathCompiler();
xPathCompiler.declareNamespace("", "http://www.w3.org/1999/xhtml");
XPathExecutable xPathExecutable = xPathCompiler.compile(xpath);
[/code]

My question is what technique should I use to reduce the time required to build a document? Thank you for your consideration.
Replies (2)
RE: Quick processing with DocumentBuilder - Added by Anonymous over 13 years ago
Legacy ID: #10125723 Legacy Poster: Michael Kay (mhkay)
My first guess would be that the time is being spent fetching the XHTML DTD from a web server. If that's the case, the answer is to redirect the DTD references to a local copy by using a catalog resolver. Note that the DTD will be fetched whether or not you are performing validation.
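For readers who want to see the idea in code before setting up a full catalog, here is a minimal sketch of a hand-written SAX EntityResolver that redirects the XHTML 1.0 public identifiers to local files. The class name LocalDtdResolver and the local directory are assumptions made for illustration, and the sketch presumes you have already downloaded the DTD and entity files into that directory. It is a simplified stand-in for a catalog resolver, not a replacement for one.

```java
import java.io.File;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/**
 * Hypothetical resolver (not part of Saxon): maps the XHTML 1.0 public
 * identifiers to copies of the DTD and entity files in a local directory.
 */
final class LocalDtdResolver implements EntityResolver {

    // Public ID -> file name as published by the W3C for XHTML 1.0.
    private static final Map<String, String> LOCAL_FILES = Map.of(
        "-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd",
        "-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd",
        "-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml1-frameset.dtd",
        "-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent",
        "-//W3C//ENTITIES Symbols for XHTML//EN", "xhtml-symbol.ent",
        "-//W3C//ENTITIES Special for XHTML//EN", "xhtml-special.ent");

    private final File dtdDir; // assumed to hold the downloaded files

    LocalDtdResolver(File dtdDir) {
        this.dtdDir = dtdDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String name = (publicId == null) ? null : LOCAL_FILES.get(publicId);
        if (name == null) {
            // Unknown entity: returning null lets the parser use its default
            // behaviour, which may still go to the network.
            return null;
        }
        InputSource source = new InputSource(new File(dtdDir, name).toURI().toString());
        source.setPublicId(publicId);
        return source;
    }
}
```

The resolver would be registered on the XMLReader with reader.setEntityResolver(new LocalDtdResolver(new File("dtd"))), and that reader passed to Saxon inside a SAXSource. A catalog file does the same mapping declaratively and is usually easier to maintain.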
RE: Quick processing with DocumentBuilder - Added by Anonymous over 13 years ago
Legacy ID: #10325953 Legacy Poster: AlecF (yvx9959)
Thank you for the diagnosis. I used org.apache.xml.resolver.tools.CatalogResolver and my initial parse time dropped to under 2 seconds. As a hint to others, in my case I created a CatalogManager.properties file and a catalog file, and then added a CatalogResolver from the Xerces project in the following manner:

[code]
// the default constructor picks up CatalogManager.properties from the classpath
CatalogResolver resolver = new CatalogResolver();

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(resolver);

InputSource is = new InputSource(new FileReader(fileName));
javax.xml.transform.sax.SAXSource saxSource = new SAXSource(reader, is);

DocumentBuilder builder = processor.newDocumentBuilder();
builder.setSchemaValidator(null);
builder.setDTDValidation(false);
XdmNode doc = builder.build(saxSource);
// continue with xpath handling
[/code]
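The reply above relies on the Apache resolver library. As a hedged alternative for Java 9 and later, the JDK ships its own catalog API in javax.xml.catalog, which was not available when this thread was written. The sketch below is self-contained: it writes a throwaway catalog and a one-line stand-in DTD to a temporary directory (both are illustrative assumptions, a real setup would point at full local copies of the XHTML DTDs), then parses a small XHTML string without touching the network. The class name CatalogDemo is hypothetical.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class CatalogDemo {

    /** Parses a small XHTML document via a local catalog and returns its text content. */
    static String parseWithCatalog() throws Exception {
        Path dir = Files.createTempDirectory("catalog-demo");

        // Stand-in for a local DTD copy: declares only the entity used below.
        Files.writeString(dir.resolve("stub.dtd"), "<!ENTITY nbsp \"&#160;\">\n");

        // OASIS XML catalog mapping the public ID to the local file.
        Files.writeString(dir.resolve("catalog.xml"),
            "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
          + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Strict//EN\" uri=\"stub.dtd\"/>\n"
          + "</catalog>\n");

        CatalogResolver resolver = CatalogManager.catalogResolver(
            CatalogFeatures.defaults(), dir.resolve("catalog.xml").toUri());

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver); // DTD lookups now go through the catalog

        // Collect character data so the caller can see that &nbsp; was expanded.
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
          + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
          + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a&nbsp;b</p></body></html>";
        reader.parse(new InputSource(new StringReader(xhtml)));
        return text.toString();
    }
}
```

In a Saxon program the same reader would be wrapped in a SAXSource and handed to DocumentBuilder.build, as in the snippet above. Note that the default catalog features use strict resolution, so an identifier missing from the catalog raises an exception instead of silently falling back to a network fetch.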