working SAX parser to sub for Java 6 default?
Added by Anonymous almost 17 years ago
Legacy ID: #4714865 Legacy Poster: Chapman Flack (jcflack)
Hmm, I've discovered using Saxon 9 on Java 6 that I can't parse valid HTML 4.01 files. This doesn't seem to be a problem in Saxon but rather in com.sun.org.apache.xerces.internal.parsers.XMLParser.

Namely, if an HTML 4.01 file has a DOCTYPE declaration with no systemId (which validator.w3.org allows), the parser complains that "White spaces are required between publicId and systemId." This I suppose I could address with a catalog implementation, but I'll come back to that point, because there's more: if I DO supply the systemId http://www.w3.org/TR/html4/loose.dtd, then the parse fails with:

The declaration for the entity "HTML.Version" must end with '>'

That seems to be because the official DTD has a comment inside that entity declaration, and ...xerces.internal.impl.XMLDTDScannerImpl apparently can't handle that. (What, they never thought to test on the HTML 4 DTD??)

So it seems I'm not asking about a Saxon problem, but I'm looking for a good workaround for Saxon purposes. Is there a good working parser I can drop in, in place of the Java 6 default, that will not fail because a DTD contains comments? Or, are there some options I can easily turn off to suppress processing of the DTD entirely?

Or, coming back to the catalog idea, I suppose worst case I could set up a catalog referring to a local copy of the DTD from which the comments had been stripped. I found an online howto for supporting catalogs in Saxon: http://www.cafeconleche.org/books/effectivexml/chapters/47.html but I got the impression it was describing an older Saxon version. Could someone steer me to a current description of how that ought to be done?

Thanks, Chapman Flack
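On the question of turning off DTD processing: the following is a minimal sketch, not a tested recipe for Saxon itself. It assumes the parser is Xerces or the JDK's bundled Xerces fork, which recognize the `load-external-dtd` feature; the class name `SkipExternalDtd` and its helper methods are made up for illustration. It also installs an `EntityResolver` that answers every external entity with an empty stream, as a fallback for parsers that ignore the feature.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named here only for illustration.
class SkipExternalDtd {
    // Xerces-specific feature id (assumption: the parser in use is Xerces-based).
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Builds a non-validating reader that does not fetch the external DTD subset.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Fallback: resolve any external entity to an empty document.
        reader.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });
        return reader;
    }

    // Returns true if the text parses as well-formed XML, false on a parse error.
    static boolean parses(String xml) throws Exception {
        XMLReader reader = newReader();
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                   + "\"http://www.w3.org/TR/html4/loose.dtd\"><html><body/></html>";
        System.out.println(parses(doc));
    }
}
```

Note that this only skips loading the DTD. A DOCTYPE with a public identifier but no system identifier is still a well-formedness error in XML, and the document body must still be well-formed, so this does not make real HTML 4.01 parseable.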
Replies (4)
RE: working SAX parser to sub for Java 6 defa - Added by Anonymous almost 17 years ago
Legacy ID: #4716206 Legacy Poster: Chapman Flack (jcflack)
Ok, I was getting roughly what I deserved for trying to slurp in HTML 4 with an XML parser; sorry for the bandwidth. If I had found that the document seemed ill-formed (because of unclosed HTML 4 tags and such) it wouldn't have taken me so long to spot the obvious, but what threw me was the error parsing the DTD. Turns out that served me right as well: comments in an entity declaration get ruled out in XML's subset of SGML DTD syntax. You all knew that and now I do too. -Chap
RE: working SAX parser to sub for Java 6 defa - Added by Anonymous almost 17 years ago
Legacy ID: #4716231 Legacy Poster: David Lee (daldei)
I recommend an HTML-to-XML converter as a pre-processor ... I've used this technique before to convert "bad" HTML to an XML DOM tree, which can then be passed to Saxon or other XML libraries. The one I've used is NekoHTML: http://sourceforge.net/projects/nekohtml -David Lee
RE: working SAX parser to sub for Java 6 defa - Added by Anonymous almost 17 years ago
Legacy ID: #4716662 Legacy Poster: Michael Kay (mhkay)
As you discovered, an XML parser will not parse HTML because HTML is not XML. John Cowan's TagSoup parser implements the SAX interface and does a good job of converting HTML to XML on the fly.
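Because TagSoup exposes the standard SAX `XMLReader` interface, it can in principle be handed to any JAXP transformer through a `SAXSource`. The sketch below shows that wiring under some assumptions: the class `PluggableReader` and its methods are hypothetical names, the TagSoup jar is assumed to be on the classpath when its class name `org.ccil.cowan.tagsoup.Parser` is passed in, and the identity transform stands in for whatever stylesheet Saxon would actually run. Passing `null` falls back to the JDK's default parser, so the example runs without TagSoup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, named here only for illustration.
class PluggableReader {
    // Instantiates the named XMLReader implementation, or the JDK default if null.
    static XMLReader createReader(String className) throws Exception {
        if (className == null) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        }
        return (XMLReader) Class.forName(className)
                                .getDeclaredConstructor().newInstance();
    }

    // Runs an identity transform over the markup, reading it with the given reader.
    static String toXml(XMLReader reader, String markup) throws Exception {
        SAXSource source =
            new SAXSource(reader, new InputSource(new StringReader(markup)));
        // With Saxon on the classpath, the factory may resolve to Saxon's implementation.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass "org.ccil.cowan.tagsoup.Parser" as the first argument to try TagSoup.
        String className = args.length > 0 ? args[0] : null;
        System.out.println(toXml(createReader(className), "<p>hi</p>"));
    }
}
```

With the default parser the input must already be well-formed XML; the point of swapping in an HTML-aware reader is that the rest of the pipeline does not change.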
RE: working SAX parser to sub for Java 6 defa - Added by Anonymous almost 17 years ago
Legacy ID: #4720367 Legacy Poster: Chapman Flack (jcflack)
TagSoup and NekoHTML both sound like things worth looking into in case I do want to parse HTML at some point. (Actually what happened last week was just that I was trying to collect demonstration ideas for cool things to do with Saxon, and I had used it successfully earlier to screen-scrape a website that was done in xhtml - and at that time I'd been perfectly aware that was the reason it worked. But last week without thinking I tried a different site, and the surprise that the parsing failed in DTD processing was enough to distract me from the obvious fact that the site was (!x)html. Fortunate that I didn't do any of this DURING a demonstration! :) Thanks for the TagSoup and Neko tips. -Chap