Can I use a Sax ContentHandler as the input to XSLT with s9api or JAXP?

Added by Martin Honnen almost 2 years ago

Recently I used the Apache Tika software to parse RTF into XHTML to process it with XSLT 3.0.

So far, in my code based on https://tika.apache.org/2.7.0/examples.html#Parsing_to_XHTML, I have created the ContentHandler with e.g. ContentHandler handler = new ToXMLContentHandler(); and called its toString() method to feed the XHTML string representation of the RTF to Saxon's s9api DocumentBuilder, which gives me an XdmNode as input for XSLT.

With ContentHandler being some kind of SAX interface I was wondering whether I need the toString() call doing some serialization to a string and then the string parsing by Saxon's DocumentBuilder or whether perhaps s9api or JAXP would allow Saxon to use that ContentHandler as an input. I have looked at the s9api samples and the JAXP samples, but in those the initial parsing seems to be done by an XMLReader before a ContentHandler becomes usable, so I haven't found anything obvious. On the other hand, I never feel I have understood all the features of the SAX API.

Any thoughts on whether Saxon might be able to process a ContentHandler as the input to XSLT?


Replies (3)

Can I use a Sax ContentHandler as the input to XSLT with s9api or JAXP? - Added by Norm Tovey-Walsh almost 2 years ago

With ContentHandler being some kind of SAX interface I was wondering whether I need the
toString() call doing some serialization to a string and then the string parsing by
Saxon's DocumentBuilder or whether perhaps s9api or JAXP would allow Saxon to use that
ContentHandler as an input.

I think I do something like this in my Invisible XML implementation by
getting a BuildingContentHandler from
DocumentBuilder.newBuildingContentHandler(), but it might not be exactly
the same as what you’re trying to do.

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

RE: Can I use a Sax ContentHandler as the input to XSLT with s9api or JAXP? - Added by Michael Kay almost 2 years ago

With JAXP, you should be able to supply a TransformerHandler as the ContentHandler that the Tika parser writes to, where the TransformerHandler is built from the Templates object representing the compiled stylesheet.
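
For illustration, a minimal sketch of this JAXP route using only JDK classes. The stylesheet, the class name SaxToXsltDemo, and the emitSampleEvents method are assumptions made up for this example. emitSampleEvents merely stands in for the Tika parser, which would be handed the same handler as its ContentHandler; the sketch has not been tried against Tika itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxToXsltDemo {

    // Hypothetical stylesheet: copies the text of every XHTML <p> into a <para>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
      + "<xsl:template match='/'>"
      + "<doc><xsl:for-each select='//h:p'>"
      + "<para><xsl:value-of select='.'/></para>"
      + "</xsl:for-each></doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the Tika parser (an assumption for this sketch):
    // any component that writes SAX events to a ContentHandler would do.
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        String ns = "http://www.w3.org/1999/xhtml";
        AttributesImpl noAtts = new AttributesImpl();
        char[] text = "Hello from SAX".toCharArray();
        handler.startDocument();
        handler.startPrefixMapping("", ns);
        handler.startElement(ns, "html", "html", noAtts);
        handler.startElement(ns, "body", "body", noAtts);
        handler.startElement(ns, "p", "p", noAtts);
        handler.characters(text, 0, text.length);
        handler.endElement(ns, "p", "p");
        handler.endElement(ns, "body", "body");
        handler.endElement(ns, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    static String transform() throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        // Not every JAXP factory supports SAX input, so check before casting.
        if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("Factory does not support SAXTransformerFactory");
        }
        SAXTransformerFactory factory = (SAXTransformerFactory) tf;

        // Compile the stylesheet once; the Templates object can be reused.
        Templates templates = factory.newTemplates(new StreamSource(new StringReader(XSLT)));

        // The TransformerHandler is itself a ContentHandler.
        TransformerHandler handler = factory.newTransformerHandler(templates);
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // The result must be set before any SAX events arrive.
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // The transformation result is complete once endDocument() has been received.
        emitSampleEvents(handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform());
    }
}
```

To run an XSLT 3.0 stylesheet this way, the factory would have to be one that supports XSLT 3.0, for example Saxon's own TransformerFactoryImpl instantiated in place of TransformerFactory.newInstance(); the rest of the pattern should stay the same.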

I don't think there's a direct equivalent in s9api.

As Norm says, you can send the Tika output to a ContentHandler obtained using DocumentBuilder.newBuildingContentHandler(), which will build an XdmNode that you can then supply as input to the transformation. That is fine unless you want to use streaming, or things like xsl:strip-space.

I can also think of a slightly convoluted approach: the XsltTransformer is a Destination, so you can get that Destination as a Receiver by calling its getReceiver() method, then set this as the Receiver in a ReceivingContentHandler, and have Tika write to the ReceivingContentHandler. But this seems to involve fiddling about with PipelineConfigurations, so it's not very clean. It would be easy to clean it up by adding an XsltTransformer.asContentHandler() method.

RE: Can I use a Sax ContentHandler as the input to XSLT with s9api or JAXP? - Added by Martin Honnen almost 2 years ago

Thanks a lot. Ditching Tika's ToXMLContentHandler and passing Saxon's BuildingContentHandler directly as the handler to the Tika parser works fine (and even avoids a hiccup/bug that ToXMLContentHandler seems to have with RTF and hyperlinks).
