Image tags with base64

Added by Kean Erickson almost 4 years ago

Hello, I'm trying to work around a problem using Saxon-EE 9.7.0-2.

When a page has a large image (several megabytes) embedded using base64 encoding, running SAXParser.parse on the file contents can take quite a long time. I'm wondering if there's any way to have Saxon just ignore these tags. I've been trying to provide a DefaultHandler subclass to change the startElement behavior for image tags, but it seems that, since the overridden methods do nothing by default, I can only add to the existing activity rather than prevent anything in particular from being processed. Is this the case?

I'm wondering how I can make SAXParser ignore these tags.

Thanks!

Replies (2)

RE: Image tags with base64 - Added by Michael Kay almost 4 years ago

Obviously the XML parser, at the very minimum, needs to scan the text looking for the end tag. Potentially it would be possible, by digging deep into the parser code, to avoid the costs of decoding the contents of the text node and building it as a string in memory, but that would require changes to the internals of the XML parser.
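
To illustrate the point about where the cost sits, here is a minimal sketch of a SAX filter that drops everything inside a chosen element before it reaches the downstream handler. It uses only the standard org.xml.sax.helpers.XMLFilterImpl class from the JDK. The class name DropImageText, the helper textSeenDownstream, and the element name "image" are assumptions made for this example, not anything from Saxon. Note that the parser still scans and decodes the skipped text; the filter only saves the work that later stages would otherwise do with it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: suppresses an element (and its whole subtree) by name.
public class DropImageText extends XMLFilterImpl {
    private final String skipName; // assumed element name, e.g. "image"
    private int depth = 0;         // > 0 while inside a suppressed element

    public DropImageText(XMLReader parent, String skipName) {
        super(parent);
        this.skipName = skipName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || skipName.equals(qName)) {
            depth++; // entering, or nesting deeper into, a suppressed subtree
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--; // leaving one level of the suppressed subtree
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // The parser has already scanned this text; we only stop it going further.
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }

    // Parses the XML through the filter and returns the text the downstream handler saw.
    static String textSeenDownstream(String xml, String skipName) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DropImageText filter = new DropImageText(parser, skipName);
        StringBuilder seen = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<page><p>before</p><image>iVBORw0KGgo=</image><p>after</p></page>";
        System.out.println(textSeenDownstream(xml, "image")); // prints "beforeafter"
    }
}
```

The same filter could in principle be placed in front of whatever consumes the SAX events, but it cannot reduce the time the parser itself spends reaching the end tag.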

It might be that you would get better results with a pull parser such as Woodstox; if nothing else, this gives you the ability at the API level to skip nodes, and it's conceivable that the parser might avoid some of the decoding/string assembly costs if you never ask for the content. You could adapt the code in Saxon's PullPushTee class (which reads from a pull parser and copies events to a push pipeline) to skip selected text nodes.
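
As a rough sketch of what skipping at the API level could look like, the following uses the standard StAX interfaces in javax.xml.stream, which Woodstox implements. It is not an adaptation of PullPushTee; it only shows the skipping loop. The class name SkipImageContent, its methods, and the element name "image" are assumptions for illustration. Whether a given parser actually avoids decoding or string assembly when getText() is never called depends on the implementation and would need to be measured.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of Saxon or Woodstox.
public class SkipImageContent {

    // Advances past the current element's subtree without ever asking for its text.
    // Precondition: the reader is positioned on a START_ELEMENT event.
    static void skipSubtree(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the text found outside elements named skipName (assumed name, e.g. "image").
    static String textOutside(String xml, String skipName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && skipName.equals(reader.getLocalName())) {
                    skipSubtree(reader); // getText() is never called for this content
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    out.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<page><p>before</p><image>iVBORw0KGgo=</image><p>after</p></page>";
        System.out.println(textOutside(xml, "image")); // prints "beforeafter"
    }
}
```

In a real pipeline the else branch would forward events to the push side rather than collect text, which is the part PullPushTee handles.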

Woodstox is at https://github.com/FasterXML/woodstox and I'm pleased to see that Tatu Saloranta (aka cowtowncoder) seems to be actively developing it still. It may be worth discussing your requirements with him.

RE: Image tags with base64 - Added by Kean Erickson almost 4 years ago

Thanks so much Michael, great suggestions as always.
