Possible to enable Xerces grammar pool using just configuration options from command line?

Added by Eliot Kimber over 1 year ago

I'm running Saxon from the command line to do ad-hoc processing of DITA files, where a single transform instance might process several thousand topics in one go.

The process is time consuming (can take 10 minutes to parse 8000 topics), which is almost certainly because Saxon does not use the Xerces grammar cache by default (once parsed, the XSLT processing is as fast as you would expect).

I know how to use Java to set use of the Xerces grammar pool by constructing a reader that configures it yada yada but this application doesn't really justify that level of effort--it's a one time migration tool that, once used, will not be used again, at least not at the same scale.

I'm using Saxon-HE 10.8J at the moment but could upgrade to latest.

My question: is it possible to configure use of the Xerces grammar cache using only command-line configuration?

Thanks,

Eliot


Replies (5)

RE: Possible to enable Xerces grammar pool using just configuration options from command line? - Added by Michael Kay over 1 year ago

As a matter of interest, is this DTD or XSD validation?

I can't advise you on what options in Xerces you need to set - looking at https://xerces.apache.org/xerces2-j/faq-grammars.html#caching-w-standards it looks quite complicated.

You can set XML parser features and properties as if they were Saxon configuration properties using the property names "http://saxon.sf.net/feature/parserFeature?uri=..." and "http://saxon.sf.net/feature/parserProperty?uri=". From the command line that would be --parserFeature?uri=....:xxx and --parserProperty?uri=....:xxx, but I guess colons in the URIs will probably mess things up. The next thing to try would be to put this in a Saxon configuration file.

But to be honest, I think it will be less hassle to knock up a 10-line Java application that initializes a Xerces parser directly and then executes the transformation using s9api interfaces, e.g. from a SAXSource whose XMLReader you have supplied.
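A minimal sketch of that approach, under several assumptions: the class and method names (CachingReaderSketch, newReader, sourceFor) are made up for illustration, and Xerces' sample XMLGrammarCachingConfiguration is assumed to be an acceptable way to get a shared grammar pool. The Xerces classes are loaded reflectively only so that the sketch compiles, and falls back to the JDK's default parser, when xercesImpl.jar is not on the classpath. With Xerces as a compile-time dependency the reflective lines reduce to new SAXParser(new XMLGrammarCachingConfiguration()).

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Illustrative helper; the names here are hypothetical. */
public class CachingReaderSketch {

    /**
     * Returns an XMLReader backed by Xerces' grammar-caching configuration when
     * Apache Xerces is on the classpath, otherwise the JDK's default reader.
     */
    public static XMLReader newReader() throws Exception {
        try {
            // Assumption: this sample configuration shares one grammar pool
            // and symbol table among all parsers created in the JVM.
            Class<?> configType =
                    Class.forName("org.apache.xerces.xni.parser.XMLParserConfiguration");
            Object cachingConfig =
                    Class.forName("org.apache.xerces.parsers.XMLGrammarCachingConfiguration")
                            .getConstructor()
                            .newInstance();
            return (XMLReader) Class.forName("org.apache.xerces.parsers.SAXParser")
                    .getConstructor(configType)
                    .newInstance(cachingConfig);
        } catch (ClassNotFoundException e) {
            // Xerces is not available: no grammar caching, but parsing still works.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        }
    }

    /** Wraps one document in a SAXSource that will be parsed by the supplied reader. */
    public static SAXSource sourceFor(XMLReader reader, String systemId) {
        return new SAXSource(reader, new InputSource(systemId));
    }
}
```

The resulting SAXSource can then be handed to s9api, for example to Xslt30Transformer.transform(Source, Destination). Whether DTD grammars are really reused across parses is worth confirming by timing a run on a few hundred topics.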

RE: Possible to enable Xerces grammar pool using just configuration options from command line? - Added by Eliot Kimber over 1 year ago

Yes, these are DTD-based documents using the out-of-the-box DITA DTDs, which are quite large.

I was pretty sure this would be your answer--I may still pursue it, or just eat the time cost as this is a one-off migration process.

For context, it takes just over two hours on my newish MacBook to process the 35K topics that make up the ServiceNow Platform documentation. This is consistent with how long it takes BaseX to load the same content when using DTD-aware parsing, compared with about 2 minutes with DTD parsing turned off.

If it was just me I would have done the Java work already, but as part of a team, if I write the Java code, then I have to document it, maintain it over time, etc., so the cost is not just the 10-line coding effort.

I was hoping somebody had already done this but I couldn't find anything in the various archives I searched, probably because, as you say, writing the Java code is not that hard.

Thanks,

Eliot

RE: Possible to enable Xerces grammar pool using just configuration options from command line? - Added by Eliot Kimber over 1 year ago

Open Toolkit turned out to be a dead end because of the way it configures its use of parsers and grammar pools--would require too much refactoring to make it work with standalone transforms.

But I have succeeded (I think) in setting up a custom SAXParserFactory class that I can then specify with the -x flag. The only unexpected wrinkle was that use of -x doesn't allow use of -catalog (which makes sense, I guess), so I had to add CatalogResolver configuration as well.
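To show the general shape of such a factory, here is a hedged sketch. It is not the code from the project mentioned in this thread. The class name CatalogAwareParserFactory, the semicolon-separated format of the xmlcatalogs system property, and the use of the JDK's javax.xml.catalog API (Java 9+) as the resolver are all assumptions made for illustration. It covers only the factory and catalog wiring, and delegates to the JDK's built-in parser; the grammar-pool setup needs Apache Xerces classes and is marked by a comment.

```java
import java.io.File;
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

/** Hypothetical factory; intended to be instantiated through its no-arg constructor. */
public class CatalogAwareParserFactory extends SAXParserFactory {

    // Assumption: catalog files are listed in this property, separated by semicolons.
    public static final String CATALOGS_PROPERTY = "xmlcatalogs";

    // JDK built-in factory. A Xerces factory configured with a shared
    // grammar pool would be substituted here.
    private final SAXParserFactory delegate = SAXParserFactory.newDefaultInstance();

    public CatalogAwareParserFactory() {
        delegate.setNamespaceAware(true);
    }

    @Override
    public SAXParser newSAXParser() throws ParserConfigurationException, SAXException {
        SAXParser parser = delegate.newSAXParser();
        EntityResolver resolver = catalogResolver(System.getProperty(CATALOGS_PROPERTY));
        if (resolver != null) {
            // Resolve DTD public/system identifiers through the catalogs.
            parser.getXMLReader().setEntityResolver(resolver);
        }
        return parser;
    }

    /** Builds a resolver from a semicolon-separated list of catalog files, or returns null. */
    public static EntityResolver catalogResolver(String catalogList) {
        if (catalogList == null || catalogList.trim().isEmpty()) {
            return null;
        }
        String[] paths = catalogList.split(";");
        URI[] uris = new URI[paths.length];
        for (int i = 0; i < paths.length; i++) {
            uris[i] = new File(paths[i].trim()).toURI();
        }
        // "continue" lets unmatched identifiers fall through to normal resolution.
        CatalogFeatures features = CatalogFeatures.builder()
                .with(CatalogFeatures.Feature.RESOLVE, "continue")
                .build();
        return CatalogManager.catalogResolver(features, uris);
    }

    // Forward configuration calls so that settings made on this factory reach the delegate.
    @Override
    public void setNamespaceAware(boolean awareness) {
        delegate.setNamespaceAware(awareness);
    }

    @Override
    public boolean isNamespaceAware() {
        return delegate.isNamespaceAware();
    }

    @Override
    public void setValidating(boolean validating) {
        delegate.setValidating(validating);
    }

    @Override
    public boolean isValidating() {
        return delegate.isValidating();
    }

    @Override
    public void setFeature(String name, boolean value)
            throws ParserConfigurationException, SAXNotRecognizedException, SAXNotSupportedException {
        delegate.setFeature(name, value);
    }

    @Override
    public boolean getFeature(String name)
            throws ParserConfigurationException, SAXNotRecognizedException, SAXNotSupportedException {
        return delegate.getFeature(name);
    }
}
```

With a class like this on the classpath, its fully qualified name is what would be passed to the -x flag, and the catalog list would be supplied as a -D system property.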

Using this parser factory I can now parse 8000 topics in just a couple of minutes.

This approach does seem to require more memory: previously the transform handled the largest content with the default JVM memory on my 16GB Mac, but with this approach it requires at least 6GB.

I assume that's a function of the change in parser, but with the memory provided, a processing run that took almost 2 hours now takes 2 minutes.

RE: Possible to enable Xerces grammar pool using just configuration options from command line? - Added by Michael Kay over 1 year ago

Good you've made progress. It's often surprising what a big difference a small tweak can make, once you've identified where the problem is.

RE: Possible to enable Xerces grammar pool using just configuration options from command line? - Added by Eliot Kimber over 1 year ago

I've created the project grammarpool-parser-factory in the DITA Community organization on GitHub: https://github.com/dita-community/grammarpool-parser-factory

It's a simple command-line action to use, requiring that you specify the -x flag and replace -catalog with -Dxmlcatalogs.
