Adding Saxon HE to Java classpath makes XML parsing very slow

Added by Markus Karg 8 months ago

I have a weird problem and hope somebody over here has some brilliant idea: Saxon HE makes "pure" XML parsing very slow!

In one project I use both "pure" XML parsing and, for just a few tasks, XSLT transformation. For now, ignore the fact that we do XSLT processing for some tasks, and let's concentrate on the "pure" XML parsing, which is needed in many more code locations.

When Saxon HE is not on the classpath, the "pure" XML parsing is pretty fast (finished in less than two seconds). We found out that in this case the Xalan TransformerFactory is used.

When Saxon HE is on the classpath (no other code changes!), the "pure" XML parsing is very slow (finished in about one minute on Java 11, and in about fifteen seconds on Java 17). We found out that in this case a Saxon-specific TransformerFactory is used.

We need super-fast "pure" XML processing, but we also need Saxon for modern XSLT transformation in some code locations, so we need a way to have Saxon HE on the classpath, but still get super-fast XML processing. We have no clue how to do that.

What can we do? Help please! :-)

Thanks! -Markus


Replies (17)

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Christophe Marchand 8 months ago

Could you provide your "pure XML" parsing code ?

Christophe

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Michael Kay 8 months ago

I'm afraid that with performance problems the devil is always in the detail, and it's very rarely possible to work out exactly what's wrong from such a general description.

Your post is confusing because you talk about doing "pure XML parsing" using Xalan, but Xalan is an XSLT transformation engine, not an XML parser. It's also often used for "pure serialization", that is to do an identity transformation followed by serialization.

If Saxon is on the classpath, then creating a TransformerFactory using the default factory API will give you a Saxon TransformerFactory. It's easy enough to get a Xalan TransformerFactory instead if you're able to change the Java code to ask for it; failing that, it can probably be achieved by setting the JAXP system properties (but I forget the detail). Setting the jaxp.debug system property to "1" can help you see what's going on.

You could tackle the problem that way, or you could try to work out what the actual problem is when Saxon is invoked. If Saxon is taking one minute to do something very simple, then there's a good chance this is caused by an HTTP request to the W3C web server to fetch some standard W3C resource such as a DTD (W3C deliberately throttles such requests). Using some HTTP network monitoring tool would confirm whether this is the problem. Saxon should resolve all the common requests using its own local copies (up to Saxon 10) or the copies held by the XmlResolver (thereafter). If that's not happening, we can investigate why - but first it needs some drilling down to see whether that's the actual issue.

I'm afraid the JAXP design, where putting something on the classpath fundamentally affects how your application behaves, is hopelessly misconceived, but there's nothing we can do about it. My advice would be to avoid using the general TransformerFactory.newInstance() method and make your application always request a specific implementation.
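A minimal sketch of what "request a specific implementation" can look like, using the standard two-argument TransformerFactory.newInstance(String, ClassLoader). The class names are the ones quoted elsewhere in this thread and are assumptions about the runtime: the JDK-internal Xalan name applies to OpenJDK-based JREs, and the Saxon name only resolves when Saxon is actually on the classpath. The helper class ExplicitFactory is hypothetical.

```java
import javax.xml.transform.TransformerFactory;

public class ExplicitFactory {
    // Class names as quoted in this thread; treat them as assumptions about your runtime.
    static final String JDK_XALAN =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";
    static final String SAXON = "net.sf.saxon.TransformerFactoryImpl";

    /** Requests one named implementation instead of relying on the classpath-driven lookup. */
    static TransformerFactory create(String factoryClassName) {
        // A null ClassLoader means "use the current thread's context class loader".
        return TransformerFactory.newInstance(factoryClassName, null);
    }

    public static void main(String[] args) {
        // Code that needs the JDK's built-in XSLT 1.0 engine asks for it by name ...
        TransformerFactory jdkFactory = create(JDK_XALAN);
        System.out.println(jdkFactory.getClass().getName());
        // ... and code that needs Saxon would call create(SAXON) when Saxon is on the classpath.
    }
}
```

This only helps for code you control; a library that calls TransformerFactory.newInstance() internally still goes through the default lookup.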

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 8 months ago

Christophe,

thank you for picking up this issue.

I discussed this with my team and they told me that the original description was slightly incorrect. What they meant by "pure XML parsing" is not what you and I actually understand by that term. In fact, they have just stripped down their reproducer and confirmed that "pure XML parsing" is fine with and without Saxon, but the delay is noticed in JAXB (hence Node-to-Object matching) use cases. So the question is: Why is JAXB fast with Xalan but slow with Saxon?

They are preparing a mini reproducer to demonstrate their actual code. I will post it once I have received it.

Sorry for the confusion; my team was not communicating precisely enough.

Thanks -Markus

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 8 months ago

Christophe,

so here is what my team did:

    JAXBContext jaxbContext = JAXBContext.newInstance(...long list of classes...);
    
    var unmarshaller = jaxbContext.createUnmarshaller();
    var file = Files.newInputStream(Paths.get("C:\\SomeDir\\Some.xml"));
    
    unmarshaller.unmarshal(file);

As already confirmed, the "pure XML parsing" works fine, just within a second. The problem only occurs when "unmarshalling".

-Markus

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Michael Kay 8 months ago

Well, Saxon doesn't do JAXB processing, so it's something lower-level than that. Setting jaxp.debug="1" sounds like a useful first step.
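As a rough illustration of that first step, the sketch below turns on jaxp.debug and prints which concrete class the default lookup picks for each JAXP factory. The class JaxpProbe and its method name are hypothetical; the factory calls are the standard JAXP ones. Running it once with and once without Saxon on the classpath should show which lookups change.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class JaxpProbe {
    /** Returns the concrete class chosen for each JAXP factory on the current classpath. */
    static Map<String, String> resolvedFactories() {
        Map<String, String> result = new LinkedHashMap<>();
        result.put("TransformerFactory",
                TransformerFactory.newInstance().getClass().getName());
        result.put("DocumentBuilderFactory",
                DocumentBuilderFactory.newInstance().getClass().getName());
        result.put("SAXParserFactory",
                SAXParserFactory.newInstance().getClass().getName());
        return result;
    }

    public static void main(String[] args) {
        // Set before the first JAXP lookup; -Djaxp.debug=1 on the command line is equivalent.
        System.setProperty("jaxp.debug", "1");
        resolvedFactories().forEach((api, impl) -> System.out.println(api + " -> " + impl));
    }
}
```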

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Christophe Marchand 8 months ago

@Markus, what is your version of Java ? And if > 11, which implementation of JAXB do you use ?

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Christophe Marchand 8 months ago

And Mike is probably correct on the low-level API, and requests to W3C site.

Could you provide your XML, or part of it, mainly the grammar declarations (DTD, schemas and so on)?

Christophe

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 8 months ago

Sorry for the delay. I will answer all open questions as soon as my team is back from PTO.

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 8 months ago

Thank you for your patience. I have talked with my team and here are the definitive answers:

  • The JAXB application runs on Java SE 11. It is fast (needs a few seconds) to parse XML into a Java object using JAXB. The JAXB implementation used is com.sun.xml.bind:jaxb-impl:jar:2.3.3.
  • As soon as we add the Saxon HE JAR to the classpath (without any changes in the code), the same JAXB application needs several minutes to parse the exact same XML.
  • The parsed XML is attached.
  • -Djaxp.debug=1 produces just one difference between the "fast" case (without Saxon HE on the classpath) and the "slow" case (with Saxon HE on the classpath). With Saxon HE on the classpath, the following lines appear in the log; without it, they are missing:

    JAXP: find factoryId =javax.xml.parsers.DocumentBuilderFactory
    JAXP: loaded from fallback value: com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
    JAXP: created new instance of class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl using ClassLoader: null

  • Wireshark proves that no downloads are attempted at runtime.
  • Setting an explicit TransformerFactory did not improve the situation.

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 8 months ago

This is the XML parsed with JAXB.

Attachment: Test.xml (310 KB)

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Michael Kay 8 months ago

The odd thing there is that the jaxp.debug trace shows no evidence of any attempt to load net.sf.saxon.TransformerFactoryImpl, or indeed any other Saxon code. In fact, there is no attempt to load ANY TransformerFactory.

It's true that IF the Saxon IdentityTransformer gets invoked to write to a DOMResult, we will use JAXP to instantiate a DocumentBuilderFactory, and this will produce the above JAXP debug output. But that doesn't account for the inordinate length of time it is taking.

And if the JAXP debug output is different when Saxon is on the classpath, that strongly suggests Saxon is being loaded, but if there's no JAXP debug output, then it is being loaded using a mechanism other than JAXP. Perhaps a class loading trace would be informative.
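This is not the class loading trace itself, but a related check that can be run from inside the application: a sketch that asks java.util.ServiceLoader which TransformerFactory providers are registered on the classpath and which JAR each one comes from, without instantiating any of them. The class ProviderListing is hypothetical, and ServiceLoader.stream() requires Java 9 or later.

```java
import javax.xml.transform.TransformerFactory;
import java.security.CodeSource;
import java.util.List;
import java.util.ServiceLoader;
import java.util.stream.Collectors;

public class ProviderListing {
    /** Lists TransformerFactory providers registered as services, without instantiating them. */
    static List<String> registeredProviders() {
        return ServiceLoader.load(TransformerFactory.class).stream()
                .map(provider -> {
                    Class<?> type = provider.type();
                    CodeSource source = type.getProtectionDomain().getCodeSource();
                    // Classes from the JDK runtime itself may have no code source location.
                    String origin = (source == null || source.getLocation() == null)
                            ? "(no code source)"
                            : source.getLocation().toString();
                    return type.getName() + " from " + origin;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> providers = registeredProviders();
        if (providers.isEmpty()) {
            System.out.println("No registered providers; the JDK fallback would be used.");
        }
        providers.forEach(System.out::println);
    }
}
```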

Having eliminated network access, do you know if the process is CPU-bound? My next step would be to look at it with a CPU profiler (I tend to use JProfiler) to see if that sheds any light on where the time is being spent.

The XML itself is clearly innocuous; it has no DTD or external entity references or schema references, and if processed standalone using Saxon XQuery it parses in 12ms.

I also tried a unit test that uses the Saxon identity transformer to copy this XML document to a DOMResult constructed by calling DocumentBuilderFactory.newInstance(). There's no hint of performance problems; but I did get slightly different jaxp.debug output. The output was:

    JAXP: find factoryId =javax.xml.parsers.DocumentBuilderFactory
    JAXP: find factoryId =javax.xml.parsers.SAXParserFactory

with no messages indicating what class was actually found. My own tracing indicates that it used org.apache.xerces.jaxp.DocumentBuilderFactoryImpl which is the Apache version of Xerces rather than the JDK internal version. Getting Apache Xerces off my classpath involves major fiddling about with my build environment, and seems unlikely to be helpful.
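For readers who want to repeat a test of this kind, here is a minimal sketch (not the actual unit test, which was not posted). It copies a small XML string to a DOMResult through an identity transformer obtained from the default lookup, so it uses Saxon when Saxon is on the classpath and the JDK's built-in transformer otherwise. The class IdentityToDom and the inline sample document are hypothetical.

```java
import org.w3c.dom.Document;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;

public class IdentityToDom {
    /** Copies the given XML text into a fresh DOM Document via an identity transformation. */
    static Document copyToDom(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document target = dbf.newDocumentBuilder().newDocument();
            // newTransformer() without a stylesheet yields an identity transformer.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(xml)), new DOMResult(target));
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("Identity copy failed", e);
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Document doc = copyToDom("<root><item>1</item></root>");
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(doc.getDocumentElement().getNodeName() + " copied in " + millis + " ms");
    }
}
```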

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 7 months ago

Unfortunately we cannot switch over to the latest JAXB quickly, due to the Jakarta namespace change. So instead we have investigated the problem further in JAXB 2.x. As a first step, we excluded all of Saxon HE's dependencies to prevent it from starting. As our test code is not making active use of Saxon, this should be a no-op. But indeed, the surprising result is the proof that JAXB internally makes use of Saxon (as soon as it is found on the classpath):

Exception in thread "main" javax.xml.transform.TransformerFactoryConfigurationError: Provider for class javax.xml.transform.TransformerFactory cannot be created
        at java.xml/javax.xml.transform.FactoryFinder.findServiceProvider(FactoryFinder.java:295)
        at java.xml/javax.xml.transform.FactoryFinder.find(FactoryFinder.java:248)
        at java.xml/javax.xml.transform.TransformerFactory.newInstance(TransformerFactory.java:86)
        at com.sun.xml.bind.v2.util.XmlFactory.createTransformerFactory(XmlFactory.java:150)
        at com.sun.xml.bind.v2.runtime.JAXBContextImpl.createTransformerHandler(JAXBContextImpl.java:717)
        at com.sun.xml.bind.v2.runtime.unmarshaller.DomLoader$State.<init>(DomLoader.java:45)
        at com.sun.xml.bind.v2.runtime.unmarshaller.DomLoader.startElement(DomLoader.java:88)
        at com.sun.xml.bind.v2.runtime.unmarshaller.ProxyLoader.startElement(ProxyLoader.java:30)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:545)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:524)
        at com.sun.xml.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:137)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:518)

So at least we now know why simply putting Saxon on the Classpath has an effect at all.

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 7 months ago

NB: A quick test confirmed that explicitly setting the original, internal transformer factory is an effective workaround for the encountered delay. Using this trick, the software (having Saxon on the classpath) performs the JAXB unmarshalling as quickly as without Saxon on the classpath. Nevertheless, we should further analyze why Saxon makes JAXB so slow, as this trick might break the WORA principle as soon as a JRE is published containing a different factory.

    System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");
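One possible way to soften the WORA concern, sketched under the assumption that Java 9 or later is available: instead of hard-coding the internal class name, ask the JRE for its built-in factory via TransformerFactory.newDefaultInstance() and pin the system property to whatever class that returns. The class PinJdkTransformer and its method name are hypothetical.

```java
import javax.xml.transform.TransformerFactory;

public class PinJdkTransformer {
    static final String PROPERTY = "javax.xml.transform.TransformerFactory";

    /** Pins the JAXP lookup to whatever this JRE ships as its built-in TransformerFactory. */
    static String pinToJdkDefault() {
        // newDefaultInstance() (Java 9+) ignores the classpath and returns the system default.
        String jdkDefault = TransformerFactory.newDefaultInstance().getClass().getName();
        System.setProperty(PROPERTY, jdkDefault);
        return jdkDefault;
    }

    public static void main(String[] args) {
        String pinned = pinToJdkDefault();
        // Later default lookups, including those made inside libraries, consult this property.
        String resolved = TransformerFactory.newInstance().getClass().getName();
        System.out.println("pinned=" + pinned + ", resolved=" + resolved);
    }
}
```

The property is JVM-wide, so code that really wants Saxon then has to request it by class name (for example net.sf.saxon.TransformerFactoryImpl) rather than through the default lookup.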

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 6 months ago

I have some more information for you, hoping you can see something that can be improved in Saxon itself: It seems the problem is that JAXB is calling TransformerFactory.newInstance() not just once (as one would expect) but several times. This is not much of a problem with the JDK's bundled implementation (which only needs a few ms to return a new instance), so I think this is why the JAXB team did not care so far (BTW, even the latest JAXB implementation does that, IIUC its current source code on GitHub). Using Saxon it becomes a problem, as I have benchmarked that calling TransformerFactory.newInstance() one hundred times consumes approx. five seconds on Saxon 12.3 but only 250 ms using the JDK's bundled implementation. Or in other words: Saxon's bootstrap is 20 times slower. I have no knowledge of Saxon's internals, but I do assume that 20 times is a factor that could be improved?
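The benchmark code itself was not posted, so the following is only a naive sketch of how such a measurement could be taken (no warm-up, no JMH, so the numbers are indicative at best). It times repeated calls to the default lookup and, for comparison, to the JDK's built-in factory via newDefaultInstance() (Java 9+). The class FactoryBootstrapTiming and its helper are hypothetical.

```java
import javax.xml.transform.TransformerFactory;
import java.util.function.Supplier;

public class FactoryBootstrapTiming {
    /** Returns elapsed milliseconds for the given number of calls to the factory supplier. */
    static long timeMillis(int calls, Supplier<TransformerFactory> supplier) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            supplier.get();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int calls = 100;
        // Default lookup: resolves to Saxon when Saxon is on the classpath.
        System.out.println("newInstance():        "
                + timeMillis(calls, TransformerFactory::newInstance) + " ms");
        // JDK built-in factory, bypassing the classpath lookup.
        System.out.println("newDefaultInstance(): "
                + timeMillis(calls, TransformerFactory::newDefaultInstance) + " ms");
    }
}
```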

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Michael Kay 6 months ago

Unfortunately I don't think the performance of TransformerFactory.newInstance() is something we can do much about. It's not a Saxon method, after all, it's a JDK method. I believe the cost is largely that the method searches the classpath, opening each JAR file and looking for one with a suitable META-INF/services entry.

Once Saxon is found, the instantiation creates a new s9api Processor and underlying Configuration. It's true that this may execute a fair bit of initialisation code, which is intended to be one-time-only code, so an application that does this repeatedly is certainly behaving inefficiently. We could consider deferring this cost until first use of the TransformerFactory; but I think it's unlikely that anyone creates a TransformerFactory and never uses it, so we would only be moving the cost.

Generally I think we have implemented TransformerFactory the way it is designed to be implemented, and if someone is misusing it, that's not really our problem.
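To illustrate the "create once, reuse" usage that the API is designed for, here is a minimal sketch of a caller that pays the lookup and initialisation cost only once per thread. It is a hypothetical holder class, not JAXB or Saxon code; it uses one factory per thread because JAXP does not guarantee that a TransformerFactory implementation is thread-safe.

```java
import javax.xml.transform.TransformerFactory;

public class SharedTransformerFactory {
    // One factory per thread: the lookup and initialisation run only on first use in each thread.
    private static final ThreadLocal<TransformerFactory> FACTORY =
            ThreadLocal.withInitial(TransformerFactory::newInstance);

    /** Returns the calling thread's cached factory instead of creating a new one per call. */
    static TransformerFactory get() {
        return FACTORY.get();
    }

    public static void main(String[] args) {
        TransformerFactory first = get();
        TransformerFactory second = get();
        // Both calls return the same instance on the same thread.
        System.out.println(first == second);
    }
}
```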

We could consider, in a future release, opting out of JAXP by putting all the factory classes in a separate JAR file which doesn't need to be on the classpath unless you actually want to invoke Saxon using JAXP.

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Markus Karg 6 months ago

I really do share your vision that upfront initialization and then reusing objects is best for performance, and I really would beg you to not consider opting out of JAXP. Instead, I will try finding some time to support the JAXB-RI team; possibly they would accept a pull request from me, preventing the repeated TransformerFactory.newInstance() calls.

Nevertheless, I just wanted to share some more benchmarking with you, as there might still be potential for optimization. First of all, you are right with your assumption that the service loader needs a long time to check all the JARs. In fact, it eats up two thirds of those six seconds mentioned above (this is easy to simulate: just pass the class name to newInstance() and the search time is gone). OTOH there still is one third left, which I noticed is completely bound to new Configuration(). That rather lengthy constructor actually eats up two of the six seconds, which is incredibly long. Again, I do understand that upfront initialization is best, and I do not know your code in deeper detail so far, but I would not rule out the idea of performing lazy initialization at least for those parts that are unlikely to be used and/or are causing the biggest delays (the idea is: if it is seldom used, it makes no sense to construct it upfront). And yes, I have read the comment in the code about exactly that being deliberately not done.

So I would propose to look into both: JAXB-RI first (as it is definitely done rather badly), followed by new Configuration(). I assume you would be willing to review my changes in case I come up with an efficient solution?

RE: Adding Saxon HE to Java classpath makes XML parsing very slow - Added by Michael Kay 6 months ago

If you discover anything about the initialization path for a Configuration, then of course that would be interesting. It's not something we have devoted a great deal of attention to, since it's usually a one-off cost.
