Project

Profile

Help

Bug #4326

NullPointerException in XSLGeneralIncorporate

Added by Radu Coravu about 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Internals
Sprint/Milestone:
-
Start date:
2019-09-30
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
9.9, trunk
Fix Committed on Branch:
9.9, trunk
Fixed in Maintenance Release:
Platforms:

Description

Possibly related with: https://saxonica.plan.io/issues/2932 Needs some complex steps to reproduce.

  1. Using the DITA Open Toolkit 3.3.4 downloaded from here: https://www.dita-ot.org/download
  2. Create a folder somewhere on your disk named "projets en développement", unzip the DITA OT there. It's important to use that French accent character there.
  3. Open a console, change directory to the "DITA-OT\bin" folder, run:
dita -i ../docsrc/site.ditamap -f html5 -v
  1. The transformation will end with a NullPointerException, I had to patch the DITA OT code to obtain the full stack trace which I'm pasting below. The NullPointerException can also be reproduced using a hand-built DITA OT kit which has Saxon 9.9 bundled.
    [junit]      [exec]    [mapref] java.lang.NullPointerException
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.XSLGeneralIncorporate.getIncludedStylesheet(XSLGeneralIncorporate.java:127)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.StylesheetModule.spliceIncludes(StylesheetModule.java:489)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.XSLGeneralIncorporate.getIncludedStylesheet(XSLGeneralIncorporate.java:158)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.StylesheetModule.spliceIncludes(StylesheetModule.java:489)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.PrincipalStylesheetModule.spliceUsePackages(PrincipalStylesheetModule.java:505)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.PrincipalStylesheetModule.preprocess(PrincipalStylesheetModule.java:331)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.Compilation.compilePackage(Compilation.java:286)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.StylesheetModule.loadStylesheet(StylesheetModule.java:259)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.style.Compilation.compileSingletonPackage(Compilation.java:107)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.s9api.XsltCompiler.compile(XsltCompiler.java:785)
    [junit]      [exec]    [mapref] 	at net.sf.saxon.jaxp.SaxonTransformerFactory.newTemplates(SaxonTransformerFactory.java:149)
    [junit]      [exec]    [mapref] 	at org.dita.dost.module.XsltModule.execute(XsltModule.java:106)
    [junit]      [exec]    [mapref] 	at org.dita.dost.ant.ExtensibleAntInvoker.execute(ExtensibleAntInvoker.java:182)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
    [junit]      [exec]    [mapref] 	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    [junit]      [exec]    [mapref] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    [junit]      [exec]    [mapref] 	at java.lang.reflect.Method.invoke(Method.java:498)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:99)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Task.perform(Task.java:350)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Target.execute(Target.java:449)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Target.performTasks(Target.java:470)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1391)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:36)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Project.executeTargets(Project.java:1254)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:437)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.taskdefs.CallTarget.execute(CallTarget.java:106)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
    [junit]      [exec]    [mapref] 	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    [junit]      [exec]    [mapref] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    [junit]      [exec]    [mapref] 	at java.lang.reflect.Method.invoke(Method.java:498)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:99)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Task.perform(Task.java:350)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Target.execute(Target.java:449)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Target.performTasks(Target.java:470)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1391)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Project.executeTarget(Project.java:1364)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
    [junit]      [exec]    [mapref] 	at org.apache.tools.ant.Project.executeTargets(Project.java:1254)
    [junit]      [exec]    [mapref] 	at org.dita.dost.invoker.Main.runBuild(Main.java:637)
    [junit]      [exec]    [mapref] 	at org.dita.dost.invoker.Main.startAnt(Main.java:196)
    [junit]      [exec]    [mapref] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    [junit]      [exec]    [mapref] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    [junit]      [exec]    [mapref] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    [junit]      [exec]    [mapref] 	at java.lang.reflect.Method.invoke(Method.java:498)

You may want to edit the "DITA-OT\bin\dita" script and add first in the classpath your Saxon classes folder so that you can add logging details. I added some logging and patched some Saxon classes but I was not able to understand the problem. The problem has been around for a long time, again it's probably related to the fact that content is cached with one key and then retrieved with a different one.

History

#1 Updated by Michael Kay about 2 years ago

Thanks.

I wonder if the DITA code includes its own URIResolver? I've always thought there might be a problem if an application-supplied URIResolver does absolutization differently from the way we think it should be done.

We first load stylesheet modules these days from the UseWhenFilter, handling the xsl:include during the initial parse, and then we expect to retrieve the module (as a document tree) when we revisit the xsl:include instruction later, after tree construction, This relies on the base URI being computed in the same way on both occasions, and my suspicion is that this isn't happening if the path contains special characters that aren't allowed in URIs.

#2 Updated by Radu Coravu about 2 years ago

DITA OT starts this particular "mapref" XSLT stage from the Java code. The code creates a Xerces catalog manager, it looks something like:

    /**
     * Get CatalogResolver.
     * @return CatalogResolver
     */
    public static synchronized CatalogResolver getCatalogResolver() {
        if (catalogResolver == null) {
            final CatalogManager manager = new CatalogManager();
            manager.setIgnoreMissingProperties(true);
            manager.setUseStaticCatalog(false); // We'll use a private catalog.
            manager.setPreferPublic(true);
            final File catalogFilePath = new File(ditaDir, Configuration.pluginResourceDirs.get("org.dita.base") + File.separator + FILE_NAME_CATALOG);
            manager.setCatalogFiles(catalogFilePath.toURI().toASCIIString());
            //manager.setVerbosity(10);
            catalogResolver = new CatalogResolver(manager);
        }

        return catalogResolver;
    }
transformerFactory = (SAXTransformerFactory) TransformerFactory.newInstance();
        transformerFactory.setURIResolver(CatalogUtils.getCatalogResolver());

It might be interesting in the code above that the reference to the base catalog is something like this "catalogFilePath.toURI().toASCIIString()". I think that this base catalog reference will later be used to resolve relative references to all XSLT mappings defined in the XML catalog. The "toASCIIString()" corrects all non-ASCII characters to their escaped URL equivalents. If something like this would have been used "catalogFilePath.toURI().toString()" then only spaces would have been converted to URL encoded equivalents.

It's problematic to know how much an URL should be corrected. So it might be possible that at some point you need to compare in the cache an URL which has been drastically corrected (all non-ASCII chars replaced with equivalents) and an URL which has been moderately corrected (only spaces and a handful of other symbols have been replaced with equivalents)

#3 Updated by Michael Kay about 2 years ago

I think the issue is consistency: we need to generate a key that identifies the document consistently. This is difficult when part of the job of generating the key is delegated to the user-supplied URIResolver. It doesn't really matter here whether the key is a legal URI, only that it's (a) consistent and (b) highly likely to be unique. We can't really trust the systemId in the Source object returned by a user-written URIResolver to have those properties -- it could be anything. And we don't want to call it more than once, because it might do expensive fetching of resources across the network.

It's always been a weakness of the URIResolver interface that it doesn't separate relative URI resolution from URI dereferencing.

#4 Updated by Michael Kay about 2 years ago

Before actually trying to run the DITA test case, I'm looking at this by code inspection. The question is, under what circumstances could the base URI for the including module be computed differently (a) during the (SAX-like) UseWhenFilter prepass, and (b) by calling getBaseURI() on the xsl:include element in the constructed document? Since the code for the two cases is quite distinct, it's easy to imagine that there might be circumstances where the results are different: but at the moment I'm having difficulty constructing an example.

The UseWhenFilter starts with the systemId set on the Receiver, converts it to a URI using new URI(), and then maintains a Stack to hold the base URI at each level. If an xml:base attribute is seen, it is resolved against the URI supplied in the location information by the XML parser (not against the base URI of the parent), which could lead to complications when external entities or XInclude are involved. There could also potentially be issues if the systemId set in the Source object differs from the systemId reported in the location information. This could happen for example if the input is a SAXSource whose XMLReader reports a location unrelated to the systemId property of the SAXSource.

So far, however, I haven't been able to come up with a scenario that fails in practice.

#5 Updated by Radu Coravu about 2 years ago

I modified the "dita-ot-3.3.4\bin\dita.bat" and added a reference to my compiled Saxon patch classes in the classpath list:

 -cp "D:\projects\oxygen-dita-ot-3.x\target\classes;%DOST_PATCHES_CP%;%CLASSPATH%"

After this I added more logging in the "net.sf.saxon.style.XSLGeneralIncorporate.getIncludedStylesheet(StylesheetModule, int)" right before the NPE happens, I logged the href and the base URI, then the document key created for them. I also logged the "net.sf.saxon.style.Compilation.getStylesheetModules()" map.

   [mapref] HREF plugin:org.dita.base:xsl/common/dita-utilities.xsl base file:/d:/projets en développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/preprocess/maprefImpl.xsl
   [mapref] DOC KEY file:/d:/projets en développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/preprocess/maprefImpl.xsl/../plugin:org.dita.base:xsl/common/dita-utilities.xsl

One problem so far: The reference "plugin:org.dita.base:xsl/common/uri-utils.xsl" is mapped through the XML catalog support to some XSLT location on disk. So in the map of stylesheet modules the key is "plugin:org.dita.base:xsl/common/uri-utils.xsl" but the key constructed for the href and the base seems to try and make the href relative to the base although the href is a kind of URI mappable through the XML catalog.

Then I continued, there is an "if" clause in the "XSLGeneralIncorporate":

includedSheet = (XSLStylesheet)psm.getStylesheetDocument(key);
            if (includedSheet != null) {
             ......

I added some logging in the "PrincipalStylesheetModule" and its internal cache contains only:

  [mapref] IN CACHE SDM {plugin:org.dita.base:xsl/preprocess/maprefImpl.xsl=net.sf.saxon.style.XSLStylesheet@2b7774d5}

so the code continues on the "else" and does this:

Map<DocumentURI, TreeInfo> map = getCompilation().getStylesheetModules();

so I added some logging to see what's inside the map:

 [mapref] IN MAP {plugin:org.dita.base:xsl/common/uri-utils.xsl=net.sf.saxon.tree.linked.DocumentImpl@3ef2b8e5, plugin:org.dita.base:xsl/common/dita-utilities.xsl=net.sf.saxon.tree.linked.DocumentImpl@49190ed6, plugin:org.dita.base:xsl/common/functions.xsl=net.sf.saxon.tree.linked.DocumentImpl@5d717f19, plugin:org.dita.base:xsl/common/output-message.xsl=net.sf.saxon.tree.linked.DocumentImpl@18715bb, plugin:org.dita.base:xsl/preprocess/maprefImpl.xsl=net.sf.saxon.tree.linked.DocumentImpl@2a19a0fe}

So inside the map we have a mapping for the URI "plugin:org.dita.base:xsl/common/dita-utilities.xsl" to a compiled document. But the DocumentKey looks something like "file:/d:/projets en développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/preprocess/maprefImpl.xsl/../plugin:org.dita.base:xsl/common/dita-utilities.xsl" so it's not found in the map.

#6 Updated by Radu Coravu about 2 years ago

Then I added some logging in "UseWhenFilter.processIncludeImport(NodeName, Location, URI, boolean)" to see what the document key set in the stylesheet modules map looks like:

  [mapref] COMPUTE DOC KEY plugin:org.dita.base:xsl/common/uri-utils.xsl base file:/d:/projets%20en%20développement/oxygen-dita-ot-3.x/bin/file:/d:/projets%20en%20développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/common/dita-utilities.xsl resolver class    [mapref] DOC KEY plugin:org.dita.base:xsl/common/uri-utils.xsl

so somehow because the base is in this case URL escaped (but not fully, just the spaces) the "DocumentFn.computeDocumentKey(String, String, PackageData, URIResolver, boolean)" method returns a different key value.

#7 Updated by Radu Coravu about 2 years ago

So back to the "DocumentFn.computeDocumentKey" method, it does something like this:

 try {
                    URI uri = new URI(baseURI).resolve(href);
                    documentKey = uri.toString();
                } catch (URISyntaxException | IllegalArgumentException err) {
                    documentKey = baseURI + "/../" + href;
                }

Because when called from the "XSLGeneralIncorporate" the base URI contains spaces in it, creating an URI over it fails so the document key is defined in the catch statement.

So your suspicions are correct, the "net.sf.saxon.om.NodeInfo.getBaseURI()" method returns an invalid URI which does not escape white spaces.

#8 Updated by Radu Coravu about 2 years ago

The transform called from the DITA code seems to do some pretty basic stuff:

TransformerFactory tf = TransformerFactory.newInstance();
....
templates = tf.newTemplates(new StreamSource(style));

so it does not set a custom XMLReader to the initial source. Can I add some significant logging in the DITA OT Java code to help further?

#9 Updated by Michael Kay about 2 years ago

Thanks for all your detective work on this. I haven't had a chance to dive in more deeply yet.

#10 Updated by Radu Coravu about 2 years ago

With pleasure. As I was blocked because I did not understand how the "NodeInfo.getBaseURI()" is computed on the method "XSLGeneralIncorporate.getIncludedStylesheet(StylesheetModule, int)", I started looking again into the problem from the other side, from the DITA OT code.

On the DITA Open Toolkit side, the initial StreamSource with the main XSLT passed to the transformer uses the constructor "javax.xml.transform.stream.StreamSource.StreamSource(File)" which properly seems to escape both spaces and that French character.

But then I looked at the URI resolver set on the transformer factory and instead of it being the Xerces Catalog implementation it (as I originally thought) is actually Apache ANT's "org.apache.tools.ant.types.XMLCatalog" implementation which from my previous experiences I consider to be quite buggy. On the resolve(href, base) callback it returns something like:

 RESOLVE "plugin:org.dita.base:xsl/common/uri-utils.xsl" base "file:/d:/projets%20en%20développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/common/dita-utilities.xsl" to "file:/d:/projets en développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/common/uri-utils.xsl"

So the ANT XMLCatalog implementation is responsible of returning those Sources which have system IDs without whitespaces correctly escaped.

So if you would want to fix something on the Saxon side, you would need to check if the Source returned by the URIResolver used to resolve xsl:imports and includes has a system ID with illegal URI characters and then correct the system ID before using it further.

#11 Updated by Michael Kay over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Michael Kay

Radu, I'm coming back to this (trying to tidy up unfinished work), and I'm unclear of the status. Is there anything you think we need to do here?

#12 Updated by Radu Coravu over 1 year ago

Looking at my last notes on the bug, the DITA OT publishing engine uses in some places the buggy Ant catalog resolver implementation instead of the Apache resolver. I added an issue some time ago on the DITA OT issues list for this: https://github.com/dita-ot/dita-ot/issues/3376

So the question which remains is if Saxon should strive to work better with any implementation of the URI resolver interface, even if this means trying to fix on its side problems in the URI resolver implementation. In this particular case Saxon asks the resolver to resolve the URI "plugin:org.dita.base:xsl/common/uri-utils.xsl" and receives this resolved reference: "file:/d:/projets en développement/oxygen-dita-ot-3.x/plugins/org.dita.base/xsl/common/uri-utils.xsl"

Should Saxon in this case detect that the system ID returned by the URI resolver has spaces and uncorrected characters inside? And if it has should Saxon URL encode the returned system ID before using it further?

#13 Updated by Michael Kay over 1 year ago

The specifications of what exactly a URIResolver is allowed to return, and what exactly an XMLReader implementation will accept, are both pretty fuzzy round the edges. Saxon is caught in the middle here; we simply don't know whether a URI (or wannabe-URI) returned by the URIResolver is going to be acceptable to the XMLReader or not. We know that some XMLReaders are quite liberal in what they accept, for example some are known to accept a Windows filename, so it's very likely that there are URIResolvers out in the wild that take advantage of that. So I don't think Saxon can act as policeman or referee; and if we try to impose any rules, it's likely we will stop some existing applications working.

But of course we shouldn't be throwing a NullPointerException.

So this raises the question of whether we can avoid the mechanism that causes the problem, namely that when we resolve the relative URI against the base URI during XSLInclude processing on the assembled stylesheet tree, we get a different answer from the one obtained during the use-when pass.

We're certainly using the same href value in both cases, so we must be using a different value for the base URI. Looking more closely, during use-when processing, the base URI that we use is the one that we compute from our information about the including stylesheet, whereas during XSLInclude processing, the base URI is the base URI of the systemId of the included stylesheet taken from the source object returned by the URIResolver. There are probably many reasons why these can legitimately differ, for example case normalization, HTTP redirects, etc.

So what we really need to do is to find a different mechanism for locating the right stylesheet document in the cache during the second phase of processing.

Although use-when processing is done as a SAX-like filter, the next thing in the pipeline after the filter is always the stylesheet builder, so it wouldn't be a major breach of the design for the UseWhenFilter to gain direct access to the XSLInclude node being constructed, and put the a direct link to the DocumentImpl of the included stylesheet in the XSLInclude node of the including stylesheet, rather than relying on recomputing the document key. I shall experiment with that approach.

#14 Updated by Michael Kay over 1 year ago

  • Category set to Internals
  • Status changed from In Progress to Resolved
  • Priority changed from Low to Normal
  • Applies to branch 9.9, trunk added
  • Fix Committed on Branch 9.9, trunk added

As suggested, I have modified the UseWhenFilter and XSLGeneralIncorporate classes so the UseWhenFilter now directly adds the included document node as a property of the XSLGeneralIncorporate node being constructed in the stylesheet tree, where it can subsequently be referenced without going via the stylesheet module cache. The cache is still there, to avoid loading or compiling modules repeatedly when they are included/imported repeatedly.

#15 Updated by Radu Coravu over 1 year ago

Thank you. I will try to keep an eye out when the next Saxon 9.9 minor version is released, use it in a DITA OT folder located on such a file path containing French characters and see if the proper occurs again.

#16 Updated by O'Neil Delpratt over 1 year ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.9.1.7 added

Patch applied in the 9.9.1.7 maintenance release.

#17 Updated by Radu Coravu over 1 year ago

Thanks Michael. I confirm the NullPointerException is fixed.

Please register to edit this issue

Also available in: Atom PDF