Project

Profile

Help

Bug #5582

open

External resolver library returns SAXSource which StandardUnparsedTextResolver rejects

Added by Norm Tovey-Walsh about 2 months ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
JAXP Java API
Sprint/Milestone:
-
Start date:
2022-06-27
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
11, trunk
Fix Committed on Branch:
11, trunk
Fixed in Maintenance Release:
Platforms:
Java

Description

From Stack Overflow,

I'm using Saxon 11 from ANT. XSLT transformation works, except that the unparsed-text() function triggers a fatal error. Error message:

Resolver for unparsed-text() returned non-StreamSource;

The problem here is that StandardUnparsedTextResolver expects the resolved resource to be a StreamSource. This is likely an unreasonable expectation for third party resolvers. Ant, in particular, returns a SAXSource which seems like a reasonable lowest-common-denominator API.

Should we support a SAXSource response?


Related issues

Has duplicate Saxon - Bug #5595: Saxon 11 - error when using wrapper over ResourceResolverWrappingURIResolverDuplicate2022-07-08

Actions
Actions #2

Updated by Michael Kay about 2 months ago

Some reservations because SAXSource is obviously designed to handle XML input, but if it's needed to make Ant work then I suppose we should do it.

Actions #3

Updated by Michael Kay about 2 months ago

It does seem very odd to me to use a SAXSource (rather than a StreamSource) for a non-XML input, but if there are important legacy applications that do it, then it's certainly possible to support it. Of course the InputSource of the SAXSource must be non-null, and we need to allow for the InputSource to contain any combination of binary input stream, character stream, or system ID (URI).

Actions #4

Updated by Michael Kay about 1 month ago

I think the place to do this is probably in the class ResourceResolverWrappingURIResolver.

The contract for ResourceResolver says that if a non-XML resource is requested, the Source returned should be a StreamSource. I think that ResourceResolverWrappingURIResolver should attempt to honour this contract by converting any SAXSource returned by the URIResolver into a StreamSource.

So on both branches where we call uriResolver.resolve(), if the returned Source is a SAXSource and the request.nature is TEXT_NATURE or XQUERY_NATURE, we should attempt to return an equivalent StreamSource, using logic such as:

            if (resolved instanceof SAXSource && ((SAXSource)resolved).getInputSource() != null) {
                InputSource is = ((SAXSource) resolved).getInputSource();
                StreamSource ss = new StreamSource();
                ss.setInputStream(is.getByteStream());
                ss.setReader(is.getCharacterStream());
                ss.setSystemId(is.getSystemId());
                ss.setPublicId(is.getPublicId());
                return ss;
            }

An InputSource, unlike a StreamSource, can include an encoding. If the InputSource contains a byteStream and an encoding then we should set the Reader of the returned StreamSource to an InputStreamReader that encapsulates the byteStream and the encoding.

Actions #5

Updated by Michael Kay about 1 month ago

The class ResourceResolverWrappingURIResolver is present in SaxonCS but I think it is entirely dormant. Either we try to eliminate all the code that refers to URIResolver from SaxonCS, or we live with it, in which case the proposed code has to be marked as excluded because the SAXSource class does not exist in SaxonCS.

Perhaps we should adopt a minimum change approach on the 11.x branch and be a bit more surgical on 12.x.

Actions #6

Updated by Michael Kay about 1 month ago

  • Has duplicate Bug #5595: Saxon 11 - error when using wrapper over ResourceResolverWrappingURIResolver added
Actions #7

Updated by Michael Kay about 1 month ago

  • Category set to JAXP Java API
  • Status changed from New to Resolved
  • Assignee set to Michael Kay
  • Priority changed from Low to Normal
  • Applies to branch 11, trunk added
  • Fix Committed on Branch 11, trunk added
  • Platforms Java added

Fixed, and added unit tests (though not specifically with Ant).

I took the opportunity to drop code from the SaxonCS build that is not needed on the .NET platform, in particular, the URIResolver class no longer exists and is no longer referenced.

Actions #8

Updated by Radu Coravu about 1 month ago

tthe problem is in a way more complex, adding this as a comment here just in case you want to do something about this in Saxon (I can probably workaround this on our side).

  • I set an XMLReaderMaker net.sf.saxon.lib.ParseOptions.setXMLReaderMaker(Maker) on which I create my own XMLReader instance with its own custom entity resolver.
  • The method net.sf.saxon.Configuration.getSourceParser() sets on my XMLReader implementation a ChainedEntityResolver between your resource resolver and mine. In my opinion here my EntityResolver should be stronger, it should be the first in the ChainedEntityResolver and not the last. Because if it's the last, it never gets called, even if the resolver set up in Saxon has no idea how to resolve my entity system ID.
  • The method "org.xmlresolver.CatalogResolver.resolveEntity(String, String, String, String)" always tries to return "something" instead of returning null when it cannot find anything. For example if it gets called with the URL "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd", instead of returning null as it has no mapping for it and give other resolvers a chance, it actually tries to open a connection to the URL (CatalogResolver.uncachedResource(URI, URI)) which I sure don't want an XML catalog resolver to be doing. I prefer the XML catalog resolver to be simple instead of having some kind of caching mechanism inside for resolved resources. Also I cannot disable its caching from the outside, once the resolver has been created.
  • Back to the current fix in the "net.sf.saxon.lib.ResourceResolverWrappingURIResolver.resolve(ResourceRequest)" class where a SaxSource is converted to a StreamSource, sometimes the SaxSource returned by the wrapped URIresolver has the same system ID as the original system ID (meaning that nothing was resolved), in which case a null StreamSource should be returned, something like I did on my side:
    /**
     * Oxygen PATCH for... Convert not allowed sax sources to stream sources.
     * 
     * @param resolved The initial Sax Source.
     * @param request 
     * @return A converted stream source or null.
     */
    private static Source saxSourceToStreamSource(Source resolved, ResourceRequest request) {
      SAXSource ss = (SAXSource) resolved;
      if(request.uri != null && request.uri.equals(resolved.getSystemId())) {
        //Actually nothing was resolved
        return null;
      }
      StreamSource str = new StreamSource();
      InputSource inputSource = ss.getInputSource();
      if(inputSource != null) {
        str.setInputStream(inputSource.getByteStream());
        str.setPublicId(inputSource.getPublicId());
        str.setSystemId(inputSource.getSystemId());
        str.setReader(inputSource.getCharacterStream());
        resolved = str;
      }
      return resolved;
    }
Actions #9

Updated by Michael Kay about 1 month ago

  • Status changed from Resolved to In Progress

Reopening so the new information doesn't get lost.

Actions #10

Updated by Radu Coravu about 1 month ago

Also here "net.sf.saxon.lib.ChainedEntityResolver.resolveEntity(String, String, String, String)" if the second instance is non an EntityResolver2, maybe you can have a fallback like:

if (is == null) {
          if(second instanceof EntityResolver2) {
            is = ((EntityResolver2) second).resolveEntity(name, publicId, baseURI, systemId);
          } else {
            is = second.resolveEntity(publicId, systemId);
          }
        }
Actions #11

Updated by Norm Tovey-Walsh about 1 month ago

The fact that you can't disable the cache after the resolver is created is definitely a bug.

Actions #12

Updated by Norm Tovey-Walsh about 1 month ago

It's worth noting that, with respect to http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd, the resolution comes from the catalog in the XmlResovler Data jar, not from accessing the web.

Actions #13

Updated by Norm Tovey-Walsh about 1 month ago

Hi Radu,

I've published XML Resolver 4.4.1 that fixes the bug where disabling the cache has no effect on the already existing resolver. You should now be able to use

resolver.getConfiguration().setFeature(ResolverFeature.CACHE_ENABLED, false);

to disable the cache on the catalog resolver. That'll make it return null for resources it doesn't find in a catalog.

Please do let me know if you have a chance to try it out and if it improves things for you. Still considering the other issues...

Actions #14

Updated by Norm Tovey-Walsh about 1 month ago

FWIW, the C# resolver doesn't exhibit this bug (and I've added a test to demonstrate that). The C# resolver has a slightly different design and implements a slightly different API. It's never allowed to return null anyway.

Actions #15

Updated by Norm Tovey-Walsh about 1 month ago

  • Status changed from In Progress to Resolved

I applied the fix to return null if the resolver did nothing, updated the dependency to XML Resolver 4.4.2 (I accidentally uploaded 4.4.1 compiled with Java 11), and updated the mirror repository with the latest Saxon 11 code.

I think that fixes everything! :-)

Actions #16

Updated by Norm Tovey-Walsh about 1 month ago

In another channel, Mike points out that returning the same system identifier doesn't, in the general case, mean that the resolver didn't do anything. It might be returning an InputStream or a DOMSource or something as well. So that probably doesn't belong in the Saxon object.

Actions #17

Updated by Norm Tovey-Walsh about 1 month ago

  • Status changed from Resolved to In Progress
Actions #18

Updated by Radu Coravu about 1 month ago

Norm, I just saw your comments now, I had not registered to watch this issue and I did not receive emails for your comments. I added an extra issue here for two of my initial suggestions: https://saxonica.plan.io/issues/5605 but you can decide if they are useful or not.

Actions #19

Updated by Radu Coravu about 1 month ago

I updated the xml catalog library on the Oxygen side to version 4.4.2 but we are setting our own resolver "Configuration.setResourceResolver(ResourceResolver)" and in our resolver I took a few days ago the decision not to delegate to the default Saxon resolver at all, even as a fallback. We are in a way continuing to use the regular Apache XML resolver for resolving entities and uri's because it's still used for parsing the XML code using the Xerces parser so it's more consistent to use it also for Saxon's processing. What would be the benefits to try and migrate our catalog resolution entirely to your xmlresolver implementation? Maybe this would be also good as a frequently asked question here: https://xmlresolver.org/

Actions #20

Updated by Michael Kay about 1 month ago

The main thing to be aware of is that the standard W3C entities (such as XHTML DTDs, etc) are no longer included directly in Saxon, they are now included in the XmlResolverData package.

Actions #21

Updated by Radu Coravu about 1 month ago

I did not know Saxon had some default ways to resolve certain resource references but this should not influence things much, Oxygen comes with XML catalogs for the XHTML DTDs. The only situation in which our custom resources resolver we set on the configuration now still calls the Saxon default resolver is when the base URI starts with something like "classpath:", I think this is because sometimes Saxon references (or used to refer) certain XML Schemas bundled inside it with this protocol.

Please register to edit this issue

Also available in: Atom PDF