Bug #4839

collection(): failed to parse XML file

Added by O'Neil Delpratt 11 months ago. Updated 4 months ago.

Status: In Progress
Priority: Normal
Category: .NET API
Sprint/Milestone: -
Start date: 2020-11-25
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 10, trunk
Fix Committed on Branch: 10, trunk
Fixed in Maintenance Release:
Platforms:

Description

Reported by user on saxon mailing list:

When the collection() function is used in an XSLT stylesheet, the following error occurs:

collection(): failed to parse XML file file:/D:/Test/sample1.xml: I/O error reported by XML parser processing file:/D:/Test/sample1.xml: Could not find file 'D:\Test\doctype.dtd'.

It appears the default CollectionFinder does not use XsltTransformer.InputXmlResolver.

The problem is that the C# .NET code does not have its own CollectionFinder, as the Java product does.

History

#1 Updated by Emanuel Wlaschitz 11 months ago

Thanks for logging this!

As far as I can tell, the CollectionFinder does exist and we can set one using processor.SetProperty(Feature<CollectionFinder>.COLLECTION_FINDER, new CustomCollectionFinder()) (or using transformer.Implementation.setCollectionFinder(new CustomCollectionFinder()) to be more localized) and its findCollection method will be called.

I just don't see how we could affect how documents are loaded, as the documentation seems to suggest it only returns URIs and not loaded documents.

#2 Updated by Michael Kay 11 months ago

Sorry to confuse. What I meant to say was that the default in the .NET product is to use the Java CollectionFinder, which of course has no knowledge of .NET-specifics like the XmlResolver. (I'm not even sure if that statement is correct, it needs further investigation).

#3 Updated by O'Neil Delpratt 11 months ago

  • Tracker changed from Bug to Feature

#4 Updated by O'Neil Delpratt 10 months ago

  • Status changed from New to In Progress
  • Applies to branch deleted (9.9)

I have added the CollectionFinder feature to .NET, which will be available in the next maintenance release.

Users can now define their own ICollectionFinder and set it on the Processor object to be used in the XQuery, XPath and XSLT APIs.

As in Java, we now have an IResourceCollection interface to map the URI of a collection to a sequence of Resource objects. A number of implementations of IResourceCollection are available for users: CatalogCollection, JarCollection and DirectoryCollection.

NUnit tests added.

I am leaving this bug issue open until API doc is complete.

#5 Updated by O'Neil Delpratt 10 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
  • Fix Committed on Branch 10, trunk added

Bug fixed and committed on Saxon10 and trunk branches.

#6 Updated by Emanuel Wlaschitz 9 months ago

Just checking, this CollectionFinder will allow us to set a custom .NET XmlResolver to be used when loading individual entries of the collection, right?

#7 Updated by O'Neil Delpratt 9 months ago

Yes, the CollectionFinder should use your custom XsltTransformer.InputXmlResolver. If you have a sample application I will be happy to test your setup with this new feature.

#8 Updated by Emanuel Wlaschitz 9 months ago

We don't have a ready-to-use sample application, but we can make one.

Since we don't have the change yet, we ran the following with transform.exe collection.xsl -it:main - but I'm confident you can turn this into a .NET testcase:

collection.xsl (which simply looks at all XML files in the same folder and prints their root element name)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">

    <xsl:template name="main">
        <xsl:apply-templates select="collection('./?select=*.xml')" mode="print-root" />
    </xsl:template>

    <xsl:template match="/" mode="print-root">
        <xsl:text>&#xA;</xsl:text>
        <xsl:value-of select="name(/*)" />
    </xsl:template>

</xsl:stylesheet>

a.xml in same folder as collection.xsl (the DTD does not really matter, but removing it allows the XSLT to run)

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test SYSTEM "does-not-exist.dtd">
<a/>

You can add more XML files if you want, both with and without DOCTYPE, where the ones with a System ID trigger the exception from this issue.

In C#, we'd use a custom XmlResolver like this:

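using System;
using System.IO;
using System.Net;
using System.Xml;

// Wraps another XmlResolver and returns an empty stream for any DTD or XSD,
// so missing DOCTYPE system IDs do not cause an I/O error.
public class DtdIgnoringResolver : XmlResolver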
{
    private readonly XmlResolver _innerResolver;

    public DtdIgnoringResolver(XmlResolver innerResolver)
    {
        _innerResolver = innerResolver ?? throw new ArgumentNullException(nameof(innerResolver));
    }

    public override ICredentials Credentials
    {
        set { _innerResolver.Credentials = value; }
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (!string.IsNullOrEmpty(absoluteUri?.OriginalString) && IsDtdOrSchema(absoluteUri.OriginalString))
            return Stream.Null;
        return _innerResolver.GetEntity(absoluteUri, role, ofObjectToReturn);
    }

    private static bool IsDtdOrSchema(string filePath)
    {
        return
            filePath.EndsWith(".dtd", StringComparison.InvariantCultureIgnoreCase) ||
            filePath.EndsWith(".xsd", StringComparison.InvariantCultureIgnoreCase);
    }
}

The piece of code we use is a bit more involved and usually applies the XSLT to a file (rather than running a named template), but I'm sure this suffices. If not, let me know.

Thanks again.

#9 Updated by O'Neil Delpratt 9 months ago

Hi,

Thanks for sending this example, which I used to create an NUnit test case. I can confirm it now works as expected: I am no longer seeing the exception.

#10 Updated by O'Neil Delpratt 9 months ago

  • Status changed from Resolved to In Progress

I am reopening this bug issue because we are still seeing the exception described in the initial report.

There was a bug in the test case in comment #9. I have set the InputXmlResolver to the DtdIgnoringResolver on the Xslt30Transformer, but this is not being filtered through to the CollectionFinder.

Further investigation required.

#11 Updated by Michael Kay 7 months ago

  • Tracker changed from Feature to Bug

#12 Updated by O'Neil Delpratt 6 months ago

  • Status changed from In Progress to Resolved

Bug fixed and a set of NUnit tests created.

#13 Updated by O'Neil Delpratt 6 months ago

  • Status changed from Resolved to Closed
  • Fixed in Maintenance Release 10.5 added

Bug fix applied to Saxon 10.5 maintenance release.

#14 Updated by Emanuel Wlaschitz 4 months ago

Just checking, this should be available with Saxon-HE 10.5.1N, and it should just work if we set XsltTransformer.InputXmlResolver (without having to override the CollectionFinder) - right? Do note that we're not (yet) using Xslt30Transformer - mainly because we assumed they'd be equivalent anyways; but even when changing the code to use Xslt30Transformer we see the same exception as reported initially (I/O error because the DTD is not found; with our DtdIgnoringResolver not being called)

So far, we were unsuccessful in getting this to work; and the Repository only has 10.3 available so I can't look at your tests.

Is there anything we're missing?

#15 Updated by Michael Kay 4 months ago

  • Status changed from Closed to In Progress

#16 Updated by Martin Honnen 4 months ago

Emanuel Wlaschitz wrote:

So far, we were unsuccessful in getting this to work; and the Repository only has 10.3 available so I can't look at your tests.

I think the repository should now be at e.g. https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/saxon10/entry/src/test/nunit/SaxonNUnit/SaxonNUnit/TestCollection.cs#L694 but I am not sure it is really in sync.

#17 Updated by Emanuel Wlaschitz 4 months ago

Martin Honnen wrote:

I think the repository should now be at e.g. https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/saxon10/entry/src/test/nunit/SaxonNUnit/SaxonNUnit/TestCollection.cs#L694 [...]

Thanks, I didn't realize this was a separate repository/project/whatever.

Martin Honnen wrote:

[...] but I am not sure it is really in sync.

It does look like it is missing the test data: /src/test/testdata has files up to collection3.xsl, but I cannot find collection4.xsl (which is used by that test).

The main difference between the test and our code is the fact that the test sets XsltCompiler.XmlResolver before calling XsltCompiler.Compile() while we attempted to set XsltTransformer.InputXmlResolver after XsltExecutable.Load() (or the equivalent for Xslt30Transformer). Using XsltCompiler.XmlResolver seems to work, but we'll have to make a few adjustments to make it pretty.

Fortunately, it doesn't make a difference for us (because we don't reuse the XsltCompiler; we only reuse the Processor since we assumed it was a heavy class to create), but it really needs some documentation on how this works (especially since it differs from what is seen in this ticket).

Or was it supposed to also work when setting it on the XsltTransformer.InputXmlResolver? Naively I'd consider anything loaded by the Stylesheet (using document()/doc(), through collection() etc.) an Input, assuming it would go through that resolver, so the current behavior is a bit surprising.

#18 Updated by Michael Kay 4 months ago

The standard (default) CollectionFinder simply delegates to the underlying Java code. Unless this is customised (for example with a user-defined ResourceFactory for a particular media type) it will not use any URIResolver or XmlResolver to dereference the URIs of individual resources in the collection. The Java product offers a lot more exposed capability for customisation here than the .NET product, for example by defining additional types of ResourceCollection and associated ResourceFactory classes. This is all available in theory to the .NET user, but it's a lot more deeply buried.

If you set your own CollectionFinder then of course it's entirely up to you what it does. You can either return an instance of one of Saxon's IResourceCollection implementations (for example DirectoryCollection) or you can implement IResourceCollection yourself. There are two methods to implement: getResources which underpins fn:collection(), and getResourceURIs which underpins fn:uri-collection. You then have another level of flexibility in that getResources returns a set of IResource objects which you can implement any way you choose, in particular it's entirely up to you how to dereference a URI to get a resource and turn it into an XDM Item. You can of course use an XmlResolver if it suits you to do so.

If the user calls fn:uri-collection() they can then dereference the returned URIs using (say) doc() or unparsed-text() or json-doc() and in the case of doc() this will of course invoke the run-time XmlResolver.

I hope this helps: it's a complex piece of machinery.

#19 Updated by Emanuel Wlaschitz 4 months ago

Yeah, I realize this is a fairly complex thing with many gears turning. Our primary use case here is (still) loading files from a folder using collection($uri?select=$pattern), which simply fails (or rather: failed) when files in there use a DOCTYPE declaration with a SYSTEM id that isn't available. And the most convenient way for us (which we also use for the direct inputs that we fully control) is to plug a .NET XmlResolver in there that simply intercepts those and says "nope" for any DTD or XSD it encounters.

So in the end, the default CollectionFinder is suitable for our needs (in part because the code is already there and we'd just be re-implementing it in C#) under the assumption that its default getResources implementation uses the XmlResolver set anywhere up the tree.

👍 We've verified this is the case for XsltCompiler.XmlResolver when we set it before compiling the stylesheet.

👎 We couldn't get it to work when we only set XsltTransformer.InputXmlResolver (or Xslt30Transformer.InputXmlResolver), which was the initial assumption after reading O'Neil's note #10.

And that's why I commented on this issue again (since we were as stuck as before, not knowing how to support our use case properly). If this is now the intended way (using XsltCompiler.XmlResolver), I'm ok with it - I just wanted confirmation before making many cascading changes to support this on our side that would end up obsolete in another update.
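
For readers landing on this ticket, the arrangement reported as working in notes #17 and #19 can be sketched roughly as follows. This is a minimal illustration, not the Saxon test case: it assumes the Saxon 10.5 .NET API (Saxon.Api), a collection.xsl in the working directory as in note #8, and a compact copy of the DtdIgnoringResolver from that note. The class name CollectionDemo is hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;
using Saxon.Api;

// Compact copy of the DtdIgnoringResolver from note #8: returns an empty
// stream for DTDs and schemas, and delegates everything else.
public class DtdIgnoringResolver : XmlResolver
{
    private readonly XmlResolver _inner;

    public DtdIgnoringResolver(XmlResolver inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public override ICredentials Credentials
    {
        set { _inner.Credentials = value; }
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        string path = absoluteUri?.OriginalString ?? "";
        bool skip = path.EndsWith(".dtd", StringComparison.OrdinalIgnoreCase)
                 || path.EndsWith(".xsd", StringComparison.OrdinalIgnoreCase);
        return skip ? Stream.Null : _inner.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

// Hypothetical driver class, for illustration only.
public static class CollectionDemo
{
    public static void Main()
    {
        var processor = new Processor();
        XsltCompiler compiler = processor.NewXsltCompiler();

        // Set the resolver on the compiler BEFORE Compile(); per notes #17/#19
        // this is the placement that collection() was observed to honour.
        compiler.XmlResolver = new DtdIgnoringResolver(new XmlUrlResolver());

        XsltExecutable executable =
            compiler.Compile(new Uri(Path.GetFullPath("collection.xsl")));
        Xslt30Transformer transformer = executable.Load30();

        // Run the named template "main" and write the result to the console.
        Serializer output = processor.NewSerializer();
        output.SetOutputWriter(Console.Out);
        transformer.CallTemplate(new QName("main"), output);
    }
}
```

According to note #19, setting only XsltTransformer.InputXmlResolver (or Xslt30Transformer.InputXmlResolver) after loading the executable did not have the same effect in 10.5; whether that is intended is the open question in this ticket.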
