Bug #5336

closed

How to load/compile XSD 1.1 schema for XSLT 3 with SaxonCS 11.1?

Added by Martin Honnen almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: s9api API
Sprint/Milestone:
Start date: 2022-02-17
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 11, trunk
Fix Committed on Branch: 11, trunk
Fixed in Maintenance Release:
Platforms: .NET, Java

Description

With SaxonJ EE 11.1 I can run

    Processor processor = new Processor(true);

    processor.getSchemaManager().load(new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));

without problems, and it takes such a short amount of time that I am sure the schema is loaded with the XmlResolver and from its XmlResolverData.

With SaxonCS 11.1, however, the similar code

            Processor processor = new Processor(true);

            processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));

gives

Saxon.Api.SaxonApiException
  HResult=0x80131500
  Message=: Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
  Source=SaxonCS
  StackTrace:
   at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
   at Saxon.Api.SchemaManager.Compile(Uri uri)
   at SaxonCSCompileXSLT3SchemaTest.Program.Main(String[] args) in C:\SomeDir\SaxonCSCompileXSLT3SchemaTest\Program.cs:line 14

and takes a lot of time to give that error so I assume it might try to download the file from the W3C server.

Even when trying to set

            Processor processor = new Processor(true);

            (processor.Implementation.getConfigurationProperty(FeatureKeys.RESOURCE_RESOLVER) as CatalogResourceResolver).setFeature(ResolverFeature.URI_FOR_SYSTEM, true);

            processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));

it takes a lot of time and gives the same error:

Saxon.Api.SaxonApiException
  HResult=0x80131500
  Message=: Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
  Source=SaxonCS
  StackTrace:
   at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
   at Saxon.Api.SchemaManager.Compile(Uri uri)
   at SaxonCSCompileXSLT3SchemaTest.Program.Main(String[] args) in C:\SomeDir\SaxonCSCompileXSLT3SchemaTest\Program.cs:line 16

So how do I get SaxonCS 11.1 to load/compile the schema for XSLT 3.0 and hopefully have it load the XSD using the catalog and its data dll?

Actions #1

Updated by Michael Kay almost 3 years ago

I've added this as a unit test.

The method SchemaManager.Compile(Uri) hasn't been updated to use the new Resolver infrastructure.

Also it would be more consistent for SchemaManager to have a ResourceResolver property that can be set, rather than relying on the legacy SchemaResolver.

Actions #2

Updated by Michael Kay almost 3 years ago

I tried a workaround using DocumentBuilder.Build(uri) followed by SchemaManager.Compile(XdmNode), but DocumentBuilder.Build(uri) suffers the same problem.

I can retrieve the document from the local cache using

                XQueryCompiler compiler = proc.NewXQueryCompiler();
                XQueryEvaluator eval = compiler.Compile("doc('https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd')").Load();
                XdmNode schema = (XdmNode)eval.Evaluate();

But I still get some kind of failure when doing Compile(XdmNode) on the result.

Actions #3

Updated by Michael Kay almost 3 years ago

This latest problem is the XmlResolverWrappingResourceResolver class throwing "invalid URI" when the XmlReader passes it the public ID -//W3C//DTD XSD 1.1//EN. This path doesn't seem to include the usual hack of recognising this as a Public ID rather than a URI.
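
For illustration only, here is a minimal Java sketch of the kind of check that "hack" implies: deciding whether a string handed over as a URI is really a formal public identifier. The class and method names are hypothetical, and the prefix test is an assumption about what such a check could look like, not Saxon's actual code.

```java
final class PublicIdCheck {

    /**
     * Heuristic, assumed for this sketch: formal public identifiers
     * begin with "-//" (unregistered owner) or "+//" (registered owner),
     * which no absolute URI does.
     */
    static boolean looksLikePublicId(String value) {
        return value.startsWith("-//") || value.startsWith("+//");
    }

    public static void main(String[] args) {
        // The public ID from the XSD 1.1 schema's DOCTYPE declaration.
        System.out.println(looksLikePublicId("-//W3C//DTD XSD 1.1//EN"));
        // An ordinary system identifier.
        System.out.println(looksLikePublicId("http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd"));
    }
}
```

A resolver wrapper could use a check like this to route public IDs to a catalog lookup rather than trying to parse them as URIs.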

Actions #4

Updated by Michael Kay almost 3 years ago

I've added unit tests on the Java side that use all three mechanisms (directly supplying a URI in a StreamSource, or getting an XdmNode via a DocumentBuilder or via an XQuery), and all three work correctly.

So how does the Java code differ from the C# code? They diverge at SchemaReader.sendSchemaSource(), which in SaxonCS is converted to a call on Sender.send(). But the divergent Java code is only setting up an XMLReader.

The real difference is that the StreamSource (containing only a URI) is passed to Platform.resolveSource(), which has different implementations for SaxonJ and SaxonCS. On the SaxonCS side, it directly resolves the URI using WebClient.OpenRead(). On Java, it constructs an ActiveStreamSource that wraps the StreamSource, and passes this to the XMLReader. I'm finding it difficult to work out how the XMLReader handles it; I don't see any callback to the CatalogResolver, and yet it's coming back too quickly to have been out to the web. HTTP monitoring using Charles suggests that there is a call out to www.w3.org, but I can't see that it's requesting this URL.

Back over on the .NET side, it's very clear in Charles that we're making a real HTTP request to retrieve http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd - not the original schema document, but the DTD that it references.

Actions #5

Updated by Martin Honnen over 2 years ago

So what is the final resolution here for the .NET/C# side? Shouldn't Saxon get that http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd from its XmlResolverData.dll cache instead of trying to pull from W3C?

Actions #6

Updated by Martin Honnen over 2 years ago

Probably related, using SaxonCS from the command line to try to validate the XSD 1.1 schema against the XSD 1.1 schema fails:

 & 'C:\Program Files\Saxonica\SaxonCS-11.3\SaxonCS.exe' validate -t -s:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -xsd:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd
SaxonCS-EE 11.3 from Saxonica
.NET 5.0.9 on Windows 10.0.22000.0
Using license serial number ...
URIResolver for schema file must return a Source
Exiting with code 2

When I use a .NET 6 wrapper command tool that does nothing more than calling Saxon.Cmd.Command.Main(args.Prepend("validate").ToArray()); I interestingly enough get a different error:

saxonvalidate -s:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -xsd:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -t
SaxonCS-EE 11.3 from Saxonica
.NET 6.0.6 on Windows 10.0.22000.0
Using license serial number ...
Loading schema document https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd
: Cannot resolve external DTD subset - public ID = '-//W3C//DTD XSD 1.1//EN', system ID = 'XMLSchema.dtd'.
Fatal error during validation: : Cannot resolve external DTD subset - public ID = '-//W3C//DTD XSD 1.1//EN', system ID = 'XMLSchema.dtd'.
Exiting with code 2

So it seems the XML resolver / resolver cache fails. Norm Tovey-Walsh, any idea on that?

Actions #7

Updated by Michael Kay over 2 years ago

  • Tracker changed from Support to Bug

I'm elevating this to a Bug. We have a SaxonCS unit test TestSchemaValidator.testSchemaForXslt30 which is failing (or running extremely slowly) as a result of this problem.

Actions #8

Updated by Michael Kay over 2 years ago

I'm investigating the unit test TestSchemaValidator.testSchemaForXslt30, which attempts to load the XSLT30 schema using a doc() call issued from an XQuery (and fails).

The XSLT30 schema URI is being successfully resolved by the catalog resolver, and returns with a ManifestResourceStream delivering the content from the XmlResolverData. We then attempt to parse this stream, using an XmlTextReader in which the XmlResolver is set to an instance of XmlResolverWrappingResourceResolver.

We see a callback from the XmlTextReader to this XmlResolver - it's calling ResolveUri() with a baseUri of null and a relativeUri of .../XMLSchema.xsd originating internally from DtdParserProxy.get_DtdParserProxy_BaseUri(). This successfully returns the URI unchanged.

Next we see a call on ResolveUri with baseUri equal to the .../XMLSchema.xsd URI and relativeUri being the public ID -//W3C//DTD XSD 1.1//EN. We detect this as a Public ID and pass it back unchanged as a Uri object.

Next we see a call on GetEntity with absoluteUri being the public ID -//W3C//DTD XSD 1.1//EN. We call the catalog resolver which looks this up and sets uri2=pack://application:,,,XmlResolverData;0.2.0.0;component/www_w3_org.2009.XMLSchema.XMLSchema.dtd. The resolvedResourceImpl is null, so we return null from the GetEntity() call.

Next we see a call on ResolveUri with baseUri being http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd and relativeUri being XMLSchema.dtd. This correctly returns http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd. It appears the XmlTextReader attempts to fetch this URI itself without a further call to GetEntity().
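
The last ResolveUri step is ordinary relative-reference resolution. As a rough illustration in Java (java.net.URI stands in here for the .NET Uri handling; the class name is hypothetical):

```java
import java.net.URI;

final class DtdUriResolution {

    /** Resolves a relative reference against a base URI using RFC 3986 style rules. */
    static URI resolve(String base, String relative) {
        return URI.create(base).resolve(relative);
    }

    public static void main(String[] args) {
        // Base: the schema document; relative: the system ID from its DOCTYPE.
        System.out.println(resolve(
                "http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd",
                "XMLSchema.dtd"));
    }
}
```

The result is the DTD URI that the reader then tries to fetch over HTTP.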

Actions #9

Updated by Norm Tovey-Walsh over 2 years ago

If I'm understanding correctly,

Next we see a call on GetEntity with absoluteUri being the public ID -//W3C//DTD XSD 1.1//EN. We call the catalog resolver which looks this up and sets uri2=pack://application:,,,XmlResolverData;0.2.0.0;component/www_w3_org.2009.XMLSchema.XMLSchema.dtd. The resolvedResourceImpl is null, so we return null from the GetEntity() call.

This is the part that sounds like a bug. If we've worked out that the dtd is in a pack:// URI, then why is resolvedResourceImpl returned null, I wonder? I'll have to build CS and see if I can get it running under the debugger...
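
To make the expected behaviour concrete, here is a hedged Java sketch using only the JDK's SAX API: an entity resolver that answers the public ID from local data, so the parser never needs to dereference the system ID. The map contents and class name are hypothetical stand-ins for the resolver's bundled data, not the XML Resolver implementation.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class LocalDtdResolverSketch {

    // Hypothetical local data: public ID -> DTD text (a stub, not the real DTD).
    static final Map<String, String> LOCAL_DTDS =
            Map.of("-//W3C//DTD XSD 1.1//EN", "<!-- local stub for XMLSchema.dtd -->");

    // Counts how often the resolver answered from local data.
    static int hits = 0;

    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        String dtd = (publicId == null) ? null : LOCAL_DTDS.get(publicId);
        if (dtd == null) {
            return null; // null tells the parser to open the system ID itself
        }
        hits++;
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    };

    /** Parses the document, resolving external entities through RESOLVER. */
    static void parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(RESOLVER);
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        parse("<!DOCTYPE xs:schema PUBLIC \"-//W3C//DTD XSD 1.1//EN\" \"XMLSchema.dtd\">"
                + "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"/>");
        System.out.println("local DTD lookups: " + hits);
    }
}
```

Returning null from the resolver is what sends the parser off to fetch the system ID itself, which matches the symptom described above.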

Actions #10

Updated by Norm Tovey-Walsh over 2 years ago

Yes, it appears that http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd isn't in the data assembly.

Actions #11

Updated by Norm Tovey-Walsh over 2 years ago

Wow, the W3C site seems to be very confused about what is available and where wrt schema validation.

https://www.w3.org/TR/xmlschema-2/datatypes.xsd

Isn't an XSD file, it's something wrapped in a pre. Even if you unwrapped the pre it would be wrong because the XML declaration is after the DTD fragment.

Actions #12

Updated by Michael Kay over 2 years ago

That's how it appears in Safari. But if you look at it using curl, it starts

<?xml version='1.0'?>

<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XSD 1.1//EN" "XMLSchema.dtd" [

<!-- provide ID type information even for parsers which only read the
     internal subset -->
<!ATTLIST xs:schema          id  ID  #IMPLIED>
<!ATTLIST xs:complexType     id  ID  #IMPLIED>

Actions #13

Updated by Michael Kay over 2 years ago

Oh, sorry, that was

https://www.w3.org/TR/xmlschema11-2/XMLSchema.xsd

But the 1.0 version is very similar in curl:

<?xml version='1.0' encoding='UTF-8'?>
<!-- XML Schema schema for XML Schemas: Part 1: Structures -->
<!-- Note this schema is NOT the normative structures schema. -->
<!-- The prose copy in the structures REC is the normative -->
<!-- version (which shouldn't differ from this one except for -->
<!-- this comment and entity expansions, but just in case -->
<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN" "XMLSchema.dtd" [

<!-- provide ID type information even for parsers which only read the
     internal subset -->
<!ATTLIST xs:schema          id  ID  #IMPLIED>
<!ATTLIST xs:complexType     id  ID  #IMPLIED>
<!ATTLIST xs:complexContent  id  ID  #IMPLIED>

Actions #14

Updated by Norm Tovey-Walsh over 2 years ago

XMLSchema.xsd is fine, but datatypes.xsd, that's another story:

$ curl -s https://www.w3.org/TR/xmlschema-2/datatypes.xsd | head
<pre><![CDATA[<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN" "XMLSchema.dtd" [

<!--
     keep this schema XML1.0 DTD valid
  -->
        <!ENTITY % schemaAttrs 'xmlns:hfp CDATA #IMPLIED'>

        <!ELEMENT hfp:hasFacet EMPTY>
        <!ATTLIST hfp:hasFacet
                name NMTOKEN #REQUIRED>
...

Note the initial <pre>.

FYI: I have reported this to webreq@w3.org.

Actions #15

Updated by Norm Tovey-Walsh over 2 years ago

The root cause of the slowness was the fact that the W3C XML Schema DTDs are being retrieved from an unofficial location that I didn't know existed. I've updated the XML Resolver data jar and assembly to correct the oversight. If you update the NuGet dependency for XmlResolverData to 1.2.0, I believe the DTD will be accessed without dereferencing from the W3C site.

Actions #16

Updated by Martin Honnen over 2 years ago

Hi Norm,

for me, with SaxonCS 11.3, .NET 6 and XmlResolverData 1.2 the code

using Saxon.Api;

Processor processor = new Processor(true);

processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));

still hangs for quite some time and then throws an exception

Saxon.Api.SaxonApiException
  HResult=0x80131500
  Message=: Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
  Source=SaxonCS
  StackTrace:
   at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
   at Program.<Main>$(String[] args) in C:\SomePath\SaxonCSCompileXSLT30Schema\SaxonCSCompileXSLT30Schema\Program.cs:line 5

Actions #17

Updated by Norm Tovey-Walsh over 2 years ago

Okay. I'll investigate. I was following Mike's lead on an existing test case which definitely did get faster.

Actions #18

Updated by Norm Tovey-Walsh over 2 years ago

AFAICT (and that might not be very far),

processor.SchemaManager.Compile(someURI)

makes no effort to resolve the resource through the resolver.

  1. Compile() creates a StreamSource directly from the URI
  2. We go on a long journey: SchemaManagerImpl.load(), EnterpriseConfiguration.addSchemaSource(), SchemaReader.read(), SchemaReader.buildSchemaDocument(), SchemaReader.sendSchemaSource(), Sender.send(), ProfessionalConfiguration.resolveSource(), Configuration.resolveSource(), DotNetPlatform.resolveSource()
  3. In DotNetPlatform.resolveSource(), if the source is a StreamSource, we call getInputStream() on it.

So this code path, unlike the code path in TestSchemaValidator.testSchemaForXslt30, just never tries to resolve it through the catalog resolver.

Actions #19

Updated by Norm Tovey-Walsh over 2 years ago

  • Assignee set to Michael Kay

Actions #21

Updated by Michael Kay over 2 years ago

I've created a unit test that does

            Processor proc = new Processor(true);
            SchemaManager schemaManager = proc.getSchemaManager();
            schemaManager.load(new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
            SchemaValidator validator = schemaManager.newSchemaValidator();
            validator.validate(new StreamSource(new File(configTest.getDataDir(), "books.xsl")));

in both Java and C# versions.

In both cases I can't see any attempt to resolve the initial URI using the catalog resolver. I'm monitoring using Charles and in both cases there appears to be a request to www.w3.org, though the path doesn't seem to be shown (limitation of trial version perhaps?). The difference is that in the Java case the request comes back with 114.9Kb after 1400ms, while in the C# case it comes back with 5.85Kb after 99985ms.

Actions #22

Updated by Michael Kay over 2 years ago

In SaxonJ, following the call to Platform.resolveSource(), we eventually end up with an ActiveSAXSource, containing an InputSource containing only the URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd. We pass this InputSource to parser.parse(), and it comes back with the result after 1.5 seconds.

The xs:import of the second schema document, http://www.w3.org/2001/XMLSchema, is handled by the StandardSchemaResolver, and this successfully invokes the catalog resolver.

In SaxonCS, the call to Platform.resolveSource() invokes new WebClient().OpenRead(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd")) and this times out. I can't see why there should be a difference.

Actions #23

Updated by Michael Kay over 2 years ago

In SaxonJ I have changed SchemaManager.load() so that if the supplied Source is a StreamSource with no InputStream or Reader, then the systemId is resolved using the configuration-level resource resolver (which by default uses the catalog). It doesn't use the SchemaURIResolver.

In consequence, in SaxonCS, SchemaManager.Compile(Uri) does the same.
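
The shape of that change can be sketched in plain Java roughly as follows. This is an illustration under assumptions, not the Saxon source: LocalResolver is a hypothetical stand-in for the configuration-level resource resolver, and resolveIfBare is an invented helper name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Optional;
import javax.xml.transform.stream.StreamSource;

final class SourceResolutionSketch {

    /** Hypothetical stand-in for a catalog-backed resource resolver. */
    interface LocalResolver {
        Optional<InputStream> resolve(String systemId);
    }

    /**
     * If the source carries only a system ID (no stream, no reader),
     * ask the resolver first; otherwise return the source untouched.
     */
    static StreamSource resolveIfBare(StreamSource source, LocalResolver resolver) {
        boolean bare = source.getInputStream() == null
                && source.getReader() == null
                && source.getSystemId() != null;
        if (!bare) {
            return source;
        }
        Optional<InputStream> local = resolver.resolve(source.getSystemId());
        if (!local.isPresent()) {
            return source; // not in the catalog: the caller dereferences the URI as before
        }
        // Keep the original system ID so relative references still resolve against it.
        return new StreamSource(local.get(), source.getSystemId());
    }

    public static void main(String[] args) {
        StreamSource bare = new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd");
        // Toy resolver that pretends every URI is available locally.
        StreamSource resolved = resolveIfBare(bare,
                id -> Optional.of(new ByteArrayInputStream(new byte[0])));
        System.out.println(resolved.getInputStream() != null);
    }
}
```

Keeping the original system ID on the resolved source matters because later xs:import and DTD references are resolved relative to it.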

Actions #24

Updated by Michael Kay over 2 years ago

  • Status changed from New to Resolved
  • Applies to branch trunk added
  • Fix Committed on Branch 11, trunk added
  • Platforms Java added

Actions #25

Updated by Debbie Lockett over 2 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 11.4 added

Bug fix applied in the Saxon 11.4 maintenance release.
