Bug #5336: How to load/compile XSD 1.1 schema for XSLT 3 with SaxonCS 11.1?
Status: closed (100% done)
Description
With SaxonJ EE 11.1 I can run
Processor processor = new Processor(true);
processor.getSchemaManager().load(new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
without problems, and it takes such a short amount of time that I am sure the schema is loaded through the XmlResolver from its XmlResolverData.
With SaxonCS 11.1, however, the similar code
Processor processor = new Processor(true);
processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
gives
Saxon.Api.SaxonApiException
HResult=0x80131500
Message=: Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
Source=SaxonCS
StackTrace:
at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
at Saxon.Api.SchemaManager.Compile(Uri uri)
at SaxonCSCompileXSLT3SchemaTest.Program.Main(String[] args) in C:\SomeDir\SaxonCSCompileXSLT3SchemaTest\Program.cs:line 14
and takes a lot of time to give that error so I assume it might try to download the file from the W3C server.
Even when trying to set
Processor processor = new Processor(true);
(processor.Implementation.getConfigurationProperty(FeatureKeys.RESOURCE_RESOLVER) as CatalogResourceResolver).setFeature(ResolverFeature.URI_FOR_SYSTEM, true);
processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
it takes a lot of time and gives the same error:
Saxon.Api.SaxonApiException
HResult=0x80131500
Message=: Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
Source=SaxonCS
StackTrace:
at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
at Saxon.Api.SchemaManager.Compile(Uri uri)
at SaxonCSCompileXSLT3SchemaTest.Program.Main(String[] args) in C:\SomeDir\SaxonCSCompileXSLT3SchemaTest\Program.cs:line 16
So how do I get SaxonCS 11.1 to load/compile the schema for XSLT 3.0 and hopefully have it load the XSD using the catalog and its data dll?
Updated by Michael Kay almost 3 years ago
I've added this as a unit test.
The method SchemaManager.Compile(Uri) hasn't been updated to use the new Resolver infrastructure.
Also it would be more consistent for SchemaManager to have a ResourceResolver property that can be set, rather than relying on the legacy SchemaResolver.
Updated by Michael Kay almost 3 years ago
I tried a workaround using DocumentBuilder.Build(uri) followed by SchemaManager.Compile(XdmNode), but DocumentBuilder.Build(uri) suffers the same problem.
I can retrieve the document from the local cache using
XQueryCompiler compiler = proc.NewXQueryCompiler();
XQueryEvaluator eval = compiler.Compile("doc('https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd')").Load();
XdmNode schema = (XdmNode)eval.Evaluate();
But I still get some kind of failure when doing Compile(XdmNode) on the result.
Updated by Michael Kay almost 3 years ago
This latest problem is the XmlResolverWrappingResourceResolver class throwing "invalid URI" when the XmlReader passes it the public ID -//W3C//DTD XSD 1.1//EN. This path doesn't seem to include the usual hack of recognising this as a Public ID rather than a URI.
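To make the "Public ID rather than a URI" distinction concrete, here is a minimal Java sketch of such a check. It is not the code Saxon uses: the class name PublicIdCheck and the heuristic (a formal public identifier starts with "-//" or "+//") are assumptions made purely for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not Saxon's implementation.
final class PublicIdCheck {

    // Assumed heuristic: formal public identifiers such as
    // "-//W3C//DTD XSD 1.1//EN" begin with "-//" or "+//".
    static boolean looksLikePublicId(String s) {
        return s.startsWith("-//") || s.startsWith("+//");
    }

    // True only if the string is not a public ID and parses as an absolute URI.
    static boolean isAbsoluteUri(String s) {
        if (looksLikePublicId(s)) {
            return false; // never hand a public ID to the URI parser
        }
        try {
            return new URI(s).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikePublicId("-//W3C//DTD XSD 1.1//EN"));
        System.out.println(isAbsoluteUri("http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd"));
    }
}
```

A resolver wrapper could run a check of this kind before attempting to build a URI object, so that a public ID is routed to the catalog lookup instead of failing as an invalid URI.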
Updated by Michael Kay almost 3 years ago
I've added unit tests on the Java side that use all three mechanisms (directly supplying a URI in a StreamSource, or getting an XdmNode via a DocumentBuilder or via an XQuery), and all three work correctly.
So how does the Java code differ from the C# code? They diverge at SchemaReader.sendSchemaSource(), which in SaxonCS is converted to a call on Sender.send(). But the divergent Java code is only setting up an XMLReader.
The real difference is that the StreamSource (containing only a URI) is passed to Platform.resolveSource(), which has different implementations for SaxonJ and SaxonCS. On the SaxonCS side, it directly resolves the URI using WebClient.OpenRead(). On Java, it constructs an ActiveStreamSource that wraps the StreamSource, and passes this to the XMLReader. I'm finding it difficult to work out how the XMLReader handles it; I don't see any callback to the CatalogResolver, and yet it's coming back too quickly to have been out to the web. HTTP monitoring using Charles suggests that there is a call out to www.w3.org, but I can't see that it's requesting this URL.
Back over on the .NET side, it's very clear in Charles that we're making a real HTTP request to retrieve http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd - not the original schema document, but the DTD that it references.
Updated by Martin Honnen over 2 years ago
So what is the final resolution here for the .NET/C# side? Shouldn't Saxon get that http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd from its XmlResolverData.dll cache instead of trying to pull it from W3C?
Updated by Martin Honnen over 2 years ago
Probably related, using SaxonCS from the command line to try to validate the XSD 1.1 schema against the XSD 1.1 schema fails:
& 'C:\Program Files\Saxonica\SaxonCS-11.3\SaxonCS.exe' validate -t -s:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -xsd:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd
SaxonCS-EE 11.3 from Saxonica
.NET 5.0.9 on Windows 10.0.22000.0
Using license serial number ...
URIResolver for schema file must return a Source
Exiting with code 2
When I use a .NET 6 wrapper command-line tool that does nothing more than call Saxon.Cmd.Command.Main(args.Prepend("validate").ToArray()), I get, interestingly enough, a different error:
saxonvalidate -s:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -xsd:https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd -t
SaxonCS-EE 11.3 from Saxonica
.NET 6.0.6 on Windows 10.0.22000.0
Using license serial number ...
Loading schema document https://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd
: Cannot resolve external DTD subset - public ID = '-//W3C//DTD XSD 1.1//EN', system ID = 'XMLSchema.dtd'.
Fatal error during validation: : Cannot resolve external DTD subset - public ID = '-//W3C//DTD XSD 1.1//EN', system ID = 'XMLSchema.dtd'.
Exiting with code 2
So it seems the XML resolver / resolver cache fails. Norm Tovey-Walsh, any idea on that?
Updated by Michael Kay over 2 years ago
- Tracker changed from Support to Bug
I'm elevating this to a Bug. We have a SaxonCS unit test TestSchemaValidator.testSchemaForXslt30 which is failing (or running extremely slowly) as a result of this problem.
Updated by Michael Kay over 2 years ago
I'm investigating the unit test TestSchemaValidator.testSchemaForXslt30, which attempts to load the XSLT30 schema using a doc() call issued from an XQuery (and fails).
The XSLT30 schema URI is being successfully resolved by the catalog resolver, and returns with a ManifestResourceStream delivering the content from the XmlResolverData. We then attempt to parse this stream, using an XmlTextReader in which the XmlResolver is set to an instance of XmlResolverWrappingResourceResolver.
We see a callback from the XmlTextReader to this XmlResolver: it's calling ResolveUri() with a baseUri of null and a relativeUri of .../XMLSchema.xsd, originating internally from DtdParserProxy.get_DtdParserProxy_BaseUri(). This successfully returns the URI unchanged.
Next we see a call on ResolveUri() with baseUri equal to the .../XMLSchema.xsd URI and relativeUri being the public ID -//W3C//DTD XSD 1.1//EN. We detect this as a Public ID and pass it back unchanged as a Uri object.
Next we see a call on GetEntity() with absoluteUri being the public ID -//W3C//DTD XSD 1.1//EN. We call the catalog resolver, which looks this up and sets uri2=pack://application:,,,XmlResolverData;0.2.0.0;component/www_w3_org.2009.XMLSchema.XMLSchema.dtd. The resolvedResourceImpl is null, so we return null from the GetEntity() call.
Next we see a call on ResolveUri() with baseUri being http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd and relativeUri being XMLSchema.dtd. This correctly returns http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd. It appears the XmlTextReader attempts to fetch this URI itself without a further call to GetEntity().
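For readers more familiar with the Java side, the same callback pattern can be sketched with the JDK's SAX EntityResolver, which plays roughly the role that XmlResolver.GetEntity() plays in .NET. This is only an analogy under stated assumptions, not SaxonCS or SaxonJ code: the class LocalDtdDemo, the one-line stand-in DTD, and the tiny test document are all hypothetical. The point it illustrates is that returning null from the resolver leaves the parser to fetch the resolved system ID itself, while returning local content keeps the parse off the network.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a Java analogy for the .NET callbacks described above.
final class LocalDtdDemo {
    static final String PUBLIC_ID = "-//W3C//DTD XSD 1.1//EN";
    static final String BASE_URI = "http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd";

    // Resolve a relative system ID against the base URI of the referring document.
    static String resolve(String baseUri, String relative) {
        return URI.create(baseUri).resolve(relative).toString();
    }

    // Parses a tiny document whose DOCTYPE mimics the one in XMLSchema.xsd and
    // returns the "publicId | systemId" pairs that reached the entity resolver.
    static List<String> parseWithLocalDtd() {
        List<String> seen = new ArrayList<>();
        String doc = "<!DOCTYPE root PUBLIC \"" + PUBLIC_ID + "\" \"XMLSchema.dtd\"><root/>";
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setEntityResolver((publicId, systemId) -> {
                seen.add(publicId + " | " + systemId);
                if (PUBLIC_ID.equals(publicId)) {
                    // Stand-in for the cached DTD; a real resolver would open the catalog entry.
                    return new InputSource(new StringReader("<!ELEMENT root EMPTY>"));
                }
                return null; // null means: let the parser fetch the system ID itself
            });
            InputSource in = new InputSource(new StringReader(doc));
            in.setSystemId(BASE_URI); // base URI for resolving the relative system ID
            reader.parse(in);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(resolve(BASE_URI, "XMLSchema.dtd"));
        System.out.println(parseWithLocalDtd());
    }
}
```

In this sketch the resolver answers for the public ID, so the parser never needs to dereference the system ID. The sequence described above corresponds to the branch that returns null.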
Updated by Norm Tovey-Walsh over 2 years ago
If I'm understanding correctly,
"Next we see a call on GetEntity with absoluteUri being the public ID -//W3C//DTD XSD 1.1//EN. We call the catalog resolver which looks this up and sets uri2=pack://application:,,,XmlResolverData;0.2.0.0;component/www_w3_org.2009.XMLSchema.XMLSchema.dtd. The resolvedResourceImpl is null, so we return null from the GetEntity() call."
This is the part that sounds like a bug. If we've worked out that the DTD is in a pack:// URI, then why is resolvedResourceImpl returned null, I wonder? I'll have to build CS and see if I can get it running under the debugger...
Updated by Norm Tovey-Walsh over 2 years ago
Yes, it appears that http://www.w3.org/TR/xmlschema11-1/XMLSchema.dtd isn't in the data assembly.
Updated by Norm Tovey-Walsh over 2 years ago
Wow, the W3C site seems to be very confused about what is available and where wrt schema validation.
https://www.w3.org/TR/xmlschema-2/datatypes.xsd
isn't an XSD file, it's something wrapped in a pre. Even if you unwrapped the pre it would be wrong, because the XML declaration is after the DTD fragment.
Updated by Michael Kay over 2 years ago
That's how it appears in Safari. But if you look at it using curl, it starts
<?xml version='1.0'?>
<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XSD 1.1//EN" "XMLSchema.dtd" [
<!-- provide ID type information even for parsers which only read the
internal subset -->
<!ATTLIST xs:schema id ID #IMPLIED>
<!ATTLIST xs:complexType id ID #IMPLIED>
Updated by Michael Kay over 2 years ago
Oh, sorry, that was
https://www.w3.org/TR/xmlschema11-2/XMLSchema.xsd
But the 1.0 version is very similar in curl:
<?xml version='1.0' encoding='UTF-8'?>
<!-- XML Schema schema for XML Schemas: Part 1: Structures -->
<!-- Note this schema is NOT the normative structures schema. -->
<!-- The prose copy in the structures REC is the normative -->
<!-- version (which shouldn't differ from this one except for -->
<!-- this comment and entity expansions, but just in case -->
<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN" "XMLSchema.dtd" [
<!-- provide ID type information even for parsers which only read the
internal subset -->
<!ATTLIST xs:schema id ID #IMPLIED>
<!ATTLIST xs:complexType id ID #IMPLIED>
<!ATTLIST xs:complexContent id ID #IMPLIED>
Updated by Norm Tovey-Walsh over 2 years ago
XMLSchema.xsd is fine, but datatypes.xsd, that's another story:
$ curl -s https://www.w3.org/TR/xmlschema-2/datatypes.xsd | head
<pre><![CDATA[<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSCHEMA 200102//EN" "XMLSchema.dtd" [
<!--
keep this schema XML1.0 DTD valid
-->
<!ENTITY % schemaAttrs 'xmlns:hfp CDATA #IMPLIED'>
<!ELEMENT hfp:hasFacet EMPTY>
<!ATTLIST hfp:hasFacet
name NMTOKEN #REQUIRED>
...
Note the initial <pre>.
FYI: I have reported this to webreq@w3.org.
Updated by Norm Tovey-Walsh over 2 years ago
The root cause of the slowness was the fact that the W3C XML Schema DTDs are being retrieved from an unofficial location that I didn't know existed. I've updated the XML Resolver data jar and assembly to correct the oversight. If you update the NuGet dependency for XmlResolverData to 1.2.0, I believe the DTD will be accessed without dereferencing from the W3C site.
Updated by Martin Honnen over 2 years ago
Hi Norm,
for me, with SaxonCS 11.3, .NET 6 and XmlResolverData 1.2 the code
using Saxon.Api;
Processor processor = new Processor(true);
processor.SchemaManager.Compile(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
still hangs for quite some time and then throws an exception
Saxon.Api.SaxonApiException
HResult=0x80131500
Message = : Unable to retrieve URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd
Source = SaxonCS
StackTrace:
at Saxon.Eej.ee.s9api.SchemaManagerImpl.load(Source source)
at Program.<Main>$(String[] args) in C:\SomePath\SaxonCSCompileXSLT30Schema\SaxonCSCompileXSLT30Schema\Program.cs:line 5
Updated by Norm Tovey-Walsh over 2 years ago
Okay. I'll investigate. I was following Mike's lead on an existing test case which definitely did get faster.
Updated by Norm Tovey-Walsh over 2 years ago
AFAICT (and that might not be very far), processor.SchemaManager.Compile(someURI) makes no effort to resolve the resource through the resolver.
- Compile() creates a StreamSource directly from the URI.
- We go on a long journey: SchemaManagerImpl.load(), EnterpriseConfiguration.addSchemaSource(), SchemaReader.read(), SchemaReader.buildSchemaDocument(), SchemaReader.sendSchemaSource(), Sender.send(), ProfessionalConfiguration.resolveSource(), Configuration.resolveSource(), DotNetPlatform.resolveSource().
- In DotNetPlatform.resolveSource(), if the source is a StreamSource, we call getInputStream() on it.
So this code path, unlike the code path in TestSchemaValidator.testSchemaForXslt30, just never tries to resolve it through the catalog resolver.
Updated by Michael Kay over 2 years ago
I've created a unit test that does
Processor proc = new Processor(true);
SchemaManager schemaManager = proc.getSchemaManager();
schemaManager.load(new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd"));
SchemaValidator validator = schemaManager.newSchemaValidator();
validator.validate(new StreamSource(new File(configTest.getDataDir(), "books.xsl")));
in both Java and C# versions.
In both cases I can't see any attempt to resolve the initial URI using the catalog resolver. I'm monitoring using Charles and in both cases there appears to be a request to www.w3.org, though the path doesn't seem to be shown (limitation of trial version perhaps?). The difference is that in the Java case the request comes back with 114.9Kb after 1400ms, while in the C# case it comes back with 5.85Kb after 99985ms.
Updated by Michael Kay over 2 years ago
In SaxonJ, following the call to Platform.resolveSource(), we eventually end up with an ActiveSAXSource, containing an InputSource containing only the URI https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd. We pass this InputSource to parser.parse(), and it comes back with the result after 1.5 seconds.
The xs:import of the second schema document, http://www.w3.org/2001/XMLSchema, is handled by the StandardSchemaResolver, and this successfully invokes the catalog resolver.
In SaxonCS, the call to Platform.resolveSource() invokes new WebClient().OpenRead(new Uri("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd")) and this times out. I can't see why there should be a difference.
Updated by Michael Kay over 2 years ago
In SaxonJ I have changed SchemaManager.load() so if the supplied Source is a StreamSource with no InputStream or Reader, then the systemId is resolved using the configuration-level resource resolver (which by default uses the catalog). It doesn't use the SchemaURIResolver.
In consequence, in SaxonCS, SchemaManager.Compile(Uri) does the same.
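The rule described here (a StreamSource that carries only a systemId gets resolved before any network access) can be sketched as follows. This is a simplified illustration, not the Saxon change itself: the class SchemaSourceSketch, the method resolveIfBare, and the in-memory map standing in for the catalog are all hypothetical, and the cached content is a placeholder rather than the real schema.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch; the real fix uses Saxon's configuration-level resource resolver.
final class SchemaSourceSketch {

    // Stand-in for a catalog: maps system IDs to locally cached content (placeholder text).
    static final Map<String, String> LOCAL_CACHE = Map.of(
        "https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd",
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>");

    // If the source has no InputStream and no Reader, try the local cache first.
    // Otherwise, or if the cache has no entry, return the source unchanged.
    static StreamSource resolveIfBare(StreamSource source) {
        boolean bare = source.getInputStream() == null && source.getReader() == null;
        String systemId = source.getSystemId();
        if (bare && systemId != null && LOCAL_CACHE.containsKey(systemId)) {
            InputStream in = new ByteArrayInputStream(
                LOCAL_CACHE.get(systemId).getBytes(StandardCharsets.UTF_8));
            return new StreamSource(in, systemId); // keep the systemId as the base URI
        }
        return source;
    }

    public static void main(String[] args) {
        StreamSource bare = new StreamSource("https://www.w3.org/TR/xslt-30/schema-for-xslt30.xsd");
        System.out.println(resolveIfBare(bare).getInputStream() != null);
    }
}
```

Keeping the original systemId on the resolved source matters because relative references inside the schema (xs:import, xs:include, DTD system IDs) are resolved against it.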
Updated by Michael Kay over 2 years ago
- Status changed from New to Resolved
- Applies to branch trunk added
- Fix Committed on Branch 11, trunk added
- Platforms Java added
Updated by Debbie Lockett over 2 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 11.4 added
Bug fix applied in the Saxon 11.4 maintenance release.