Bug #5580
Saxon-PE 11.3 fails at resolving external entity
Status: Closed
Description
I'm testing Saxon-PE 10.8 and 11.3 for a user project. When I convert an XML file using 10.8, it works without any problem. However, 11.3 reports java.net.URISyntaxException. Here is the screenshot. The test was done on Windows 10 with PowerShell.
It seems that the Windows path notation "..\master\glossary\gls.ent" in ahfsm-custom.ent is not handled properly. I attached the ZIP data archive.
diff-2022-06-23.zip
Steps to reproduce:
- Unzip diff-2022-06-23.zip
- Set the JDK path and Saxon-PE path in xmllist/test-pe-10.8.ps1 and test-pe-11.3.ps1
- In the xmllist folder, open PowerShell
- Enter the command "./test-pe-10.8.ps1". This command ends normally.
- Enter the command "./test-pe-11.3.ps1". This command ends with an exception.
I hope this helps to fix the 11.3 problem.
Updated by Michael Kay over 2 years ago
- Category set to Third-party product
- Assignee set to Norm Tovey-Walsh
- Applies to branch 11, trunk added
Updated by Norm Tovey-Walsh over 2 years ago
Thanks for the test case. I’ll take a look. I’m suspicious that there’s
an issue with the local encoding. From the screen shot, apparently
“..\master\glossary\gls.ent” is rendered by Windows as
“..¥master¥glossary¥gls.ent” which is a little worrisome.
Be seeing you,
norm
--
Norm Tovey-Walsh
Saxonica
Updated by Toshihiko Makita over 2 years ago
Thank you for your notification.
I’m suspicious that there’s an issue with the local encoding.
It is a known problem specific to the Japanese fonts used for display on Windows.
We (Japanese) are quite accustomed to this text output. But people outside Japan will worry about the encoding.
I hope this helps your understanding.
Regards,
Updated by Norm Tovey-Walsh over 2 years ago
On closer inspection, I can see what the problem is. In XML, system identifiers aren't filenames, they're URIs. Unescaped backslashes are not valid characters in a URI.
In Saxon 11, we updated the XML resolver used in the product and made catalogs available by default. The new resolver treats the URIs as java.net.URI objects where the old resolver just carried them around as strings. Since the Java URI class is enforcing constraints imposed by the URI specification, it's not clear that there's much we can change.
The easiest workaround is to replace the "\" characters in your system identifiers with either "/" characters or encoded backslashes, "%5C".
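The rejection and both workarounds can be reproduced with the JDK alone. The following is a minimal sketch, not Saxon or XML Resolver code; the class name `SystemIdCheck` and the helper `isValidUri` are hypothetical, and the path is the one from the report.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only.
final class SystemIdCheck {

    /** Returns true if java.net.URI accepts the string as a URI reference. */
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            // An unescaped backslash ends up here ("Illegal character ...").
            return false;
        }
    }

    public static void main(String[] args) {
        String original = "..\\master\\glossary\\gls.ent";

        // The system identifier as written by the authoring tool.
        System.out.println(isValidUri(original));                      // false

        // Workaround 1: forward slashes.
        System.out.println(isValidUri(original.replace('\\', '/')));   // true

        // Workaround 2: percent-encoded backslashes.
        System.out.println(isValidUri(original.replace("\\", "%5C"))); // true
    }
}
```

Note that the two workarounds are not equivalent as URIs: the first changes the path structure to use "/" separators, while the second keeps the backslashes as encoded data within a single path segment.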
If neither of those workarounds is practical, let me know and I'll try to think of another answer.
Updated by Norm Tovey-Walsh over 2 years ago
- Status changed from New to AwaitingInfo
- Priority changed from Low to Normal
Updated by Toshihiko Makita over 2 years ago
In XML, system identifiers aren't filenames, they're URIs.
You are absolutely right. The problem is Adobe FrameMaker. The relevant user has been using Adobe FrameMaker for over 20 years. As a result, there are tons of these path notations in the CMS. So it is very difficult to tell the user that this notation is not a valid URI. This is the biggest headache for me.
Updated by Norm Tovey-Walsh over 2 years ago
- Status changed from AwaitingInfo to In Progress
Okay. I'll add a feature to the XML Resolver to fix this problem. I won't be surprised if it happens to other users as well.
Updated by Norm Tovey-Walsh over 2 years ago
- Status changed from In Progress to Resolved
I've published XML Resolver 4.4.0 which includes an option to address this problem. Use the "FIX_WINDOWS_SYSTEM_IDENTIFIERS" feature.
For example, you can set it with a system property:
java "-Dxml.catalog.fixWindowsSystemIdentifiers=true" -cp ...
You can also set it in a configuration file or via the API, depending on what makes the most sense in your environment. You'll need to swap out the XML Resolver 4.2.0 library for the 4.4.0 version. Instructions about that are now on the Saxonica website: https://www.saxonica.com/html/documentation11/about/installationjava/jarfiles.html
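Where changing the java command line is impractical, the same system property can be set from application code. This is a minimal sketch under two assumptions: that the resolver reads `xml.catalog.fixWindowsSystemIdentifiers` when its configuration is created, so the call must run before any resolver or Saxon configuration is constructed, and that a JVM-wide setting is acceptable. The class and method names are hypothetical.

```java
// Hypothetical bootstrap helper, for illustration only.
final class EnableWindowsSystemIdFix {

    // Property name taken from the command-line example above.
    static final String PROPERTY = "xml.catalog.fixWindowsSystemIdentifiers";

    /** Sets the system property for the whole JVM. */
    static void enable() {
        System.setProperty(PROPERTY, "true");
    }

    public static void main(String[] args) {
        // Assumption: call this first, before the resolver is initialised.
        enable();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
        // ... continue with the normal application start-up here ...
    }
}
```

Because system properties are global to the JVM, this affects every resolver created afterwards in the same process, not just one transformation.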
Please let me know if you continue to have difficulty.
Updated by Toshihiko Makita over 2 years ago
Thank you for your quick fix!!! Much appreciated.
You'll need to swap out the XML Resolver 4.2.0 library for the 4.4.0 version.
Is the 4.4.0 version already published on the Saxonica website?
Updated by Norm Tovey-Walsh over 2 years ago
No, I hadn’t considered copying it to the Saxonica web site. You can get
it from Maven or from
https://github.com/xmlresolver/xmlresolver/releases/tag/4.4.0
Be seeing you,
norm
--
Norm Tovey-Walsh
Saxonica
Updated by Toshihiko Makita over 2 years ago
- File 2022-06-29-9.png 2022-06-29-9.png added
Thank you, it worked like a charm!
Updated by O'Neil Delpratt almost 2 years ago
- Status changed from Resolved to Closed
Closing this issue as it relates to the XML Resolver.
Updated by Stefan Krause over 1 year ago
»In XML, system identifiers aren't filenames, they're URIs.« I'm not sure about this. The specification (https://www.w3.org/TR/xml/#dt-sysid) says about system identifier
- »It is meant to be converted to a URI reference« and
- »SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")«.
In spite of the specification, a lot of software (including Saxon 9.9) consumes or produces XML with DOS/Windows system identifiers. I would recommend making "-Dxml.catalog.fixWindowsSystemIdentifiers=true" the default behaviour of xmlresolver and/or Saxon.
Updated by Michael Kay over 1 year ago
It's true that the specification says that the system identifier "is meant to be converted to a URI reference". But it also says how that conversion should be done, in the next paragraph: "System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource." So backslash should be converted to %5C, not to /.
However, the problem here is that sufficiently many popular software products have ignored the specification, so there is pressure on others to be bug-compatible. Our tendency is to push against that pressure because it leads to chaos and unpredictability, so our preference is generally to be strictly conformant by default, and allow deviations to be switched on as options.
The notion of "a string that can be turned into a valid URI by percent-encoding" is quite widespread across the family of XML specifications, but lacks a simple name. I like to call it a "wannabe URI".
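The conversion from a "wannabe URI" to a URI can be sketched as follows. This is an illustration of the escaping rule in section 4.2.2 of the XML 1.0 specification (control characters, space, the characters < > " { } | \ ^ ` and everything above #x7F are converted to UTF-8 bytes and written as %HH). It is not the code used by Saxon or the XML Resolver, and the class name `WannabeUri` is hypothetical.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class WannabeUri {

    // ASCII characters listed as disallowed in XML 1.0, section 4.2.2.
    private static final String DISALLOWED = " <>\"{}|\\^`";

    /** Percent-encodes the disallowed characters; everything else is kept. */
    static String escape(String systemId) {
        StringBuilder out = new StringBuilder();
        systemId.codePoints().forEach(cp -> {
            boolean disallowed =
                    cp <= 0x1F || cp >= 0x7F || DISALLOWED.indexOf(cp) >= 0;
            if (!disallowed) {
                // Includes '%', so existing escapes are left untouched.
                out.appendCodePoint(cp);
                return;
            }
            byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                out.append(String.format("%%%02X", b & 0xFF));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("..\\master\\glossary\\gls.ent");
        System.out.println(escaped); // ..%5Cmaster%5Cglossary%5Cgls.ent

        // The escaped form is accepted by java.net.URI.
        URI uri = URI.create(escaped);
        System.out.println(uri.getRawPath());
    }
}
```

The result keeps the backslashes as data inside one path segment. It does not turn them into "/" separators, which is why strict escaping alone does not make a Windows-style relative path resolve to the intended file.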
Updated by Stefan Krause over 1 year ago
It would be nice if you could make the -Dxml.catalog.fixWindowsSystemIdentifiers=true option available in the Saxon configuration file. This would save us from changing dozens of Saxon calls in our software.
Updated by Michael Kay over 1 year ago
It would be nice if...
It wouldn't be very nice if setting the property for one Saxon Configuration affected the property setting for other Saxon Configurations in the same application.
Updated by Norm Tovey-Walsh over 1 year ago
Updated by Stefan Krause about 1 year ago
Michael Kay wrote in #note-14:
It's true that the specification says that the system identifier "is meant to be converted to a URI reference". But it also says how that conversion should be done, in the next paragraph: "System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource." So backslash should be converted to %5C, not to /.
Yes. But I think that the converted system identifier has to be passed to the catalog resolver. Throwing an error is IMHO wrong here. See https://www.w3.org/XML/xml-V10-2e-errata#E4: »The fact that the XML processor is responsible for escaping disallowed characters when resolving URI references was lost in the modifications of the 2nd edition.«
The catalog resolver itself performs the same operation on catalogs, so everything should be fine.