Bug #5580

closed

Saxon-PE 11.3 fails at resolving external entity

Added by Toshihiko Makita almost 2 years ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Category: Third-party product
Sprint/Milestone: -
Start date: 2022-06-24
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch: 11, trunk
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

I'm testing Saxon-PE 10.8 and 11.3 for a user project. When I convert an XML file using 10.8, it works without any problem. However, 11.3 reports java.net.URISyntaxException. Here is the screen shot. The test was done on Windows 10 + PowerShell.

PowerShell screen shot

It seems that the Windows path notation "..\master\glossary\gls.ent" in ahfsm-custom.ent is not handled properly. I have attached a ZIP archive with the test data.

diff-2022-06-23.zip

Reproducing procedure:

  1. Unzip diff-2022-06-23.zip.
  2. Set the JDK path and the Saxon-PE path in xmllist/test-pe-10.8.ps1 and test-pe-11.3.ps1.
  3. Open PowerShell in the xmllist folder.
  4. Enter the command "./test-pe-10.8.ps1". This command ends normally.
  5. Enter the command "./test-pe-11.3.ps1". This command ends with an exception.

Hope this helps to fix the 11.3 problem.


Files

2022-06-24-2.png (92.8 KB) PowerShell screen shot, Toshihiko Makita, 2022-06-24 03:45
diff-2022-06-23.zip (35.3 KB) Test data archive, Toshihiko Makita, 2022-06-24 03:51
2022-06-29-9.png (60.5 KB) Toshihiko Makita, 2022-06-29 13:32
Actions #1

Updated by Michael Kay almost 2 years ago

  • Category set to Third-party product
  • Assignee set to Norm Tovey-Walsh
  • Applies to branch 11, trunk added
Actions #2

Updated by Norm Tovey-Walsh almost 2 years ago

I'm testing Saxon-PE 10.8 and 11.3 for a user project. When I convert
an XML file using 10.8, it works without any problem. However, 11.3
reports java.net.URISyntaxException. Here is the screen shot. The test
was done on Windows 10 + PowerShell.

Thanks for the test case. I’ll take a look. I suspect that there’s
an issue with the local encoding. From the screen shot, apparently
“..\master\glossary\gls.ent” is rendered by Windows as
“..¥master¥glossary¥gls.ent”, which is a little worrisome.

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

Actions #3

Updated by Toshihiko Makita almost 2 years ago

Thank you for your notification.

I’m suspicious that there’s an issue with the local encoding.

It is a known problem specific to the Japanese fonts used for display on Windows.

Backslash & Yen sign behavior

We (Japanese) are so accustomed to this text output that we do not notice it. But people outside Japan will worry about the encoding.
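To make the point concrete, here is a minimal sketch using only the JDK. It assumes that the path in the entity file really contains the backslash character U+005C and that only the glyph on screen differs; the class and method names are made up for this illustration.

```java
class BackslashOrYen {
    // Lists the Unicode code points of a string, e.g. "U+002E U+002E U+005C".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The path separator as stored in the file: a backslash, U+005C.
        System.out.println(codePoints("..\\"));          // U+002E U+002E U+005C
        // A genuine yen sign is a different character, U+00A5.
        System.out.println(codePoints("..\u00A5"));      // U+002E U+002E U+00A5
    }
}
```

If the data held a real yen sign, the second form would appear; a display-only substitution leaves the code point at U+005C.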

Hope this helps your understanding.

Regards,

Actions #4

Updated by Norm Tovey-Walsh almost 2 years ago

On closer inspection, I can see what the problem is. In XML, system identifiers aren't filenames, they're URIs. Unescaped backslashes are not valid characters in a URI.

In Saxon 11, we updated the XML resolver used in the product and made catalogs available by default. The new resolver treats the URIs as java.net.URI objects where the old resolver just carried them around as strings. Since the Java URI class is enforcing constraints imposed by the URI specification, it's not clear that there's much we can change.

The easiest workaround is to replace the "\" characters in your system identifiers with either "/" characters or encoded backslashes, "%5C".

If neither of those workarounds is practical, let me know and I'll try to think of another answer.
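The behaviour described above can be reproduced with the JDK alone. This is a minimal sketch, not code from Saxon or the XML Resolver: the class and method names are hypothetical, and it only shows that java.net.URI rejects the raw backslash form while accepting both suggested rewrites.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SystemIdCheck {
    // True if java.net.URI accepts the string as a URI reference.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Workaround 1: use "/" as the path separator.
    static String withForwardSlashes(String s) {
        return s.replace('\\', '/');
    }

    // Workaround 2: percent-encode each backslash as %5C.
    static String withEncodedBackslashes(String s) {
        return s.replace("\\", "%5C");
    }

    public static void main(String[] args) {
        String systemId = "..\\master\\glossary\\gls.ent";
        System.out.println(isValidUri(systemId));                          // false
        System.out.println(isValidUri(withForwardSlashes(systemId)));      // true
        System.out.println(isValidUri(withEncodedBackslashes(systemId)));  // true
    }
}
```

Note that the two rewrites are different URIs: "/" separates path segments, whereas "%5C" stands for a literal backslash inside a single segment.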

Actions #5

Updated by Norm Tovey-Walsh almost 2 years ago

  • Status changed from New to AwaitingInfo
  • Priority changed from Low to Normal
Actions #6

Updated by Toshihiko Makita almost 2 years ago

In XML, system identifiers aren't filenames, they're URIs.

You are absolutely right. The problem is Adobe FrameMaker. The user in question has been using Adobe FrameMaker for over 20 years. As a result, there are tons of these path notations in the CMS. So it is very difficult to tell the user that this notation is not a valid URI. This is my biggest headache.

Actions #7

Updated by Norm Tovey-Walsh almost 2 years ago

  • Status changed from AwaitingInfo to In Progress

Okay. I'll add a feature to the XML Resolver to fix this problem. I won't be surprised if it happens to other users as well.

Actions #8

Updated by Norm Tovey-Walsh almost 2 years ago

  • Status changed from In Progress to Resolved

I've published XML Resolver 4.4.0 which includes an option to address this problem. Use the "FIX_WINDOWS_SYSTEM_IDENTIFIERS" feature.

For example, you can set it with a system property:

java "-Dxml.catalog.fixWindowsSystemIdentifiers=true" -cp ...

You can also set it in a configuration file or via the API, depending on what makes the most sense in your environment. You'll need to swap out the XML Resolver 4.2.0 library for the 4.4.0 version. Instructions about that are now on the Saxonica website: https://www.saxonica.com/html/documentation11/about/installationjava/jarfiles.html
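Where changing the java command line is awkward, the same system property can presumably be set from code. This is a sketch under assumptions: the property name is the one given above, but that the resolver picks it up when its configuration is first created is an assumption to check against the XML Resolver documentation, and the class name is hypothetical.

```java
class EnableWindowsSystemIdFix {
    // Property name as given for the command line in this thread.
    static final String PROPERTY = "xml.catalog.fixWindowsSystemIdentifiers";

    // Call this before any Saxon or XML Resolver object is created
    // (assumption: the resolver reads the property at initialization).
    static void enable() {
        System.setProperty(PROPERTY, "true");
    }

    public static void main(String[] args) {
        enable();
        // Create the Saxon processor and run the transformation after this point.
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

System properties are JVM-wide, so this setting applies to every resolver created in the same process.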

Please let me know if you continue to have difficulty.

Actions #9

Updated by Toshihiko Makita almost 2 years ago

Thank you for your quick fix!!! Much appreciated.

You'll need to swap out the XML Resolver 4.2.0 library for the 4.4.0 version.

Is version 4.4.0 already published on the Saxonica web site?

Actions #10

Updated by Norm Tovey-Walsh almost 2 years ago

Thank you for your quick fix!!! Much appreciated.

You'll need to swap out the XML Resolver 4.2.0 library for the 4.4.0 version.

Is version 4.4.0 already published on the Saxonica web site?

No, I hadn’t considered copying it to the Saxonica web site. You can get
it from Maven or from

https://github.com/xmlresolver/xmlresolver/releases/tag/4.4.0

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

Actions #11

Updated by Toshihiko Makita almost 2 years ago

Thank you, it worked like a charm!

VSCode terminal window

Actions #12

Updated by O'Neil Delpratt over 1 year ago

  • Status changed from Resolved to Closed

Closing this issue as it relates to the XML Resolver.

Actions #13

Updated by Stefan Krause 7 months ago

»In XML, system identifiers aren't filenames, they're URIs.« I'm not sure about this. The specification (https://www.w3.org/TR/xml/#dt-sysid) says of the system identifier:

  • »It is meant to be converted to a URI reference« and
  • »SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")«.

In spite of the specification, a lot of software (including Saxon 9.9) consumes or produces XML with DOS\Windows system identifiers. I would recommend making "-Dxml.catalog.fixWindowsSystemIdentifiers=true" the default behaviour of xmlresolver and/or Saxon.

Actions #14

Updated by Michael Kay 7 months ago

It's true that the specification says that the system identifier "is meant to be converted to a URI reference". But it also says how that conversion should be done, in the next paragraph: "System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource." So backslash should be converted to %5C, not to /.

However, the problem here is that sufficiently many popular software products have ignored the specification, so there is pressure on others to be bug-compatible. Our tendency is to push against that pressure because it leads to chaos and unpredictability, so our preference is generally to be strictly conformant by default, and allow deviations to be switched on as options.

The notion of "a string that can be turned into a valid URI by percent-encoding" is quite widespread across the family of XML specifications, but lacks a simple name. I like to call it a "wannabe URI".
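As an illustration of turning such a string into a valid URI by percent-encoding, here is a minimal sketch using only the JDK. It assumes the set of disallowed characters described in section 4.2.2 of the XML specification (control characters, space, the characters < > " { } | \ ^ `, and all non-ASCII characters, encoded as UTF-8 bytes); the class and method names are hypothetical, and this is not the code used by Saxon or the XML Resolver.

```java
import java.nio.charset.StandardCharsets;

class WannabeUri {
    // ASCII characters assumed to require escaping, besides controls and space.
    private static final String EXCLUDED = "<>\"{}|\\^`";

    // Percent-encodes disallowed characters; existing "%" and "#" are left alone.
    static String escape(String systemId) {
        StringBuilder out = new StringBuilder();
        for (byte b : systemId.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            // Controls and space (<= 0x20), DEL and every non-ASCII byte (>= 0x7F).
            boolean disallowed = c <= 0x20 || c >= 0x7F || EXCLUDED.indexOf(c) >= 0;
            if (disallowed) {
                out.append(String.format("%%%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each backslash becomes %5C; nothing is turned into "/".
        System.out.println(escape("..\\master\\glossary\\gls.ent"));
    }
}
```

The result is accepted by java.net.URI, but it still names a single path segment containing literal backslashes rather than a nested directory path.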

Actions #15

Updated by Stefan Krause 7 months ago

It would be nice if you could make the -Dxml.catalog.fixWindowsSystemIdentifiers=true option available in the Saxon configuration file. This would save us from changing dozens of Saxon calls in our software.

Actions #16

Updated by Michael Kay 7 months ago

It would be nice if...

It wouldn't be very nice if setting the property for one Saxon Configuration affected the property setting for other Saxon Configurations in the same application.

Actions #17

Updated by Norm Tovey-Walsh 7 months ago


Actions #18

Updated by Stefan Krause 6 months ago

Michael Kay wrote in #note-14:

It's true that the specification says that the system identifier "is meant to be converted to a URI reference". But it also says how that conversion should be done, in the next paragraph: "System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource." So backslash should be converted to %5C, not to /.

Yes. But I think that the converted system identifier has to be passed to the catalog resolver. Throwing an error is IMHO wrong here. See https://www.w3.org/XML/xml-V10-2e-errata#E4: »The fact that the XML processor is responsible for escaping disallowed characters when resolving URI references was lost in the modifications of the 2nd edition.«

The catalog resolver itself performs the same operation on catalogs, so everything should be fine.
