Bug #6372

open

Unable to parse Windows-1252 encoded XML files on Linux

Added by Matt Patterson about 1 month ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Category:
-
Start date:
2024-03-15
Due date:
% Done:

0%

Estimated time:
Found in version:
12.4.2
Fixed in version:
Platforms:

Description

From the forum (https://saxonica.plan.io/boards/4/topics/9617):

working with version 12.4.2 on Linux and having this simplified C++ code to explain what I do:

  SaxonProcessor *processor = new SaxonProcessor(true);
  Xslt30Processor *trans = processor->newXslt30Processor();
  XsltExecutable *executable = trans->compileFromFile("/tmp/test.xsl");
  executable->setInitialMatchSelectionAsFile("/tmp/file.xml");
  const char *output = executable->applyTemplatesReturningString();

My file.xml header is like this:

<?xml version="1.0" encoding="windows-1252" standalone="no"?>

I get the following exception running my program:

  SXXP0003  I/O error reported by XML parser processing
  file:///tmp/file1.xml. Caused by
  java.io.UnsupportedEncodingException: Cp1252

My Linux locale is en_US.UTF-8. XML files with utf-8 or iso-8859-1 encodings all work fine.

The same program with the same windows-1252 input files works on Windows, though. I face this problem only on Linux.


Files

saxonc12xmlparse-test1.py (379 Bytes) Martin Honnen, 2024-03-15 16:14
windows-notepad-ansi-sample1.xml (131 Bytes) Martin Honnen, 2024-03-15 16:14
sample1.xml (18 Bytes) Martin Honnen, 2024-03-15 16:14
Actions #1

Updated by Matt Patterson about 1 month ago

If you can add more information about your Linux setup, that would be really helpful:

  • Distro
  • Architecture
  • Version

If you have a sample XML file that triggers the problem, please upload it here. That would be very useful too.

Thanks, Matt.

Actions #2

Updated by Martin Honnen about 1 month ago

Hi Matt,

I am not the one who found the problem, but I can more or less reproduce it with Windows 11 x64 WSL Ubuntu 22.04 ("more or less" because I didn't compile and run C++, only the command-line query tool and a Python test program).

Some details:

(saxonche1242) mh@LIBERTYDELL23:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

(saxonche1242) mh@LIBERTYDELL23:~$ cat /etc/default/locale
LANG=C.UTF-8

(saxonche1242) mh@LIBERTYDELL23:~$ cat windows-notepad-ansi-sample1.xml
<?xml version="1.0" encoding="Windows-1252"?>
<root xml:lang="de">Dies ist ein Test: machen Umlaute wie �, �, � etwa �rger?</root>

(saxonche1242) mh@LIBERTYDELL23:~$ java -cp /mnt/c/Program\ Files/Saxonica/SaxonHE12-4J/saxon-he-12.4.jar net.sf.saxon.Query -qs:"." -s:windows-notepad-ansi-sample1.xml
<?xml version="1.0" encoding="UTF-8"?><root xml:lang="de">Dies ist ein Test: machen Umlaute wie ä, ö, ü etwa Ärger?</root>

(saxonche1242) mh@LIBERTYDELL23:~$ ./libsaxon-HEC-linux-amd64-v12.4.2/command/query -qs:"." -s:windows-notepad-ansi-sample1.xml
Query processing failed: I/O error reported by XML parser processing file:/home/mh/windows-notepad-ansi-sample1.xml

(saxonche1242) mh@LIBERTYDELL23:~$ python saxonc12xmlparse-test1.py
SaxonC-HE 12.4.2 from Saxonica
<root>test</root>
SXXP0003: I/O error reported by XML parser processing file:///home/mh/windows-notepad-ansi-sample1.xml. Line number: -1

(saxonche1242) mh@LIBERTYDELL23:~$ cat sample1.xml
<root>test</root>

I attach the XML samples (sample1.xml is just to test that normal XML parsing works) and the Python sample.

Actions #3

Updated by Martin Honnen about 1 month ago

Just to contrast the result running from Windows with the same Python sample and XML samples:

PS Microsoft.PowerShell.Core\FileSystem::\\wsl.localhost\Ubuntu-22.04\home\mh> python .\saxonc12xmlparse-test1.py
SaxonC-HE 12.4.2 from Saxonica
<root>test</root>
<root xml:lang="de">Dies ist ein Test: machen Umlaute wie ä, ö, ü etwa Ärger?</root>
Actions #4

Updated by Stephan Bielmann about 1 month ago

Thank you both very much for looking into this. I work on Oracle Linux Server release 9.3, 64-bit Intel. Let me know if you need more information, e.g. the locale or the installed locales.

Actions #5

Updated by Matt Patterson about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to Matt Patterson
  • Priority changed from Low to Normal
  • Found in version set to 12.4.2

I can reproduce this on macOS with the Python wrapper, so I have a better sense of what kind of problem this is and where it's located.

Actions #6

Updated by Matt Patterson about 1 month ago

Using just the bits of the Java code that the C++ parse code from the example uses, I can parse an XML file with a charset declaration of windows-1252 fine. That suggests that there's a problem with the version of the parser we're bundling with Saxon C. I am investigating further.

Actions #7

Updated by Martin Honnen about 1 month ago

Perhaps (kind of a wild guess after a broad search by Google to find stuff related to GraalVM and unsupported encodings) it helps to use native-image -H:+AddAllCharsets when building SaxonC to get better/broader encoding/charset support.

Actions #8

Updated by Matt Patterson about 1 month ago

Martin Honnen wrote in #note-7:

Perhaps (kind of a wild guess after a broad search by Google to find stuff related to GraalVM and unsupported encodings) it helps to use native-image -H:+AddAllCharsets when building SaxonC to get better/broader encoding/charset support.

No, that's bang on the money. I'm still digging – there's precious little direct documentation about which character sets are included that I've been able to find so far – but at a guess on Windows they include 1252 by default.

There's other stuff to consider as well – technically, windows-1252 is not a supported charset in XML, but it's worked (thanks to the Xerces team) on Java for years.

I'm also not sure what the size penalty for including all the charsets is, and whether there's even a mechanism for selectively including charsets in a native image library.

There's definitely more to explore with the bundled version of the Xerces parser too – whether the most recent versions might trigger inclusion of the charsets, for one.

Whatever the outcome is, fixing this will require us to make a release, and any workarounds in the meantime will involve parsing the XML files from strings that have been converted to UTF-8, which is trivial in Python but maybe less so in C++.
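
For anyone needing the interim workaround, here is a minimal Python sketch of the transcoding step only. The function name to_utf8_text is hypothetical, and the sketch assumes the file uses an ASCII-compatible encoding (so not UTF-16) and that Python knows the declared encoding name. The resulting string would then be handed to whatever parse-from-string entry point your SaxonC binding offers; that call is not shown here.

```python
import re

# Matches an XML declaration at the very start of the file, up to and
# including its encoding pseudo-attribute. Works on bytes because the
# declaration is plain ASCII in any ASCII-compatible encoding.
_DECL_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def to_utf8_text(raw: bytes) -> str:
    """Decode an XML file using its declared encoding and return the text
    with the declaration rewritten to say UTF-8.

    Hypothetical helper for illustration; assumes an ASCII-compatible
    encoding and no byte order mark.
    """
    match = _DECL_ENCODING.match(raw)
    if match is None:
        # No declared encoding: fall back to the XML default of UTF-8.
        return raw.decode("utf-8")
    declared = match.group(3).decode("ascii")
    quote = match.group(2).decode("ascii")
    prefix = match.group(1).decode("ascii")
    # Decode everything after the encoding pseudo-attribute with the
    # declared encoding, then put back a declaration that names UTF-8.
    rest = raw[match.end():].decode(declared)
    return f"{prefix}{quote}UTF-8{quote}{rest}"


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="Windows-1252"?>\n'
        '<root xml:lang="de">ä, ö, ü, Ärger</root>'
    ).encode("cp1252")
    print(to_utf8_text(sample))
```

Rewriting the declaration matters if the text is later written back to disk as UTF-8; when the string is passed straight to a parser, the declared encoding is usually ignored anyway.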

Actions #9

Updated by Michael Kay about 1 month ago

technically, windows-1252 is not a supported charset in XML

I try to avoid that word "supported" if I possibly can - it has so many different meanings.

Syntactically, XML requires the encoding to have the form [A-Za-z] ([A-Za-z0-9._] | '-')*

Semantically, the spec offers recommendations on what encodings should be accepted and how they should be named (which certainly don't include windows-1252), but the general rule is that a parser can provide whatever encodings it chooses.

I suspect that Xerces allows you to use any encoding that's recognised by the Java VM, and that this can vary from one JVM (or JVM configuration) to another.
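
The two levels described above (syntactically valid name versus an encoding the runtime actually recognises) can be illustrated with a short Python sketch. It uses Python's own codec registry purely as an analogy for the JVM's charset registry; the function names are hypothetical, and nothing here reflects how Xerces or SaxonC are implemented.

```python
import codecs
import re

# EncName production from the XML specification:
#   [A-Za-z] ([A-Za-z0-9._] | '-')*
_ENC_NAME = re.compile(r"[A-Za-z]([A-Za-z0-9._]|-)*")


def is_valid_enc_name(name: str) -> bool:
    """True if the name is syntactically allowed in an XML declaration."""
    return _ENC_NAME.fullmatch(name) is not None


def runtime_recognises(name: str) -> bool:
    """True if this Python runtime has a codec registered for the name.

    Analogy only: a JVM or native image answers the same question from
    its own charset registry, which can differ between builds.
    """
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


if __name__ == "__main__":
    for name in ("windows-1252", "Cp1252", "x-made-up", "1252"):
        print(name, is_valid_enc_name(name), runtime_recognises(name))
```

A name such as x-made-up passes the syntactic check but has no codec behind it, which is the same shape of failure as the UnsupportedEncodingException in the original report.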
