Bug #6372
closed
Unable to parse Windows-1252 encoded XML files on Linux
Description
From the forum (https://saxonica.plan.io/boards/4/topics/9617):
I am working with version 12.4.2 on Linux; this simplified C++ code shows what I do:
SaxonProcessor *processor = new SaxonProcessor(true);
Xslt30Processor *trans = processor->newXslt30Processor();
XsltExecutable *executable = trans->compileFromFile("/tmp/test.xsl");
executable->setInitialMatchSelectionAsFile("/tmp/file.xml");
const char *output = executable->applyTemplatesReturningString();
My file.xml header is like this:
<?xml version="1.0" encoding="windows-1252" standalone="no"?>
I get the following exception running my program:
SXXP0003 I/O error reported by XML parser processing file:///tmp/file1.xml. Caused by java.io.UnsupportedEncodingException: Cp1252
My Linux locale is en_US.UTF-8. XML files with utf-8 or iso-8859-1 encodings all work fine.
The same program and the same windows-1252 encoded input files work on Windows, though. I face this problem only on Linux.
Updated by Matt Patterson 10 months ago
If you can add more information about your Linux setup, that would be really helpful:
- Distro
- Architecture
- Version
If you have a sample XML file that triggers the problem, please upload it here if you can; that would be very useful too.
Thanks, Matt.
Updated by Martin Honnen 10 months ago
- File saxonc12xmlparse-test1.py added
- File windows-notepad-ansi-sample1.xml added
- File sample1.xml added
Hi Matt,
I am not the one who found the problem, but I can more or less reproduce it with Ubuntu 22.04 under WSL on Windows 11 x64 ("more or less" because I didn't compile or run C++; I only used the command-line query tool and a Python test program).
Some details:
(saxonche1242) mh@LIBERTYDELL23:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
(saxonche1242) mh@LIBERTYDELL23:~$ cat /etc/default/locale
LANG=C.UTF-8
(saxonche1242) mh@LIBERTYDELL23:~$ cat windows-notepad-ansi-sample1.xml
<?xml version="1.0" encoding="Windows-1252"?>
<root xml:lang="de">Dies ist ein Test: machen Umlaute wie �, �, � etwa �rger?</root>
(saxonche1242) mh@LIBERTYDELL23:~$ java -cp /mnt/c/Program\ Files/Saxonica/SaxonHE12-4J/saxon-he-12.4.jar net.sf.saxon.Query -qs:"." -s:windows-notepad-ansi-sample1.xml
<?xml version="1.0" encoding="UTF-8"?><root xml:lang="de">Dies ist ein Test: machen Umlaute wie ä, ö, ü etwa Ärger?</root>
(saxonche1242) mh@LIBERTYDELL23:~$ ./libsaxon-HEC-linux-amd64-v12.4.2/command/query -qs:"." -s:windows-notepad-ansi-sample1.xml
Query processing failed: I/O error reported by XML parser processing file:/home/mh/windows-notepad-ansi-sample1.xml
(saxonche1242) mh@LIBERTYDELL23:~$ python saxonc12xmlparse-test1.py
SaxonC-HE 12.4.2 from Saxonica
<root>test</root>
SXXP0003: I/O error reported by XML parser processing file:///home/mh/windows-notepad-ansi-sample1.xml. Line number: -1
(saxonche1242) mh@LIBERTYDELL23:~$ cat sample1.xml
<root>test</root>
I attach the XML samples (sample1.xml is just to test that normal XML parsing works) and the Python sample.
Updated by Martin Honnen 10 months ago
Just to contrast the result running from Windows with the same Python sample and XML samples:
PS Microsoft.PowerShell.Core\FileSystem::\\wsl.localhost\Ubuntu-22.04\home\mh> python .\saxonc12xmlparse-test1.py
SaxonC-HE 12.4.2 from Saxonica
<root>test</root>
<root xml:lang="de">Dies ist ein Test: machen Umlaute wie ä, ö, ü etwa Ärger?</root>
Updated by Stephan Bielmann 10 months ago
Thank you both very much for looking into it. I work on Oracle Linux Server release 9.3, 64-bit Intel. Let me know if you need more information, e.g. the locale or the installed locales.
Updated by Matt Patterson 10 months ago
- Status changed from New to In Progress
- Assignee set to Matt Patterson
- Priority changed from Low to Normal
- Found in version set to 12.4.2
I can reproduce this on macOS with the Python wrapper, so I have a better sense of what kind of problem this is and where it's located.
Updated by Matt Patterson 10 months ago
Using just the bits of the Java code that the C++ parse code from the example uses, I can parse an XML file with a charset declaration of windows-1252 fine. That suggests that there's a problem with the version of the parser we're bundling with Saxon C. I am investigating further.
Updated by Martin Honnen 10 months ago
Perhaps (kind of a wild guess after a broad search by Google to find stuff related to GraalVM and unsupported encodings) it helps to use native-image -H:+AddAllCharsets when building SaxonC to get better/broader encoding/charset support.
Updated by Matt Patterson 10 months ago
Martin Honnen wrote in #note-7:
Perhaps (kind of a wild guess after a broad search by Google to find stuff related to GraalVM and unsupported encodings) it helps to use native-image -H:+AddAllCharsets when building SaxonC to get better/broader encoding/charset support.
No, that's bang on the money. I'm still digging – there's precious little direct documentation about which character sets are included that I've been able to find so far – but at a guess on Windows they include 1252 by default.
There's other stuff to consider as well – technically, windows-1252 is not a supported charset in XML, but it's worked (thanks to the Xerces team) on Java for years.
I'm also not sure what the size penalty for including all the charsets is, and whether there's even a mechanism for selectively including charsets in a native image library.
There's definitely more to explore with the bundled version of the Xerces parser too – whether the most recent versions might trigger inclusion of the charsets, for one.
Whatever the outcome is, this will require us to make a release to fix, and any workarounds in the meantime will involve parsing the XML files from strings that have been converted to UTF-8, which is trivial in Python but maybe less so under C++.
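For illustration only, a minimal sketch of that workaround on the Python side, using just the standard library. The helper name to_utf8_xml and the choice to rewrite the encoding declaration are assumptions made for this sketch, not something taken from SaxonC itself; the idea is simply to decode the bytes outside the parser so that it never has to look up Cp1252.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. encoding="windows-1252".
_DECL_ENCODING = re.compile(
    r"""^(<\?xml[^>]*?encoding=)(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)


def to_utf8_xml(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw XML bytes and relabel the declaration as UTF-8.

    Hypothetical helper: source_encoding is assumed to be known by the
    caller; this sketch does not sniff it from the file.
    """
    text = raw.decode(source_encoding)
    # Rewrite only the first (leading) declaration, if there is one.
    return _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="Windows-1252"?>\n'
        '<root xml:lang="de">\u00c4rger?</root>'
    ).encode("cp1252")
    print(to_utf8_xml(sample))
```

The resulting string can then be handed to whatever parse-from-string entry point the API offers; in the saxonche Python package that is probably parse_xml(xml_text=...) on PySaxonProcessor, but check this against the API documentation for the installed version.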
Updated by Michael Kay 10 months ago
technically, windows-1252 is not a supported charset in XML
I try to avoid that word "supported" if I possibly can - it has so many different meanings.
Syntactically, XML requires the encoding to have the form [A-Za-z] ([A-Za-z0-9._] | '-')*
Semantically, the spec offers recommendations on what encodings should be accepted and how they should be named (which certainly don't include windows-1252), but the general rule is that a parser can support whatever encodings it chooses.
I suspect that Xerces allows you to use any encoding that's recognised by the Java VM, and that this can vary from one JVM (or JVM configuration) to another.
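As a small illustration of the syntactic rule quoted above, the following sketch checks candidate names against that pattern. The function name is_valid_enc_name is made up for this example. It shows only that windows-1252 is a well-formed encoding name; whether a given parser or runtime actually recognises it is the separate, semantic question.

```python
import re

# The syntactic rule quoted above: [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """Return True if name is syntactically valid as an XML encoding name."""
    return ENC_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("windows-1252", "Cp1252", "UTF-8", "1252", "iso 8859-1"):
        print(candidate, is_valid_enc_name(candidate))
```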
Updated by Matt Patterson 8 months ago
- Category set to Graalvm build
- Status changed from In Progress to Resolved
native-image build parameters changed to always include all charsets.
Verified fixed, will be shipped in the next maintenance release.
Updated by O'Neil Delpratt 7 months ago
- Status changed from Resolved to Closed
- Fixed in version set to 12.5
Bug fix applied in the Saxon 12.5 Maintenance release.