Bug #6182
open

UTF-8 in string based C API functions
90%
Description
I tried to get the following code running and the encoding of the return value seems off.
void testUTF8StringTemplate(SaxonProcessor *proc, Xslt30Processor *trans,
sResultCount *sresult) {
const char *source =
"<?xml version='1.0' encoding='UTF8'?> <xsl:stylesheet "
"xmlns:xsl='http://www.w3.org/1999/XSL/Transform' "
"xmlns:xs='http://www.w3.org/2001/XMLSchema' version='3.0'> "
"<xsl:template match='*'> <xsl:sequence select=''تيست''/> </xsl:template> </xsl:stylesheet>";
cout << endl << "Test:testUTF8StringTemplate" << endl;
XsltExecutable *executable = trans->compileFromString(source);
if (executable == nullptr) {
if (trans->exceptionOccurred()) {
cout << "Error: " << trans->getErrorMessage() << endl;
}
return;
}
const char* _in = "<?xml version='1.0' encoding='UTF8'?><e>تيست</e>";
XdmNode *node = proc->parseXmlFromString(_in);
executable->setResultAsRawValue(false);
std::map<std::string, XdmValue *> parameterValues;
executable->setInitialTemplateParameters(parameterValues, false);
executable->setInitialMatchSelection(node);
XdmValue *result = executable->applyTemplatesReturningValue();
if (result != nullptr) {
sresult->success++;
cout << "Input=" << _in;
cout << "Result=" << result->getHead()->getStringValue() << endl << node->toString() << endl;
delete result;
} else {
sresult->failure++;
sresult->failureList.push_back("testUTF8StringTemplate");
}
delete executable;
delete node;
parameterValues.clear();
}
Compiled with VS 2017:
cl /utf-8 /EHsc "-I%graalvmdir%" testXSLT30.cpp ../../Saxon.C.API/SaxonCGlue.c ../../Saxon.C.API/SaxonCXPath.c ../../Saxon.C.API/SaxonProcessor.cpp ../../Saxon.C.API/XdmValue.cpp ../../Saxon.C.API/XdmItem.cpp ../../Saxon.C.API/XdmAtomicValue.cpp ../../Saxon.C.API/DocumentBuilder.cpp ../../Saxon.C.API/XdmNode.cpp ../../Saxon.C.API/XdmFunctionItem.cpp ../../Saxon.C.API/XdmArray.cpp ../../Saxon.C.API/XdmMap.cpp ../../Saxon.C.API/SaxonApiException.cpp ../../Saxon.C.API/XQueryProcessor.cpp ../../Saxon.C.API/Xslt30Processor.cpp ../../Saxon.C.API/XsltExecutable.cpp ../../Saxon.C.API/XPathProcessor.cpp ../../Saxon.C.API/SchemaValidator.cpp /link ..\..\libs\win\libsaxon-hec-12.3.lib
Result in a UTF-8 enabled PowerShell window:
Test:testUTF8StringTemplate
Input=<?xml version='1.0' encoding='UTF8'?><e>تيست</e>Result=تيست
<e>تيست</e>
Any ideas how to fix this? It seems the UTF-8 string is encoded twice.
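The "encoded twice" symptom can be reproduced outside SaxonC. The following is a minimal sketch, not SaxonC code: it assumes the UTF-8 bytes are misread somewhere as Windows-1252 (`cp1252`), which is a guess at this point in the report. The helper name `misread_as_platform` is made up for illustration.

```python
# Sketch of "double encoding": UTF-8 bytes are misread as a single-byte
# code page and would then be written out as UTF-8 a second time.
# Assumption: the single-byte code page is Windows-1252 (cp1252).

def misread_as_platform(text: str, platform_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode(platform_charset)

original = "höher als 300€"
garbled = misread_as_platform(original)

# The damage is reversible as long as every byte maps to a cp1252 character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

Each non-ASCII character turns into two or three unrelated characters, which matches the kind of output seen when a UTF-8 string is decoded with a legacy code page.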
Updated by Norm Tovey-Walsh about 1 month ago
Thanks for the report. I've been unable to reproduce the results with either SaxonC HE or EE 12.3. Unfortunately, I'm running on a Mac/arm64 machine, not Windows. Testing on Windows will take a bit longer.
Are you using 12.3?
Updated by Omar Siam about 1 month ago
Yes, that is with the latest 12.3.
Yes, I expected that. I have not been able to test this on anything other than Windows yet, but I assumed it would just work there.
Of macOS, Linux and Windows, Windows is now the only operating system that does not use UTF-8 as its 8-bit character encoding; it uses whatever code page was used in your region of the world back in the 90s. For me this is Windows-1252.
This is then in turn used by default in Java on Windows.
There are a few problem reports, also on Graal VM from other projects (e.g. quarkus using GraalVM native-image) that had strange effects on Windows only.
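To see which 8-bit encoding a given machine reports as its default, a quick check like the following can help. This is a sketch using only the Python standard library; it shows what Python sees, which is only an approximation of what a Java runtime on the same machine would pick. The function name `platform_charset` is made up for illustration.

```python
import codecs
import locale

def platform_charset() -> str:
    """Return the normalized name of the locale's preferred encoding."""
    name = locale.getpreferredencoding(False)
    return codecs.lookup(name).name

print("platform default:", platform_charset())

# The same character has a different byte representation in each encoding,
# which is why mixing them up corrupts text.
euro_cp1252 = "€".encode("cp1252")
euro_utf8 = "€".encode("utf-8")
print(euro_cp1252, euro_utf8)
```

On a Western European Windows installation this typically reports `cp1252`, unless Python is started in UTF-8 mode (for example with `-Xutf8`).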
Updated by Omar Siam about 1 month ago
Tried to make sure to use the correct codepage in console output on Windows by adding one line at the start of main in testXSLT30.cpp:
int main(int argc, char *argv[]) {
SetConsoleOutputCP(65001);
And for PowerShell to render UTF-8
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Updated by Omar Siam about 1 month ago
Just tried test_saxonc.py from libsaxon-HEC-windows-amd64-v12.3\pypi using pipenv with this Pipfile:
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
pytest = "*"
saxonche = "12.3"
[dev-packages]
[requires]
python_version = "3.11"
pipenv run python -Xutf8 -m pytest test_saxonc.py
In the result there is one failing test that shows this bug:
AssertionError: assert '<!--A-->§<out/>§<!--Z-->' == '<!--A-->§<out/>§<!--Z-->'
================================================= test session starts =================================================
platform win32 -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0
rootdir: Q:\libsaxon-HEC-windows-amd64-v12.3\pypi
[...]
=============================================== short test summary info ===============================================
FAILED test_saxonc.py::testNodeAxis - TypeError: can only concatenate str (not "NoneType") to str
FAILED test_saxonc.py::testCollection - NameError: name 'getcwd' is not defined
FAILED test_saxonc.py::testXquery_40_functions - saxonche.PySaxonApiError: Requested feature (XQuery 4.0) requires Saxon-PE. Line number: -1
FAILED test_saxonc.py::testXdmDestinationWithItemSeparator - AssertionError: assert '<!--A-->§<out/>§<!--Z-->' == '<!--A-->§<out/>§<!--Z-->'
FAILED test_saxonc.py::test_parse_xml_file1 - saxonche.PySaxonApiError: SXXP0003: I/O error reported by XML parser processing /Q:/libsaxon-HEC-windows-amd64-v12....
FAILED test_saxonc.py::test_packages - saxonche.PySaxonApiError: Exporting a stylesheet requires Saxon-EE. Line number: -1
======================================= 6 failed, 79 passed, 2 skipped in 1.31s =======================================
Updated by Omar Siam about 1 month ago
Python test case with Arabic letters:
def testUTF8(saxonproc):
    node = saxonproc.parse_xml(xml_text="تيست")
    trans = saxonproc.new_xslt30_processor()
    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>")
    assert node is not None
    assert isinstance(node, PyXdmNode)
    assert len(node.children)>0
    eNode = node.children[0].children[0]
    assert eNode is not None
    executable.set_global_context_item(xdm_item=node)
    executable.set_initial_match_selection(xdm_value=eNode)
    result = executable.apply_templates_returning_string()
    assert result is not None
    assert "UTF8-تيست: تيست" in result
Result:
FAILED test_saxonc.py::testUTF8 - assert 'UTF8-تيست: تيست' in 'UTF8-تيست: تيست'
Updated by Omar Siam about 1 month ago
Python test case with Arabic letters:
def testUTF8(saxonproc):
    node = saxonproc.parse_xml(xml_text="<doc><e>تيست</e></doc>")
    trans = saxonproc.new_xslt30_processor()
    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>")
    assert node is not None
    assert isinstance(node, PyXdmNode)
    assert len(node.children)>0
    eNode = node.children[0].children[0]
    assert eNode is not None
    executable.set_global_context_item(xdm_item=node)
    executable.set_initial_match_selection(xdm_value=eNode)
    result = executable.apply_templates_returning_string()
    assert result is not None
    assert "UTF8-تيست: تيست" in result
Result:
FAILED test_saxonc.py::testUTF8 - assert 'UTF8-تيست: تيست' in '<?xml version="1.0" encoding="UTF-8"?>UTF8-تيست: تيست'
Updated by Martin Honnen about 1 month ago
I can confirm the problem under Windows; it doesn't occur with Linux Ubuntu in the WSL shell.
The damage seems to be already done on parsing as e.g.
node = saxon.parse_xml(xml_text="<root><text xml:lang='de'>Der Preis ist höher als 300€</text></root>")
print(node)
prints
<root>
<text xml:lang="de">Der Preis ist höher als 300€</text>
</root>
The commands directly, however, like Query.exe, seem to be able to handle the characters e.g.
'C:\Program Files\Saxonica\libsaxon-HEC-windows-amd64-v12.3\command\Query.exe' -qs:"<root>Der Preis ist höher als 300€.</root>"
<?xml version="1.0" encoding="UTF-8"?><root>Der Preis ist höher als 300€.</root>
Updated by Martin Honnen about 1 month ago
All SaxonC (12.0, 12.1, 12.2, 12.3) releases for Windows seem to show that problem.
Updated by Omar Siam 27 days ago
OK so I found interfacing GraalVM interesting in itself. I created a test case for my target language FreePascal here
I am guessing here because I could not find the sources for the glue code used in the Saxon-C version of Saxon HE.
The Java part is:
import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.c.function.CEntryPoint;
import org.graalvm.nativeimage.c.type.CCharPointer;
import org.graalvm.nativeimage.c.type.CTypeConversion;
public class Main {
public static void main(String[] args) {
System.out.println("file.encoding:" + System.getProperty("file.encoding"));
System.out.println("UTF-8 text with some international chars:");
System.out.println("äëïöü áéíóú àèiòù Ññ Çç € تيست");
}
@CEntryPoint(name = "print_and_return")
static CCharPointer print_and_return(IsolateThread thread, CCharPointer _in) {
String in = CTypeConversion.toJavaString(_in);
System.out.println("file.encoding:" + System.getProperty("file.encoding"));
System.out.println("String passed as input is:");
System.out.println(in);
return CTypeConversion.toCString(in).get();
}
@CEntryPoint(name = "add_utf8_print_and_return")
static CCharPointer add_utf8_print_and_return(IsolateThread thread, CCharPointer _in) {
String in = CTypeConversion.toJavaString(_in);
System.out.println("file.encoding:" + System.getProperty("file.encoding"));
System.out.println("String passed as input is:");
System.out.println(in);
return CTypeConversion.toCString("“" + in + "”").get();
}
}
I just followed the examples and the JavaDoc
main thread pointer 00000000000C1E40\n
Cur attach thread pointer same 00000000000C1E40\n
file.encoding:UTF-8
String passed as input is:
A Test
file.encoding:UTF-8
String passed as input is:
äöü߀
file.encoding:UTF-8
String passed as input is:
A Test
file.encoding:UTF-8
String passed as input is:
äöü߀
!
All Tests [00.009] : 4 ok
Graal UTF-8 Tests [00.008] : 4 ok
[00.008] : 4 ok
Graal string without UTF-8: A Test [00.005] pass
Graal string with UTF-8: äöü߀ [00.001] pass
Graal string without UTF-8, Java adds UTF-8: “A Test” [00.001] pass
Graal string with UTF-8, Java adds UTF-8: “äöü߀” [00.001] pass
To get a result with my code that looks correct I have to:
- Create the DLL with -J-Dfile.encoding=UTF-8. Otherwise at least some output is broken.
- Set the console codepage to 65001: chcp 65001
main thread pointer 0000000001581E80\n
Cur attach thread pointer same 0000000001581E80\n
file.encoding:Cp1252
String passed as input is:
A Test
file.encoding:Cp1252
String passed as input is:
äöü߀
file.encoding:Cp1252
String passed as input is:
A Test
file.encoding:Cp1252
String passed as input is:
äöü߀
!
All Tests [00.007] : 2 ok, 2 failed
Graal UTF-8 Tests [00.006] : 2 ok, 2 failed
[00.006] : 2 ok, 2 failed
Graal string without UTF-8: A Test [00.003] pass
Graal string with UTF-8: äöü߀ [00.001] pass
Graal string without UTF-8, Java adds UTF-8: “A Test” [00.001] fail @?#?: "Result should match input:A Test" expected: <“A Test”> but was: <?A Test?>
Graal string with UTF-8, Java adds UTF-8: “äöü߀” [00.001] fail @?#?: "Result should match input:äöü߀" expected: <“äöü߀”> but was: <?äöü߀?>
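One plausible reading of the Cp1252 run above, sketched in Python rather than Java: if both string conversions at the boundary use the platform default, UTF-8 input survives as long as every byte maps to a Cp1252 character, but characters added on the Java side come back as single Cp1252 bytes that are not valid UTF-8. This is an interpretation of the output, not the actual glue code; `through_java` and `PLATFORM` are hypothetical names.

```python
PLATFORM = "cp1252"  # assumed default charset of the Windows machine

def through_java(utf8_bytes: bytes, prefix: str = "", suffix: str = "") -> bytes:
    """Mimic a C -> Java -> C round trip that uses the platform default."""
    java_string = utf8_bytes.decode(PLATFORM)    # like toJavaString with the default charset
    java_string = prefix + java_string + suffix  # text added on the Java side
    return java_string.encode(PLATFORM)          # like toCString with the default charset

# Pure pass-through: the bytes are misread, then written back unchanged.
passed = through_java("äöü߀".encode("utf-8"))
print(passed.decode("utf-8"))

# Java adds curly quotes: they become the single bytes 0x93 and 0x94,
# which a UTF-8 reader cannot decode.
quoted = through_java("A Test".encode("utf-8"), "\u201c", "\u201d")
print(quoted)
print(quoted.decode("utf-8", errors="replace"))
```

Under this reading the two pass-through tests succeed by accident, while the two tests where Java contributes non-ASCII characters fail, which is the pattern in the log.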
Updated by O'Neil Delpratt 27 days ago
Hi,
Apologies, I was away last week so I have not been able to take the lead on this issue. Thanks for everyone's comments. I will investigate it further to see how we can correctly handle the encoding on Windows.
Updated by O'Neil Delpratt 26 days ago
- Category set to Saxon-C Internals
- Status changed from New to In Progress
- Assignee set to O'Neil Delpratt
- Priority changed from Low to Normal
- Found in version set to 12.3
I have managed to reproduce the encoding issue on my Windows machine. Now investigating.
Updated by O'Neil Delpratt 23 days ago
Just to report back that I tried the option -J-Dfile.encoding=UTF-8 in our native-image build for SaxonC and it works. I am seeing the correct encoding on Windows.
Updated by Martin Honnen 23 days ago
Great to hear.
So it seems Omar kind of both discovered the bug and the possible fix.
Now the other users only need 12.4 with the fix applied.. :).
Updated by O'Neil Delpratt 23 days ago
Ideally I think we should make the patch work as a runtime property rather than a compile-time one. I will investigate whether that can be done.
Updated by Martin Honnen 23 days ago
Omar,
I don't build SaxonC so I don't know whether all the code to do that is online, but part of the glue code similar to the string C/Pascal conversions you have shown seems to be in https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_12_3/entry/src/main/java/net/sf/saxon/option/cpp/SaxonCAPI.java#L939, I think.
Updated by Omar Siam 18 days ago
The https://saxonica.plan.io/projects/saxonmirrorhe/repository/he repository contains almost everything needed to rebuild Saxon HE C 12.
The problem is that it contains too much of the original code for PE and EE:
- imports that can not be resolved of course
- code that is behind "IFDEFs" but is not stripped when building
- tests that can not run or can not succeed on HE
There is only one file missing that cannot be generated without a commercial license: function-library.sef.xml. But this file for a particular version can be found in the Saxon J HE 12 jar file.
I cloned the repository and committed my deletions to a branch. If anyone wants to do something with that code feel free to contact me. I am reluctant to push it to a public repo on a service like GitHub.
Updated by Norm Tovey-Walsh 16 days ago
It's supposed to be possible to build with the HE repo, but it's not been tested in a long time. You can't build the files under src/main/java/... because there's a bunch of preprocessing that has to be done to avoid the PE/EE parts in the HE build. Try the eejRelease target. I'll take a look when I have a moment.
Updated by O'Neil Delpratt 13 days ago
Hi, setting the output encoding might be a workaround for your current problem (i.e. setProperty("!encoding", "UTF-8")). The '!' symbol enforces the setting of serialization properties.
For example, in C++ calling applyTemplatesReturningString or applyTemplatesReturningFile enforces the use of the properties set on the Saxon serializer:
executable->setProperty("!encoding", "UTF-8");
const char *result = executable->applyTemplatesReturningString();
Unfortunately this does not work for applyTemplatesReturningValue() when you call getStringValue(), as this uses the platform default encoding and does not involve the serializer.
For more information on properties see: https://www.saxonica.com/saxon-c/documentation12/index.html#!configuration/xslt And for serialization properties see: https://www.saxonica.com/documentation12/index.html#!extensions/output-extras
I am working on allowing the encoding to be set at the point where a string is passed in. This will be available in the next maintenance release.
Updated by Martin Honnen 12 days ago
Should that workaround to set the serialization output encoding also work with Python under Windows?
For me it seems it doesn't, code like
from saxonche import *
xslt1 = '''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:template name="xsl:initial-template">
<root>
<text xml:lang="de">Der Preis ist höher als 300€.</text>
</root>
</xsl:template>
</xsl:stylesheet>
'''
xslt2 = '''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all"
expand-text="yes">
<xsl:param name="text" as="xs:string" required="yes"/>
<xsl:template name="xsl:initial-template">
<root>
<text>{$text}</text>
</root>
</xsl:template>
</xsl:stylesheet>
'''
with PySaxonProcessor(license=False) as saxon:
    xslt30_processor = saxon.new_xslt30_processor()
    xslt_executable = xslt30_processor.compile_stylesheet(stylesheet_text=xslt1)
    xslt_executable.set_property('!encoding', 'UTF-8')
    print(xslt_executable.call_template_returning_string())
    xslt_executable = xslt30_processor.compile_stylesheet(stylesheet_text=xslt2)
    xslt_executable.set_property('!encoding', 'UTF-8')
    xslt_executable.set_parameter('text', saxon.make_string_value('Der Preis ist höher als 300€.'))
    print(xslt_executable.call_template_returning_string())
outputs e.g.
<?xml version="1.0" encoding="UTF-8"?><root><text xml:lang="de">Der Preis ist höher als 300€.</text></root>
<?xml version="1.0" encoding="UTF-8"?><root><text>Der Preis ist höher als 300€.</text></root>
Updated by O'Neil Delpratt 12 days ago
It does not work in the Python API because we encode the strings to UTF-8 and then, at the C++ to Java boundary, the platform default is used. The solution is to allow the user to pass in the encoding of the input string or file. This should be available in the next maintenance release.
Updated by O'Neil Delpratt 5 days ago
- % Done changed from 0 to 90
I have applied a patch to this encoding issue.
The encoding of the output is handled by the property on the Java Serialiser class (i.e. set_property('!encoding', 'UTF-8')). However, for the input string the user can now specify the encoding as an argument. For example:
node = parse_xml(xml_text="<doc><e>Schrödinger</e></doc>", encoding="latin1")
print(node.get_string_value(encoding="latin1"))
At the Java boundary we now use the following GraalVM method to convert strings across the boundary:
toJavaString(CCharPointer cString, UnsignedWord length, Charset charset)
This will be available in the next maintenance release. I have done some preliminary testing, but I still need to do some testing on the Windows platform.
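The idea behind passing an explicit charset at the boundary can be sketched as follows. This is an illustration in Python, not the SaxonC implementation: `to_text` is a hypothetical stand-in that takes a byte buffer, a length and a charset name, in the spirit of the three-argument toJavaString mentioned above.

```python
def to_text(buffer: bytes, length: int, charset: str) -> str:
    """Decode the first `length` bytes of `buffer` using an explicit charset."""
    return buffer[:length].decode(charset)

xml = "<doc><e>Schrödinger</e></doc>"
latin1_bytes = xml.encode("latin1")
utf8_bytes = xml.encode("utf-8")

# When the caller names the charset the bytes were written in,
# both representations decode to the same text.
as_latin1 = to_text(latin1_bytes, len(latin1_bytes), "latin1")
as_utf8 = to_text(utf8_bytes, len(utf8_bytes), "utf-8")

# Naming the wrong charset reproduces the original corruption.
misread = to_text(utf8_bytes, len(utf8_bytes), "latin1")

print(as_latin1 == as_utf8)
print(misread)
```

The decoding no longer depends on a platform default, so the result is the same on Windows, Linux and macOS as long as the caller states the charset correctly.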