Bug #6182

open

UTF-8 in string based C API functions

Added by Omar Siam 11 months ago. Updated 7 months ago.

Status: In Progress
Priority: Normal
Category: Saxon-C Internals
Start date: 2023-08-22
Due date:
% Done: 90%
Estimated time:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Found in version: 12.3
Fixed in version:
SaxonC Languages:
SaxonC Platforms:
SaxonC Architecture:

Description

I tried to get the following code running and the encoding of the return value seems off.

void testUTF8StringTemplate(SaxonProcessor *proc, Xslt30Processor *trans,
                         sResultCount *sresult) {

  const char *source =
      "<?xml version='1.0' encoding='UTF8'?>  <xsl:stylesheet "
      "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'  "
      "xmlns:xs='http://www.w3.org/2001/XMLSchema'  version='3.0'>  "
      "<xsl:template match='*'>     <xsl:sequence select='&apos;تيست&apos;'/>  </xsl:template>  </xsl:stylesheet>";
  cout << endl << "Test:testUTF8StringTemplate" << endl;
  XsltExecutable *executable = trans->compileFromString(source);
  if (executable == nullptr) {
    if (trans->exceptionOccurred()) {
      cout << "Error: " << trans->getErrorMessage() << endl;
    }
    return;
  }
  const char* _in = "<?xml version='1.0' encoding='UTF8'?><e>تيست</e>";
  XdmNode *node = proc->parseXmlFromString(_in);
  executable->setResultAsRawValue(false);
  std::map<std::string, XdmValue *> parameterValues;

  executable->setInitialTemplateParameters(parameterValues, false);
  executable->setInitialMatchSelection(node);
  XdmValue *result = executable->applyTemplatesReturningValue();
  if (result != nullptr) {
    sresult->success++;
    cout << "Input=" << _in;
    cout << "Result=" << result->getHead()->getStringValue() << endl << node->toString() << endl;
    delete result;
  } else {
    sresult->failure++;
    sresult->failureList.push_back("testUTF8StringTemplate");
  }
  delete executable;
  delete node;
  parameterValues.clear();
}
Compiled with VS 2017:
cl /utf-8 /EHsc "-I%graalvmdir%"  testXSLT30.cpp ../../Saxon.C.API/SaxonCGlue.c ../../Saxon.C.API/SaxonCXPath.c  ../../Saxon.C.API/SaxonProcessor.cpp ../../Saxon.C.API/XdmValue.cpp ../../Saxon.C.API/XdmItem.cpp ../../Saxon.C.API/XdmAtomicValue.cpp ../../Saxon.C.API/DocumentBuilder.cpp ../../Saxon.C.API/XdmNode.cpp ../../Saxon.C.API/XdmFunctionItem.cpp ../../Saxon.C.API/XdmArray.cpp ../../Saxon.C.API/XdmMap.cpp ../../Saxon.C.API/SaxonApiException.cpp ../../Saxon.C.API/XQueryProcessor.cpp ../../Saxon.C.API/Xslt30Processor.cpp ../../Saxon.C.API/XsltExecutable.cpp ../../Saxon.C.API/XPathProcessor.cpp ../../Saxon.C.API/SchemaValidator.cpp /link ..\..\libs\win\libsaxon-hec-12.3.lib

Result in a UTF-8-enabled PowerShell window:

Test:testUTF8StringTemplate
Input=<?xml version='1.0' encoding='UTF8'?><e>تيست</e>Result=تيست
<e>تيست</e>

Any ideas how to fix this? It seems the UTF-8 string is encoded twice.

Actions #1

Updated by Norm Tovey-Walsh 11 months ago

Thanks for the report. I've been unable to reproduce the results with either SaxonC HE or EE 12.3. Unfortunately, I'm running on a Mac/arm64 machine, not Windows. Testing on Windows will take a bit longer.

Are you using 12.3?

Actions #2

Updated by Omar Siam 11 months ago

Yes, that is with the latest 12.3. And yes, I expected that. I have not been able to test this on anything other than Windows yet, but I assumed it would just work.
Of macOS, Linux, and Windows, Windows is now the only operating system that does not use UTF-8 as its default 8-bit character encoding; it uses whatever code page was common in your region of the world back in the 90s. For me this is Windows-1252. Java on Windows then uses that code page by default. There are a few problem reports, including from other projects using GraalVM (e.g. Quarkus with GraalVM native-image), about strange effects that appear on Windows only.
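
The double-encoding symptom described above can be reproduced without Saxon. The following is a minimal Python sketch, assuming the bytes are valid UTF-8 and that some layer decodes them as Windows-1252. The function names are illustrative and not part of any SaxonC API.

```python
def misdecode(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    # Python's cp1252 codec raises UnicodeDecodeError for the few byte values
    # that Windows-1252 leaves unassigned; the samples below avoid them.
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled: str, wrong_charset: str = "cp1252") -> str:
    """Undo misdecode(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_charset).decode("utf-8")


if __name__ == "__main__":
    # Section sign, a German word, and the Arabic test word from this report.
    for sample in ("\u00a7", "h\u00f6her", "\u062a\u064a\u0633\u062a"):
        garbled = misdecode(sample)
        # ascii() keeps the output printable on any console code page.
        print(ascii(sample), "->", ascii(garbled), "->", ascii(repair(garbled)))
```

Each character outside ASCII turns into two or more Windows-1252 characters, which is the pattern one would expect if a UTF-8 string is treated as platform-encoded text somewhere along the way.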

Actions #3

Updated by Omar Siam 11 months ago

Tried to make sure to use the correct codepage in console output on Windows by adding one line at the start of main in testXSLT30.cpp:

int main(int argc, char *argv[]) {
  SetConsoleOutputCP(65001);

And for PowerShell to render UTF-8:

[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Actions #4

Updated by Omar Siam 11 months ago

Just tried test_saxonc.py from libsaxon-HEC-windows-amd64-v12.3\pypi using pipenv with this Pipfile:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pytest = "*"
saxonche = "12.3"

[dev-packages]

[requires]
python_version = "3.11"
pipenv run python -Xutf8 -m pytest test_saxonc.py

In the result there is one failing test that shows this bug: AssertionError: assert '<!--A-->§<out/>§<!--Z-->' == '<!--A-->§<out/>§<!--Z-->'

================================================= test session starts =================================================
platform win32 -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0
rootdir: Q:\libsaxon-HEC-windows-amd64-v12.3\pypi
[...]
=============================================== short test summary info ===============================================
FAILED test_saxonc.py::testNodeAxis - TypeError: can only concatenate str (not "NoneType") to str
FAILED test_saxonc.py::testCollection - NameError: name 'getcwd' is not defined
FAILED test_saxonc.py::testXquery_40_functions - saxonche.PySaxonApiError: Requested feature (XQuery 4.0) requires Saxon-PE. Line number: -1
FAILED test_saxonc.py::testXdmDestinationWithItemSeparator - AssertionError: assert '<!--A-->§<out/>§<!--Z-->' == '<!--A-->§<out/>§<!--Z-->'
FAILED test_saxonc.py::test_parse_xml_file1 - saxonche.PySaxonApiError: SXXP0003: I/O error reported by XML parser processing /Q:/libsaxon-HEC-windows-amd64-v12....
FAILED test_saxonc.py::test_packages - saxonche.PySaxonApiError: Exporting a stylesheet requires Saxon-EE. Line number: -1
======================================= 6 failed, 79 passed, 2 skipped in 1.31s =======================================
Actions #5

Updated by Omar Siam 11 months ago

Python test case with arabic letters:

def testUTF8(saxonproc):
    node = saxonproc.parse_xml(xml_text="تيست")
    trans = saxonproc.new_xslt30_processor()
    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>")
    assert node is not None
    assert isinstance(node, PyXdmNode)
    assert len(node.children)>0
    eNode = node.children[0].children[0]
    assert eNode is not None
    executable.set_global_context_item(xdm_item=node)
    executable.set_initial_match_selection(xdm_value=eNode)
    result = executable.apply_templates_returning_string()
    assert result is not None
    assert "UTF8-تيست: تيست" in result

Result:

FAILED test_saxonc.py::testUTF8 - assert 'UTF8-تيست: تيست' in 'UTF8-تيست: تيست'

Actions #6

Updated by Omar Siam 11 months ago

Python test case with arabic letters:

def testUTF8(saxonproc):
    node = saxonproc.parse_xml(xml_text="<doc><e>تيست</e></doc>")
    trans = saxonproc.new_xslt30_processor()
    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>")
    assert node is not None
    assert isinstance(node, PyXdmNode)
    assert len(node.children)>0
    eNode = node.children[0].children[0]
    assert eNode is not None
    executable.set_global_context_item(xdm_item=node)
    executable.set_initial_match_selection(xdm_value=eNode)
    result = executable.apply_templates_returning_string()
    assert result is not None
    assert "UTF8-تيست: تيست" in result

Result:

FAILED test_saxonc.py::testUTF8 - assert 'UTF8-تيست: تيست' in '<?xml version="1.0" encoding="UTF-8"?>UTF8-تيست: تيست'
Actions #7

Updated by Martin Honnen 11 months ago

I can confirm the problem under Windows; it doesn't occur with Linux Ubuntu in the WSL shell.

The damage seems to be already done on parsing as e.g.

    node = saxon.parse_xml(xml_text="<root><text xml:lang='de'>Der Preis ist höher als 300€</text></root>")
    print(node)

prints

<root>
   <text xml:lang="de">Der Preis ist höher als 300€</text>
</root>

The command-line tools themselves, however, such as Query.exe, seem to handle the characters correctly, e.g.

'C:\Program Files\Saxonica\libsaxon-HEC-windows-amd64-v12.3\command\Query.exe' -qs:"<root>Der Preis ist höher als 300€.</root>"   
<?xml version="1.0" encoding="UTF-8"?><root>Der Preis ist höher als 300€.</root>
Actions #8

Updated by Martin Honnen 11 months ago

All SaxonC (12.0, 12.1, 12.2, 12.3) releases for Windows seem to show that problem.

Actions #9

Updated by Omar Siam 11 months ago

OK, so I found interfacing with GraalVM interesting in itself. I created a test case for my target language, FreePascal, here.
I am guessing here, because I could not find the sources for the glue code used in the Saxon-C version of Saxon HE. The Java part is:

import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.c.function.CEntryPoint;
import org.graalvm.nativeimage.c.type.CCharPointer;
import org.graalvm.nativeimage.c.type.CTypeConversion;
 
public class Main {
    public static void main(String[] args) {
        System.out.println("file.encoding:" + System.getProperty("file.encoding"));
        System.out.println("UTF-8 text with some international chars:");
        System.out.println("äëïöü áéíóú àèiòù Ññ Çç € تيست");
    }
    @CEntryPoint(name = "print_and_return")
    static CCharPointer print_and_return(IsolateThread thread, CCharPointer _in) {
        String in = CTypeConversion.toJavaString(_in);
        System.out.println("file.encoding:" + System.getProperty("file.encoding"));
        System.out.println("String passed as input is:");
        System.out.println(in);
        return CTypeConversion.toCString(in).get();
    }
    @CEntryPoint(name = "add_utf8_print_and_return")
    static CCharPointer add_utf8_print_and_return(IsolateThread thread, CCharPointer _in) {
        String in = CTypeConversion.toJavaString(_in);
        System.out.println("file.encoding:" + System.getProperty("file.encoding"));
        System.out.println("String passed as input is:");
        System.out.println(in);
        return CTypeConversion.toCString("“" + in  + "”").get();
    }
}

I just followed the examples and the JavaDoc.

main thread pointer 00000000000C1E40\n
Cur attach thread pointer same 00000000000C1E40\n
file.encoding:UTF-8
String passed as input is:
A Test
file.encoding:UTF-8
String passed as input is:
äöü߀
file.encoding:UTF-8
String passed as input is:
A Test
file.encoding:UTF-8
String passed as input is:
äöü߀
!

All Tests [00.009] : 4 ok
  Graal UTF-8 Tests [00.008] : 4 ok
     [00.008] : 4 ok
      Graal string without UTF-8: A Test [00.005] pass
      Graal string with UTF-8: äöü߀ [00.001] pass
      Graal string without UTF-8, Java adds UTF-8: “A Test” [00.001] pass
      Graal string with UTF-8, Java adds UTF-8: “äöü߀” [00.001] pass

To get a result with my code that looks correct I have to:

  1. Create the DLL with -J-Dfile.encoding=UTF-8. Otherwise at least some output is broken.
  2. Set the console code page to 65001: chcp 65001

Without the -J-Dfile.encoding=UTF-8 option, the output is:

main thread pointer 0000000001581E80\n
Cur attach thread pointer same 0000000001581E80\n
file.encoding:Cp1252
String passed as input is:
A Test
file.encoding:Cp1252
String passed as input is:
äöü߀
file.encoding:Cp1252
String passed as input is:
A Test
file.encoding:Cp1252
String passed as input is:
äöü߀
!

All Tests [00.007] : 2 ok, 2 failed
  Graal UTF-8 Tests [00.006] : 2 ok, 2 failed
     [00.006] : 2 ok, 2 failed
      Graal string without UTF-8: A Test [00.003] pass
      Graal string with UTF-8: äöü߀ [00.001] pass
      Graal string without UTF-8, Java adds UTF-8: “A Test” [00.001] fail @?#?: "Result should match input:A Test" expected: <“A Test”> but was: <?A Test?>
      Graal string with UTF-8, Java adds UTF-8: “äöü߀” [00.001] fail @?#?: "Result should match input:äöü߀" expected: <“äöü߀”> but was: <?äöü߀?>
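
A possible reading of these results, offered as an assumption rather than a confirmed diagnosis: a string that merely passes through Java survives because decoding and re-encoding with the same single-byte code page restores the original bytes, while a character created on the Java side is encoded with the platform code page and is then no longer valid UTF-8 for the caller. A small Python simulation follows; the `through_java` helper is hypothetical and only models the two conversions.

```python
def through_java(raw: bytes, platform_charset: str,
                 prefix: str = "", suffix: str = "") -> bytes:
    """Model a C -> Java -> C round trip that uses one charset in both directions."""
    java_string = raw.decode(platform_charset)    # stands in for toJavaString
    java_string = prefix + java_string + suffix   # text added on the Java side
    return java_string.encode(platform_charset)   # stands in for toCString


# The test string from the report, as UTF-8 bytes supplied by the caller.
original = "\u00e4\u00f6\u00fc\u00df\u20ac".encode("utf-8")

# Pass-through: the bytes come back unchanged even with the wrong charset.
assert through_java(original, "cp1252") == original

# Java adds curly quotes: with cp1252 they come back as single bytes
# (0x93 and 0x94), which a UTF-8 caller cannot decode.
wrapped = through_java(original, "cp1252", "\u201c", "\u201d")
print(ascii(wrapped.decode("utf-8", errors="replace")))

# With UTF-8 on both sides, the added quotes survive.
fixed = through_java(original, "utf-8", "\u201c", "\u201d")
print(ascii(fixed.decode("utf-8")))
```

This would be consistent with the first two tests passing under Cp1252 while the two tests in which Java adds characters fail.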
Actions #10

Updated by Omar Siam 11 months ago

I also see differences between building with different code pages active:

chcp 65001
native-image -H:Name=libtestutf8 -J-Dfile.encoding=UTF-8 --shared

(looks best to me) and

chcp 1252
native-image -H:Name=libtestutf8 -J-Dfile.encoding=UTF-8 --shared
Actions #11

Updated by O'Neil Delpratt 11 months ago

Hi,

Apologies, I was away last week, so I have not been able to take the lead on this issue. Thanks for everyone's comments. I will investigate further to see how we can handle the encoding correctly on Windows.

Actions #12

Updated by O'Neil Delpratt 11 months ago

  • Category set to Saxon-C Internals
  • Status changed from New to In Progress
  • Assignee set to O'Neil Delpratt
  • Priority changed from Low to Normal
  • Found in version set to 12.3

I have managed to reproduce the encoding issue on my Windows machine. Now investigating.

Actions #13

Updated by O'Neil Delpratt 11 months ago

Just to report back that I tried the option -J-Dfile.encoding=UTF-8 in our native-image build for SaxonC and it works. I am seeing the correct encoding on Windows.

Actions #14

Updated by Martin Honnen 11 months ago

Great to hear.

So it seems Omar kind of both discovered the bug and the possible fix.

Now the other users only need 12.4 with the fix applied :).

Actions #15

Updated by O'Neil Delpratt 11 months ago

Ideally, I think we should make the patch work as a runtime property instead of a compile-time one. I will investigate whether and how that can be done.

Actions #16

Updated by Omar Siam 11 months ago

So did I understand that correctly: the Java glue code for the Saxon-HE-C library is not publicly available? I would go with the compile-time fix for now, but I can't recreate libsaxon-hec-12.3.dll.

Actions #17

Updated by Martin Honnen 11 months ago

Omar,

I don't build SaxonC, so I don't know whether all the code needed to do that is online, but part of the glue code, similar to the C/Pascal string conversions you have shown, seems to be in https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_12_3/entry/src/main/java/net/sf/saxon/option/cpp/SaxonCAPI.java#L939, I think.

Actions #18

Updated by Omar Siam 11 months ago

Thanks for pointing me in the right direction! I hope you find a more flexible solution. I think on Windows it will be necessary to communicate the character set depending on the needs of the application that calls the DLL.

Actions #19

Updated by Omar Siam 11 months ago

The https://saxonica.plan.io/projects/saxonmirrorhe/repository/he repository contains almost everything needed to rebuild Saxon HE C 12.
The problem is that it contains too much of the original code for PE and EE:

  • imports that can not be resolved of course
  • code that is behind "IFDEFs" but is not stripped when building
  • tests that can not run or can not succeed on HE

There is only one file missing that can not be generated without a commercial license: function-library.sef.xml. But this file for a particular version can be found in the Saxon J HE 12 jar file.
I cloned the repository and committed my deletions to a branch. If anyone wants to do something with that code feel free to contact me. I am reluctant to push it to a public repo on a service like GitHub.

Actions #20

Updated by Norm Tovey-Walsh 11 months ago

It's supposed to be possible to build with the HE repo, but it's not been tested in a long time. You can't build the files under src/main/java/... because there's a bunch of preprocessing that has to be done to avoid the PE/EE parts in the HE build. Try the eejRelease target. I'll take a look when I have a moment.

Actions #21

Updated by O'Neil Delpratt 10 months ago

Hi, setting the output encoding might be a workaround for your current problem, i.e. setProperty("!encoding", "UTF-8"). The '!' symbol enforces the setting of serialization properties.

For example, in C++, calling applyTemplatesReturningString or applyTemplatesReturningFile enforces the use of the properties set on the Saxon serializer:

executable->setProperty("!encoding", "UTF-8");
const char *result = executable->applyTemplatesReturningString();

Unfortunately, this does not work for applyTemplatesReturningValue() followed by getStringValue(), because that path uses the platform default encoding and does not involve the serializer.

For more information on properties see: https://www.saxonica.com/saxon-c/documentation12/index.html#!configuration/xslt And for serialization properties see: https://www.saxonica.com/documentation12/index.html#!extensions/output-extras

I am working on allowing the encoding to be set at the point where a string is passed in. This will be available in the next maintenance release.

Actions #22

Updated by Martin Honnen 10 months ago

Should that workaround to set the serialization output encoding also work with Python under Windows?

For me it seems it doesn't; code like

from saxonche import *

xslt1 = '''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:template name="xsl:initial-template">
    <root>
      <text xml:lang="de">Der Preis ist höher als 300€.</text>
    </root>
  </xsl:template>
</xsl:stylesheet>
'''

xslt2 = '''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all"
  expand-text="yes">
  <xsl:param name="text" as="xs:string" required="yes"/>
  <xsl:template name="xsl:initial-template">
    <root>
      <text>{$text}</text>
    </root>
  </xsl:template>
</xsl:stylesheet>
'''

with PySaxonProcessor(license=False) as saxon:
    xslt30_processor = saxon.new_xslt30_processor()

    xslt_executable = xslt30_processor.compile_stylesheet(stylesheet_text=xslt1)

    xslt_executable.set_property('!encoding', 'UTF-8')

    print(xslt_executable.call_template_returning_string())

    xslt_executable = xslt30_processor.compile_stylesheet(stylesheet_text=xslt2)
    xslt_executable.set_property('!encoding', 'UTF-8')
    xslt_executable.set_parameter('text', saxon.make_string_value('Der Preis ist höher als 300€.'))

    print(xslt_executable.call_template_returning_string())


outputs e.g.

<?xml version="1.0" encoding="UTF-8"?><root><text xml:lang="de">Der Preis ist höher als 300€.</text></root>
<?xml version="1.0" encoding="UTF-8"?><root><text>Der Preis ist höher als 300€.</text></root>
Actions #23

Updated by O'Neil Delpratt 10 months ago

It does not work in the Python API because we encode the strings to UTF-8, and then the C++-to-Java boundary uses the platform default. The solution is to allow the user to pass in the encoding of the input string or file. This should be available in the next maintenance release.

Actions #24

Updated by O'Neil Delpratt 10 months ago

  • % Done changed from 0 to 90

I have applied a patch to this encoding issue.

The encoding of the output is handled by the property on the Java Serialiser class (i.e. set_property('!encoding', 'UTF-8')). However for the input string the user can now specify the encoding as an argument. For example:

node = parse_xml(xml_text="<doc><e>Schrödinger</e></doc>", encoding="latin1")
print(node.get_string_value(encoding="latin1"))

At the Java boundary we now use the following GraalVM method to convert strings across the boundary:

toJavaString(CCharPointer cString, UnsignedWord length, Charset charset)

This will be available in the next maintenance release. I have done some preliminary testing, but I still need to do some testing on the Windows platform.
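
For readers unfamiliar with the idea behind a length-plus-charset conversion, here is a rough Python analogue using `ctypes`. It only illustrates decoding a C buffer with an explicit length and charset; `to_host_string` is a hypothetical name, and this is not SaxonC's or GraalVM's code.

```python
import ctypes


def to_host_string(c_buffer, length: int, charset: str) -> str:
    """Read exactly `length` bytes from a C char buffer and decode them with `charset`."""
    raw = ctypes.string_at(c_buffer, length)
    return raw.decode(charset)


# Bytes as a C caller might hand them over, here encoded as Latin-1.
payload = "Schr\u00f6dinger".encode("latin1")
buf = ctypes.create_string_buffer(payload)  # NUL-terminated copy of the bytes

# The caller states the charset, so nothing depends on the platform default.
decoded = to_host_string(buf, len(payload), "latin1")
print(ascii(decoded))
```

The design point is that the charset travels with the call instead of being inferred from the operating system, which is what the new `encoding` argument makes possible.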

Actions #25

Updated by Martin Honnen 8 months ago

What should be the default encoding used with Python and SaxonC HE 12.4 under Windows?

Because

    node = saxon_proc.parse_xml(xml_text="<root><text xml:lang='de'>Der Preis ist höher als 300€</text></root>")

    print(node)

still gives

<root>
   <text xml:lang="de">Der Preis ist höher als 300€</text>
</root>

only explicitly doing

    node = saxon_proc.parse_xml(xml_text="<root><text xml:lang='de'>Der Preis ist höher als 300€</text></root>", encoding="utf8")

    print(node)

gives me the wanted output

<root>
   <text xml:lang="de">Der Preis ist höher als 300€</text>
</root>

The Python code file is encoded by the IDE as UTF-8.

Actions #26

Updated by Martin Honnen 8 months ago

Also looking further through the Saxon 12.4 API documentation, I kind of wonder whether the API for XQuery and XPathProcessor shouldn't also have been updated to take an encoding parameter when compiling code from a string. It seems that the problem identified for strings and change done on XSLT compilation from a string as well as XdmAtomicValue creation from a string as well as XML parsing from a string kind of is also needed for XPath or XQuery compilation/evaluation where the code is passed in as a string.

Or is that somehow covered elsewhere?

Actions #27

Updated by O'Neil Delpratt 8 months ago

Martin Honnen wrote in #note-25:

What should be the default encoding used with Python and SaxonC HE 12.4 under Windows?

The Python code file is encoded by the IDE as UTF-8.

The default might be determined by the platform, not just by the IDE.
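
To see which defaults are in play on a given machine, a short Python check can help. This sketch only reports what the Python interpreter sees; it says nothing about what SaxonC uses internally, and the function name is illustrative.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings Python itself would use by default."""
    return {
        # Default for str.encode()/bytes.decode(); 'utf-8' on Python 3.
        "str default": sys.getdefaultencoding(),
        # Platform/locale default, e.g. 'cp1252' on many Windows setups.
        "locale preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of the console stream, if it exposes one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

A source file saved as UTF-8 by the IDE affects only how string literals are read; the locale value is the platform default, which can differ.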

Actions #28

Updated by O'Neil Delpratt 8 months ago

Martin Honnen wrote in #note-26:

Also looking further through the Saxon 12.4 API documentation, I kind of wonder whether the API for XQuery and XPathProcessor shouldn't also have been updated to take an encoding parameter when compiling code from a string. It seems that the problem identified for strings and change done on XSLT compilation from a string as well as XdmAtomicValue creation from a string as well as XML parsing from a string kind of is also needed for XPath or XQuery compilation/evaluation where the code is passed in as a string.

Or is that somehow covered elsewhere?

Yes I agree with your comment about XPath and XQuery. I will make the changes widespread across the other processors.

For the XdmAtomicValue creation I think it already supports the passing of the encoding keyword. i.e. make_string_value.

What is clear is the documentation needs updating to include details about the encoding.

Actions #29

Updated by Martin Honnen 8 months ago

It seems the C++ DocumentBuilder has support for an encoding parameter in https://www.saxonica.com/saxon-c/doc12/html/classDocumentBuilder.html#a5a0d41549f95d99648174d87bba35bcd; however, both the API documentation for the Python DocumentBuilder https://www.saxonica.com/saxon-c/doc12/html/saxonc.html#PyDocumentBuilder-parse_xml and a quick test suggest that the parse_xml function in the Python API of 12.4 still lacks an encoding parameter.

from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as saxon_proc:
    print(saxon_proc.version)

    doc_builder = saxon_proc.new_document_builder()

    xml_markup = """<root>
    <p xml:lang="en">This is a test: price is 300 €.</p>
    <p xml:lang="de">Dies ist ein Test: der Preis ist höher als 300 €.</p>
</root>"""

    xdm_node1 = saxon_proc.parse_xml(xml_text=xml_markup, encoding="utf-8")

    print(xdm_node1)

    xdm_node2 = doc_builder.parse_xml(xml_text=xml_markup, encoding="utf-8")

    print(xdm_node2)

outputs

SaxonC-HE 12.4 from Saxonica
<root>
   <p xml:lang="en">This is a test: price is 300 €.</p>
   <p xml:lang="de">Dies ist ein Test: der Preis ist höher als 300 €.</p>
</root>
Traceback (most recent call last):
  File "C:\Users\marti\PycharmProjects\SaxonCHE124DocBuilderParseXmlEncTest1\main.py", line 17, in <module>
    xdm_node2 = doc_builder.parse_xml(xml_text=xml_markup, encoding="utf-8")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_saxon\saxonc.pyx", line 1071, in saxonche.PyDocumentBuilder.parse_xml
Exception: Error: parse_xml should only contain one of the following keyword arguments: (xml_file_name|xml_text|xml_uri)

Let me know if you need a separate bug issue for this, or whether you can fix it while adding encoding parameters where needed.

Actions #30

Updated by O'Neil Delpratt 8 months ago

The bug described in comment #29 is now fixed.

Actions #31

Updated by Omar Siam 8 months ago

Could you please fix test_saxonc.py so that it no longer fails on Windows with the encoding problem?

--- test_saxonc.py.old  2023-12-05 16:59:10.484298500 +0100
+++ test_saxonc.py      2023-12-05 16:52:50.453313200 +0100
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 from tempfile import mkstemp
 import pytest
 from saxonche import *
@@ -241,9 +242,9 @@
             assert False

 def testUTF8(saxonproc):
-    node = saxonproc.parse_xml(xml_text="<doc><e>تيست</e></doc>")
+    node = saxonproc.parse_xml(xml_text="<doc><e>تيست</e></doc>", encoding="UTF-8")
     trans = saxonproc.new_xslt30_processor()
-    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>")
+    executable = trans.compile_stylesheet(stylesheet_text="<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template match='e'>UTF8-تيست: <xsl:value-of select='.'/></xsl:template></xsl:stylesheet>", encoding="UTF-8")
     assert node is not None
     assert isinstance(node, PyXdmNode)
     assert len(node.children)>0
@@ -251,6 +252,7 @@
     assert eNode is not None
     executable.set_global_context_item(xdm_item=node)
     executable.set_initial_match_selection(xdm_value=eNode)
+    executable.set_property("!encoding", "UTF-8")
     result = executable.apply_templates_returning_string()
     assert result is not None
     assert "UTF8-تيست: تيست" in result
@@ -509,7 +511,7 @@
 def testXdmDestinationWithItemSeparator(saxonproc):
     trans = saxonproc.new_xslt30_processor()
     stylesheetStr = "<xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'><xsl:template name='go'><xsl:comment>A</xsl:comment><out/><xsl:comment>Z</xsl:comment></xsl:template><xsl:output method='xml' item-separator='§'/></xsl:stylesheet>"
-    executable = trans.compile_stylesheet(stylesheet_text=stylesheetStr)
+    executable = trans.compile_stylesheet(stylesheet_text=stylesheetStr, encoding="UTF-8")
     root = executable.call_template_returning_value("go")
     node  = root.head
Actions #32

Updated by O'Neil Delpratt 8 months ago

Yes, sure. I have made the change.

Actions #33

Updated by Omar Siam 7 months ago

Works for me now. I changed my FreePascal implementation of the glue code to work with Saxon C HE 12.4.1, and it works as expected. I found another place where a C++ test does not work because the new settings are missing.

diff --git "a/V:\\libsaxon-HEC-windows-amd64-v12.4.1\\samples\\cppTests\\testXSLT30.cpp.old" "b/V:\\libsaxon-HEC-windows-amd64-v12.4.1\\samples\\cppTests\\testXSLT30.cpp"
index b1e7cba..14d325d 100644
--- "a/V:\\libsaxon-HEC-windows-amd64-v12.4.1\\samples\\cppTests\\testXSLT30.cpp.old"
+++ "b/V:\\libsaxon-HEC-windows-amd64-v12.4.1\\samples\\cppTests\\testXSLT30.cpp"
@@ -2806,7 +2806,7 @@ void testUTF8StringTemplate(SaxonProcessor *proc, Xslt30Processor *trans,
       "xmlns:xs='http://www.w3.org/2001/XMLSchema'  version='3.0'>  "
       "<xsl:template match='*'>     <xsl:sequence select='&apos;تيست&apos;'/>  </xsl:template>  </xsl:stylesheet>";
   cout << endl << "Test:testUTF8StringTemplate" << endl;
-  XsltExecutable *executable = trans->compileFromString(source);
+  XsltExecutable *executable = trans->compileFromString(source, "UTF-8");
   if (executable == nullptr) {
     if (trans->exceptionOccurred()) {
       cout << "Error: " << trans->getErrorMessage() << endl;
@@ -2814,12 +2814,13 @@ void testUTF8StringTemplate(SaxonProcessor *proc, Xslt30Processor *trans,
     return;
   }
   const char* _in = "<?xml version='1.0' encoding='UTF8'?><e>تيست</e>";
-  XdmNode *node = proc->parseXmlFromString(_in);
+  XdmNode *node = proc->parseXmlFromString(_in, "UTF-8");
   executable->setResultAsRawValue(false);
   std::map<std::string, XdmValue *> parameterValues;
 
   executable->setInitialTemplateParameters(parameterValues, false);
   executable->setInitialMatchSelection(node);
+  executable->setProperty("!encoding", "UTF-8");
   XdmValue *result = executable->applyTemplatesReturningValue();
   if (result != nullptr) {
     sresult->success++;
Actions #34

Updated by O'Neil Delpratt 7 months ago

Thanks for the feedback.

Martin Honnen wrote in #note-26:

Also looking further through the Saxon 12.4 API documentation, I kind of wonder whether the API for XQuery and XPathProcessor shouldn't also have been updated to take an encoding parameter when compiling code from a string. It seems that the problem identified for strings and change done on XSLT compilation from a string as well as XdmAtomicValue creation from a string as well as XML parsing from a string kind of is also needed for XPath or XQuery compilation/evaluation where the code is passed in as a string.

Or is that somehow covered elsewhere?

The XQuery and XPath Processors have now been updated to accept the encoding parameter.

Still testing before we close off this bug issue.
