Bug #6353

open

Unicode characters in filenames are causing errors in Windows

Added by Matt Patterson 9 months ago. Updated 5 months ago.

Status: In Progress
Priority: Low
Category: Python
Start date: 2024-02-20
Due date:
% Done: 0%
Estimated time:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Found in version: 12.4.2
Fixed in version: 12.5
SaxonC Languages:
SaxonC Platforms:
SaxonC Architecture:

Description

As reported by a user in this SO post comment: https://stackoverflow.com/questions/77962974/saxon-xslt-processing-thousands-of-xml-files-in-a-complex-tree-structure/77963410?noredirect=1#comment137525632_77963410

Thanks to silfer1200 and Martin Honnen for reporting.

If you pass a filename containing a Unicode character with a multi-byte UTF-8 representation into SaxonC's Python layer, the name gets mangled: it looks as though the string is decomposed into a byte stream and then recomposed into a string with each byte treated as a complete character.
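The suspected mangling can be reproduced in pure Python without SaxonC. This is only a sketch of the hypothesis: the helper name `mangle` is made up for illustration, and using Latin-1 for the one-byte-per-character step is an assumption, not a confirmed detail of what SaxonC does internally.

```python
def mangle(name: str) -> str:
    """Encode to UTF-8, then wrongly rebuild the string one byte per character.

    Latin-1 maps every byte 0x00-0xFF to the code point with the same value,
    so it models "each byte considered a complete character" (an assumption
    about the bug, not SaxonC's actual code).
    """
    return name.encode("utf-8").decode("latin-1")


original = "köln.xml"
mangled = mangle(original)

# 'ö' (U+00F6) is two bytes in UTF-8 (0xC3 0xB6), so the mangled name
# gains one character: 'Ã' (U+00C3) followed by '¶' (U+00B6).
print(len(original), len(mangled))
print(ascii(mangled))

# The damage is reversible, which is a useful check when diagnosing
# a path that appears in an error message.
assert mangled.encode("latin-1").decode("utf-8") == original
```

A pure-ASCII filename passes through this round trip unchanged, so under this hypothesis only names with multi-byte UTF-8 characters would be affected.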

Given a minimal test setup consisting of the following Python script and XML file, the script fails every time it is run on Windows. On macOS it runs without error.

test.py:

import os
import sys
from saxonche import PySaxonProcessor

dir_path = os.path.dirname(os.path.realpath(__file__))

print(sys.getdefaultencoding())
print(sys.getfilesystemencoding())

with PySaxonProcessor() as saxon_proc:
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
    print(xml)

köln.xml:

<?xml version="1.0" encoding="utf-8"?>
<hello>Köln</hello>

Windows:

(test-venv) C:\Saxonica\unicode>python test.py
utf-8
utf-8
Traceback (most recent call last):
  File "C:\Saxonica\unicode\test.py", line 11, in <module>
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_saxon\saxonc.pyx", line 868, in saxonche.PySaxonProcessor.parse_xml
saxonche.PySaxonApiError: Unable to resolve <C:\Saxonica\unicode\köln.xml> into a Source. Line number: -1

macOS:

$ python test.py
utf-8
utf-8
<hello>Köln</hello>
<hello>Köln</hello>
