Bug #6353
open
Unicode characters in filenames are causing errors in Windows
Description
As reported by a user in this SO post comment: https://stackoverflow.com/questions/77962974/saxon-xslt-processing-thousands-of-xml-files-in-a-complex-tree-structure/77963410?noredirect=1#comment137525632_77963410
Thanks to silfer1200 and Martin Honnen for reporting.
If you pass a filename containing a Unicode character that has a multi-byte representation in UTF-8 into SaxonC's Python layer, the name gets mangled: it looks like the string is decomposed into a byte stream and then recomposed into a string with each byte treated as a complete character.
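The mangling described above can be reproduced in pure Python. This is only a sketch of the suspected symptom, not of SaxonC's actual code path: the helper name `mangle` is hypothetical, and Latin-1 as the single-byte decoding is an assumption chosen for illustration.

```python
def mangle(name: str) -> str:
    """Encode to UTF-8, then decode every byte as one character.

    Latin-1 is an assumed codec; it maps each byte to exactly one code point.
    """
    return name.encode("utf-8").decode("latin-1")


original = "k\u00f6ln.xml"  # 'köln.xml'
mangled = mangle(original)

# ascii() keeps the output readable regardless of the console encoding.
print(ascii(original), len(original))
print(ascii(mangled), len(mangled))
```

If this is what happens, the two bytes that encode `ö` each become a separate character, so the mangled name is one character longer than the original and no longer matches any file on disk.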
With a very simple test setup consisting of the following XML file and Python script, the script fails every time it is run on Windows. On macOS it works fine.
test.py:
import os
import sys
from saxonche import PySaxonProcessor

dir_path = os.path.dirname(os.path.realpath(__file__))

print(sys.getdefaultencoding())
print(sys.getfilesystemencoding())

with PySaxonProcessor() as saxon_proc:
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
    print(xml)
köln.xml:
<?xml version="1.0" encoding="utf-8"?>
<hello>Köln</hello>
Windows:
(test-venv) C:\Saxonica\unicode>python test.py
utf-8
utf-8
Traceback (most recent call last):
  File "C:\Saxonica\unicode\test.py", line 11, in <module>
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_saxon\saxonc.pyx", line 868, in saxonche.PySaxonProcessor.parse_xml
saxonche.PySaxonApiError: Unable to resolve <C:\Saxonica\unicode\köln.xml> into a Source. Line number: -1
macOS:
$ python test.py
utf-8
utf-8
<hello>Köln</hello>
<hello>Köln</hello>