Bug #6353

open

Unicode characters in filenames are causing errors in Windows

Added by Matt Patterson 2 months ago. Updated 2 months ago.

Status: New
Priority: Low
Category: Python
Start date: 2024-02-20
Due date:
% Done: 0%
Estimated time:
Found in version: 12.4.2
Fixed in version:
Platforms:

Description

As reported by a user in this SO post comment: https://stackoverflow.com/questions/77962974/saxon-xslt-processing-thousands-of-xml-files-in-a-complex-tree-structure/77963410?noredirect=1#comment137525632_77963410

Thanks to silfer1200 and Martin Honnen for reporting.

If you pass a filename containing a Unicode character that has a multi-byte representation in UTF-8 into SaxonC's Python layer, the name gets mangled. It looks like the string is decomposed into a byte stream and then recomposed into a string with each byte treated as a complete character.
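The suspected mangling can be reproduced in pure Python, independent of SaxonC. This is only a sketch of the symptom: it assumes the single-byte interpretation is Latin-1, which has not been confirmed as what happens inside the library, and `mangle` is a hypothetical helper name used for illustration.

```python
def mangle(name: str) -> str:
    """Encode to UTF-8, then read every byte back as one Latin-1 character."""
    utf8_bytes = name.encode("utf-8")    # 'ö' becomes two bytes: 0xC3 0xB6
    return utf8_bytes.decode("latin-1")  # each byte is treated as a whole character


original = "köln.xml"
mangled = mangle(original)

print(original.encode("utf-8"))      # b'k\xc3\xb6ln.xml'
print(mangled)                       # kÃ¶ln.xml
print(len(original), len(mangled))   # 8 9
```

The mangled name is one character longer than the original because the two UTF-8 bytes of `ö` each turn into a separate character.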

Given a very simple test setup with the following XML file and Python script, this errors out every time it is run on Windows. On macOS it works fine.

test.py:

import os
import sys
from saxonche import PySaxonProcessor

dir_path = os.path.dirname(os.path.realpath(__file__))

print(sys.getdefaultencoding())
print(sys.getfilesystemencoding())

with PySaxonProcessor() as saxon_proc:
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
    print(xml)

köln.xml:

<?xml version="1.0" encoding="utf-8"?>
<hello>Köln</hello>

Windows:

(test-venv) C:\Saxonica\unicode>python test.py
utf-8
utf-8
Traceback (most recent call last):
  File "C:\Saxonica\unicode\test.py", line 11, in <module>
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_saxon\saxonc.pyx", line 868, in saxonche.PySaxonProcessor.parse_xml
saxonche.PySaxonApiError: Unable to resolve <C:\Saxonica\unicode\köln.xml> into a Source. Line number: -1

macOS:

$ python test.py
utf-8
utf-8
<hello>Köln</hello>
<hello>Köln</hello>
Actions #1

Updated by Matt Patterson 2 months ago

  • Assignee changed from O'Neil Delpratt to Matt Patterson
Actions #2

Updated by Martin Honnen 2 months ago

Hi Matt, thanks for looking into this and opening the issue. It covers only one part of the problems encountered in that StackOverflow thread and discussion, although the underlying cause is probably related. If you have already identified the cause and a solution, perhaps the other part will be resolved as well: using a non-ASCII character in an output file name doesn't give an error but mangles the file name. Do you need a separate issue for the output file name mangling? For the input file handling I have found a workaround from the Python API, but for the output file I don't know how to get the current Saxon release (on Windows) to write the file with the wanted non-ASCII character(s) instead of mangling them.
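Until a fix is available, one conceivable stopgap for mangled output names is to repair them after the transformation has run. The sketch below assumes the mangling is UTF-8 bytes reinterpreted as Windows-1252 characters, which has not been confirmed in this issue. `repair_name` is a hypothetical helper and not part of saxonche.

```python
def repair_name(name: str) -> str:
    """Undo UTF-8-read-as-cp1252 mangling.

    Assumption: the mangled name was produced by decoding UTF-8 bytes
    as Windows-1252. Names that do not fit that pattern are returned unchanged.
    """
    try:
        return name.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the name was probably never mangled, so leave it alone.
        return name


print(repair_name("kÃ¶ln.xml"))   # köln.xml
print(repair_name("köln.xml"))    # köln.xml (already correct, left as-is)
```

A caller could run each entry from `os.listdir()` through the helper and call `os.rename` only when the result differs from the original name.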

Actions #3

Updated by Matt Patterson 2 months ago

Thanks Martin - the issue with output files has the same cause, and I don't think they need to be separated out at the moment (that may change depending on whether my proposed fix is viable for the C++ layer).
