Bug #6353

open

Unicode characters in filenames are causing errors in Windows

Added by Matt Patterson 5 months ago. Updated 13 days ago.

Status:
In Progress
Priority:
Low
Category:
Python
Start date:
2024-02-20
Due date:
% Done:
0%

Estimated time:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Found in version:
12.4.2
Fixed in version:
12.5
SaxonC Languages:
SaxonC Platforms:
SaxonC Architecture:

Description

As reported by a user in this SO post comment: https://stackoverflow.com/questions/77962974/saxon-xslt-processing-thousands-of-xml-files-in-a-complex-tree-structure/77963410?noredirect=1#comment137525632_77963410

Thanks to silfer1200 and Martin Honnen for reporting.

If you pass a filename containing a Unicode character with a multi-byte UTF-8 representation into SaxonC's Python layer, the name gets mangled. It looks like the string is decomposed into a byte stream and then recomposed into a string with each byte treated as a complete character.
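The symptom can be reproduced in plain Python without SaxonC. This is a minimal sketch of the suspected pattern, not the actual code path: it assumes a single-byte decoding (Latin-1 here) stands in for whatever the native layer does with the bytes.

```python
# Illustration only: reproduces the mangling pattern in pure Python.
# It does not call SaxonC; the Latin-1 step is an assumption about what
# "each byte considered a complete character" amounts to.
name = "köln.xml"

# Step 1: the string becomes its UTF-8 byte sequence ("ö" takes two bytes).
utf8_bytes = name.encode("utf-8")

# Step 2 (assumed): every byte is then read back as one character.
# Latin-1 maps each byte 0x00-0xFF to the code point with the same value.
mangled = utf8_bytes.decode("latin-1")

print(len(name), len(utf8_bytes), len(mangled))
print(mangled)
```

The mangled name is one character longer than the original because the two bytes of "ö" come back as two separate characters, which matches the path shown in the Windows error message below.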

Given a very simple test setup with the following XML file and Python script, this fails every time it is run on Windows. On macOS it works.

test.py:

import os
import sys
from saxonche import PySaxonProcessor

dir_path = os.path.dirname(os.path.realpath(__file__))

print(sys.getdefaultencoding())
print(sys.getfilesystemencoding())

with PySaxonProcessor() as saxon_proc:
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
    print(xml)

köln.xml:

<?xml version="1.0" encoding="utf-8"?>
<hello>Köln</hello>

Windows:

(test-venv) C:\Saxonica\unicode>python test.py
utf-8
utf-8
Traceback (most recent call last):
  File "C:\Saxonica\unicode\test.py", line 11, in <module>
    xml = saxon_proc.parse_xml(xml_file_name=os.path.join(dir_path, 'köln.xml'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python_saxon\saxonc.pyx", line 868, in saxonche.PySaxonProcessor.parse_xml
saxonche.PySaxonApiError: Unable to resolve <C:\Saxonica\unicode\köln.xml> into a Source. Line number: -1

macOS:

$ python test.py
utf-8
utf-8
<hello>Köln</hello>
<hello>Köln</hello>
Actions #1

Updated by Matt Patterson 5 months ago

  • Assignee changed from O'Neil Delpratt to Matt Patterson
Actions #2

Updated by Martin Honnen 5 months ago

Hi Matt, thanks for looking into this and opening the bug issue. It covers only one part of the issues encountered in that StackOverflow thread and discussion, although the underlying cause is probably related. If you have already identified the cause and solution, perhaps the other part (using a non-ASCII character in an output file name doesn't give an error but mangles the file name) will be resolved as well. But do you need a separate issue on the output file name mangling? For the input file handling I have found a workaround from the Python API, but for the output file I don't know how to get the current Saxon release (on Windows) to write the file with the wanted non-ASCII character(s) instead of mangling them.

Actions #3

Updated by Matt Patterson 5 months ago

Thanks Martin - the issue with output files has the same cause, and I don't think they need to be separated out at the moment (that may change depending on whether my proposed fix is viable for the C++ layer).

Actions #5

Updated by Matt Patterson about 2 months ago

  • Status changed from New to In Progress

I can confirm that the problem has been masked by changes to build settings, and this will work in the next maintenance release. The underlying problem (not explicitly passing the encoding for all strings) still needs some work to ensure there are no places where that still happens.
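To illustrate why leaving the encoding implicit is fragile, here is a small sketch in plain Python. It assumes cp1252 as an example of a non-UTF-8 Windows code page; the actual SaxonC code paths and the encodings involved there are not shown in this issue.

```python
# Illustration only: the same string yields different bytes under
# different encodings, so both sides of a boundary must agree explicitly.
name = "köln.xml"

as_utf8 = name.encode("utf-8")      # "ö" is encoded as two bytes
as_cp1252 = name.encode("cp1252")   # "ö" is encoded as one byte (assumed code page)

# A round trip is only safe when the same encoding is named on both sides.
round_trip = as_utf8.decode("utf-8")

# Bytes produced with one encoding may not even be valid in another.
try:
    as_cp1252.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(as_utf8, as_cp1252, round_trip == name, decoded_ok)
```

Naming the encoding at every conversion removes the dependency on platform or build defaults, which is the kind of dependency that would explain a difference between Windows and macOS.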

Actions #6

Updated by Matt Patterson 13 days ago

  • Fixed in version set to 12.5

This shouldn't manifest in the 12.5 Maintenance release, but we're leaving this issue open until we've fully revisited the underlying string-handling issue.
