Bug #6469 (closed)
SXXP0003 when parsing XHTML file with saxoncee 12.5.0
Description
I'm encountering an error when trying to parse an XHTML file using Python saxoncee version 12.5.0. The error message is as follows:
File "python_saxon/saxonc.pyx", line 935, in saxoncee.PySaxonProcessor.parse_xml
saxoncee.PySaxonApiError: SXXP0003: I/O error reported by XML parser processing file:///test.html. Line number: -1
Steps to reproduce:
- Use Python saxoncee 12.5.0
- Attempt to parse an XHTML file (attached)
Expected behavior: The XHTML file should be parsed successfully.
Actual behavior: An I/O error is reported by the XML parser, with no specific line number indicated.
Environment:
- Python version: 3.12
- Operating System: macOS 14.4.1 (23E224)
- saxoncee version: 12.5.0
Can you please help investigate this issue?
Updated by Michael Kay 5 months ago
- Project changed from Saxon to SaxonC
- SaxonC Languages Python added
- SaxonC Platforms All added
You'll need to tell us how you are invoking the parsing. It's most likely to be some error in your use of the API, for example supplying the content of the file to an API that expects its URI, or vice versa.
Updated by Gregorio Pellegrino 5 months ago
I have been using the same code for a year and a half on many XHTML files and it has never crashed.
The code is as follows:
from saxoncee import PySaxonProcessor
saxon = PySaxonProcessor(license=True)
saxon_source = saxon.parse_xml(xml_file_name="test.html")
Updated by O'Neil Delpratt 5 months ago
- Fix Committed on Branch 12 added
Hi,
That particular example works for me. I am using the same macOS version with Python version 3.12.2. It's a strange one. I wonder if there is some other call to the web happening?
Updated by Michael Kay 5 months ago
- Assignee set to O'Neil Delpratt
Thanks for the additional info. It's useful to know that this code worked previously.
Assigned to O'Neil.
I suspect the "fix committed" flag was set in error, but I'm not sure how to clear it.
Updated by Matt Patterson 5 months ago
- Assignee changed from O'Neil Delpratt to Matt Patterson
I also cannot reproduce this with 12.5.0 and macOS 14.5 using the supplied script.
However, I can see that an HTTP connection to www.w3.org is made to retrieve the XHTML DTD.
If I prevent network connections being made to www.w3.org, then the reported crash occurs.
Off the top of my head, I can't remember if that's the correct behaviour (fetching the DTD by default) or not, but it was also happening in 12.4.0 which I also just tested.
I'll investigate that, but otherwise it looks like you were the victim of w3.org being briefly unreachable.
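To check which external DTD a given file would cause the parser to fetch, something like the following rough sketch may help. It only inspects the DOCTYPE declaration with a regular expression and does not involve SaxonC at all; the function names and the sample document are hypothetical, since the attached file's actual DOCTYPE is not shown in this thread.

```python
import re
from urllib.parse import urlparse

# Matches <!DOCTYPE root PUBLIC "public-id" "system-id"> and
# <!DOCTYPE root SYSTEM "system-id">, capturing the quoted system identifier.
_DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+[^\s>\[]+\s+'
    r'(?:PUBLIC\s+(?:"[^"]*"|\'[^\']*\')\s+|SYSTEM\s+)'
    r'(?P<system>"[^"]*"|\'[^\']*\')'
)


def external_dtd_uri(xml_text):
    """Return the DOCTYPE system identifier, or None if no external DTD is named."""
    match = _DOCTYPE_RE.search(xml_text)
    if match is None:
        return None
    return match.group("system")[1:-1]  # strip the surrounding quotes


def fetches_from_w3c(xml_text):
    """True if the external DTD lives on www.w3.org (the host seen in the trace)."""
    uri = external_dtd_uri(xml_text)
    return uri is not None and urlparse(uri).hostname == "www.w3.org"


# Hypothetical sample; an XHTML 1.0 Strict DOCTYPE is assumed for illustration.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"/>'
)
```

A file for which `fetches_from_w3c` returns True would be expected to trigger the network request described above unless a local catalog resolves the DTD.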
Updated by Norm Tovey-Walsh 5 months ago
Saxonica Developer Community notifications@plan.io writes:
Off the top of my head, I can't remember if that's the correct behaviour (fetching the DTD by default) or not, but it was also happening in 12.4.0 which I also just tested.
We bundle the resolver classes, so I don’t think that’s the expected behavior.
I'll investigate that, but otherwise it looks like you were the victim of w3.org being briefly unreachable.
Worse than that, the W3C will throttle your IP if you make too many requests so it’s really important to get the resources locally.
Be seeing you,
norm
--
Norm Tovey-Walsh
Saxonica
Updated by Matt Patterson 5 months ago
- Category set to Saxon-C Internals
- Status changed from New to In Progress
- SaxonC Languages All added
- SaxonC Languages deleted (Python)
There's a definite difference in behaviour between SaxonC and SaxonJ. This is a bug: we should be including, and using, a copy of the XHTML (and other common W3C) DTDs in SaxonC.
Updated by Gregorio Pellegrino 5 months ago
Effectively the error appears after 3 or 4 files pass the checks without problems. Then I have to change IP to make the system keep working. So it seems that the W3C server is blocking the requests.
I think caching copies of the documents is a good solution.
In the meantime can you recommend a hotfix to unblock the work?
Updated by Norm Tovey-Walsh 5 months ago
I think caching copies of the documents is a good solution.
In the meantime can you recommend a hotfix to unblock the work?
If you download xmlresolver-5.2.3-data.jar from the Maven repo:
https://repo1.maven.org/maven2/org/xmlresolver/xmlresolver/5.2.3/
Unjar that somewhere and explicitly point to the org/xmlresolver/catalog.xml file as a catalog, that should work.
Be seeing you,
norm
--
Norm Tovey-Walsh
Saxonica
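The "unjar" step can be done from Python as well, since a jar is a zip archive. The following is a minimal sketch using only the standard library; the helper name `extract_catalog` is hypothetical, and it assumes the data jar has already been downloaded from the Maven directory given above.

```python
import zipfile
from pathlib import Path

# Location of the catalog inside the data jar, as given in the note above.
CATALOG_MEMBER = "org/xmlresolver/catalog.xml"


def extract_catalog(jar_path, dest_dir):
    """Unpack the data jar into dest_dir and return the path to catalog.xml.

    Hypothetical helper: raises FileNotFoundError if the jar does not
    contain the expected catalog entry.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(jar_path) as jar:
        if CATALOG_MEMBER not in jar.namelist():
            raise FileNotFoundError(f"{CATALOG_MEMBER} not found in {jar_path}")
        # Extract everything: the catalog refers to the DTD files stored beside it.
        jar.extractall(dest)
    return dest / CATALOG_MEMBER


# Example use (not executed here):
#   catalog = extract_catalog("xmlresolver-5.2.3-data.jar", "resolver-data")
#   then pass str(catalog) to the processor as the catalog location.
```

Extracting the whole archive rather than just `catalog.xml` is deliberate, because the catalog is only useful together with the resources it points to.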
Updated by Martin Honnen 5 months ago
Gregorio Pellegrino wrote in #note-9:
Effectively the error appears after 3 or 4 files pass the checks without problems. Then I have to change IP to make the system keep working. So it seems that the W3C server is blocking the requests.
I think caching copies of the documents is a good solution.
In the meantime can you recommend a hotfix to unblock the work?
If you can't edit the XHTML files to remove the DOCTYPE that references the W3C DTDs, and you don't need the DTDs anyway (meaning your XHTML doesn't use, for instance, any entity references declared in the DTD), then from Python you can configure Saxon to tell the Apache parser not to load external DTDs, e.g.
saxon.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')
Take that only as a workaround, obviously: it will only work if no DTD-declared entity references are used (which, however, is common these days, as most people just use UTF-8 to write all characters directly instead of needing e.g. &auml; to escape ä).
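Before relying on that workaround, it may be worth checking whether a file actually uses entities that only the DTD declares. The sketch below is a rough, regex-based check with hypothetical names; it does not parse the XML, so references inside comments or CDATA sections would be counted too.

```python
import re

# The five entities predefined by XML itself; these never need a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named references such as &auml;. Numeric references like &#228; start
# with '#', so they do not match and are ignored.
_NAMED_REF_RE = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def dtd_entities_used(xml_text):
    """Return the set of named entity references that would need a DTD to resolve."""
    return {
        name
        for name in _NAMED_REF_RE.findall(xml_text)
        if name not in XML_PREDEFINED
    }
```

An empty result suggests that skipping the external DTD should not break entity resolution for that file; a non-empty result lists the names that would become undeclared.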
Updated by Martin Honnen 5 months ago
Gregorio Pellegrino wrote in #note-11:
Does this also work in Python?
I have tested that doing e.g.
from saxonche import *
with PySaxonProcessor() as saxon_proc:
    print(saxon_proc.version)
    saxon_proc.set_catalog(r'org/xmlresolver/catalog.xml')
    for i in range(0, 100):
        xdm_doc = saxon_proc.parse_xml(xml_file_name='test1.xhtml')
        print(xdm_doc.node_kind)
works fine (assuming the xmlresolver data jar has been unjarred in the current working directory, so that it contains a subfolder org with a subfolder xmlresolver holding that catalog.xml).
Updated by Gregorio Pellegrino 5 months ago
Martin Honnen wrote in #note-13:
Gregorio Pellegrino wrote in #note-11:
Does this also work in Python?
I have tested that doing e.g.
from saxonche import *
with PySaxonProcessor() as saxon_proc:
    print(saxon_proc.version)
    saxon_proc.set_catalog(r'org/xmlresolver/catalog.xml')
    for i in range(0, 100):
        xdm_doc = saxon_proc.parse_xml(xml_file_name='test1.xhtml')
        print(xdm_doc.node_kind)
works fine (assuming the xmlresolver data jar has been unjarred in the current working directory, so that it contains a subfolder org with a subfolder xmlresolver holding that catalog.xml).
That's the way it works, thank you. The problem is that now I get the error
saxoncee.PySaxonApiError: Source document not found. Line number: -1
When I try to validate the doc:
validator.validate(xdm_node=xdm_doc)
Can it be related to the catalog I entered?
Updated by O'Neil Delpratt 5 months ago
Please can you supply the complete Python script with the validator where it fails. Thanks
Updated by O'Neil Delpratt 5 months ago
Copied across from another bug issue:
from saxoncee import PySaxonProcessor
saxon = PySaxonProcessor(license=True)
saxon_source = saxon.parse_xml(xml_file_name="test.html")
saxon.set_configuration_property("http://saxon.sf.net/feature/licenseFileLocation", "saxon-license.lic")
saxon.set_configuration_property("xsdversion", "1.1")
saxon.set_catalog("org/xmlresolver/catalog.xml")
validator = saxon.new_schema_validator()
validator.set_property("report-node", "true")
validator.set_property("verbose", "false")
validator.register_schema(xsd_file="my_schema.xsd")
validator.validate(xdm_node=saxon_source)
Updated by O'Neil Delpratt 5 months ago
Please can you send me your file my_schema.xsd
Updated by Gregorio Pellegrino 5 months ago
I cannot share it publicly. May I send it privately via email?
Updated by O'Neil Delpratt 5 months ago
The user sent the xsd file via email and I was able to reproduce the error. Investigating the issue now.
Updated by O'Neil Delpratt 5 months ago
The keyword xdm_node seems broken. As a workaround you can pass the file directly in the validator. See the following example which works for me:
validator.validate(file_name="test.html")
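Until the keyword is fixed, a script that has to run on both affected and fixed versions could try the node-based call first and fall back to the file-based one. This is only a sketch: `validate_with_fallback` is a hypothetical helper, the validator is passed in rather than created here, and the generic `except Exception` stands in for the `PySaxonApiError` reported above so that the snippet does not depend on SaxonC being installed.

```python
def validate_with_fallback(validator, xdm_node, file_name):
    """Validate the parsed node; if that call fails, validate the file instead.

    Returns which keyword succeeded ("xdm_node" or "file_name"). If the
    file-based call also fails, its exception propagates to the caller,
    so a genuinely invalid document is still reported.
    """
    try:
        validator.validate(xdm_node=xdm_node)
        return "xdm_node"
    except Exception:  # assumption: the thread shows PySaxonApiError here
        validator.validate(file_name=file_name)
        return "file_name"
```

Note that the fallback validates the file on disk, which may differ from the in-memory node if the document was modified after parsing.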
Updated by O'Neil Delpratt 5 months ago
- Copied to Bug #6475: saxoncee.PySaxonApiError: Source document not found. Line number: -1 added
Updated by O'Neil Delpratt 5 months ago
I have moved the issue relating to the PySchemaValidator "not finding the source as XdmNode" to a new bug issue #6475.
Updated by Matt Patterson 5 months ago
- Status changed from In Progress to Resolved
- Fix Committed on Branch 12 added
Fixed by changes to the build system to ensure that the XML Resolver's local cache of important unchanging documents (like the XHTML1 DTDs) is included in SaxonC.
The fix will be included in the next maintenance release.