Setting parser features for the XML parser SaxonC uses
With some help on Slack I learned that Apache Xerces has a parser feature http://apache.org/xml/features/nonvalidating/load-external-dtd that you can set to false to avoid loading the external DTD.
I have tested that SaxonJ HE 12.0, running on Java 11, allows me to do that on its Configuration with:
    Processor processor = new Processor();
    Configuration configuration = processor.getUnderlyingConfiguration();
    configuration.setParseOptions(configuration.getParseOptions().withParserFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false));
    DocumentBuilder docBuilder = processor.newDocumentBuilder();
    XdmNode inputDoc = docBuilder.build(new File("sample1.xml"));
I take it that SaxonC somehow currently internalizes/runs on Java 11 with a Java XML parser (probably an internalized version of Apache Xerces), so I would like to find out whether that feature, or any known Xerces parser feature, can be set with the Python API for Saxon as well.
Browsing through the API I don't see any obvious way; the closest is e.g. set_configuration_property, but I think that is for Saxon's predefined configuration properties.
Is there any way to reach deeper API levels/settings like that parser feature from Python?
Digging through the Saxon 12 Java API I found e.g. https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/lib/Feature.html#XML_PARSER_FEATURE saying
    public static final Feature<java.lang.Boolean> XML_PARSER_FEATURE
Sets the value of a parser feature flag. The feature name is any fully-qualified URI.
For example if the parser supports a feature http://xml.org/sax/features/external-parameter-entities then this can be set by setting the value of the Configuration property: http://saxon.sf.net/feature/parserFeature?uri=http%3A//xml.org/sax/features/external-parameter-entities to true.
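As a minimal sketch of how such a property name can be assembled in Python: the prefix and the %3A encoding of the colon are taken from the documentation quoted above, while the helper name parser_feature_property and the constant PARSER_FEATURE_PREFIX are my own hypothetical names, not part of any Saxon API.

```python
from urllib.parse import quote

# Prefix as quoted from the Feature.XML_PARSER_FEATURE documentation.
PARSER_FEATURE_PREFIX = "http://saxon.sf.net/feature/parserFeature?uri="


def parser_feature_property(feature_uri):
    """Build the configuration property name for a parser feature URI.

    Hypothetical helper: percent-encodes the feature URI but leaves '/'
    untouched, so that ':' becomes %3A as in the documented example.
    """
    return PARSER_FEATURE_PREFIX + quote(feature_uri, safe="/")


name = parser_feature_property(
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"
)
print(name)
```

The resulting string is what would be passed as the first argument of a configuration property setter; whether the underlying parser honours the feature is a separate question.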
Based on that I tried
    from saxonche import *

    with PySaxonProcessor(license=True) as proc:
        print(proc.version)
        proc.set_cwd('.')
        proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')
        #xdm_doc = proc.parse_xml(xml_file_name='sample1.xml')
        xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')
        if proc.exception_occurred:
            print(proc.error_message)
        else:
            print(xdm_doc)
but it doesn't seem to work. The set_configuration_property call doesn't give an error, and if I load/parse a file with no DOCTYPE referencing an external DTD, the document is parsed fine. A document referencing an external DTD that doesn't exist, however, gives no exception/error, but xdm_doc is None, i.e. the output is

    SaxonC-HE 12.0 from Saxonica
    None
Any hints on where/why that approach fails would be appreciated.
Java code like this seems to do the job in the Java API, so it seems the above Python should also work:
    Processor processor = new Processor();
    String featureName = Feature.XML_PARSER_FEATURE.name + "http%3A//apache.org/xml/features/nonvalidating/load-external-dtd";
    System.out.println(featureName);
    processor.setConfigurationProperty(featureName, false);
I kind of got it working in a very odd way. Browsing through https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_11_4/entry/src/main/c/Saxon.C.API/SaxonProcessor.cpp, it appears configuration properties are only applied by a method applyConfigurationProperties, which is called when a new XPath/XQuery/XSLT processor is created. So I create one before calling parse_xml, with e.g.
    from saxonche import *

    with PySaxonProcessor(license=True) as proc:
        print(proc.version)
        proc.set_cwd('.')
        proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')
        xslt_processor = proc.new_xslt30_processor()
        xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')
        if proc.exception_occurred:
            print(proc.error_message)
        else:
            print(xdm_doc)
et voilà, the referenced, non-existent external DTD is indeed ignored.
So it is possible to set a parser feature from the Python API after all, which is nice to know.
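For reuse, the workaround could be wrapped roughly like this. It is only a sketch under the assumption that SaxonC 12.0 behaves as observed in this thread (properties applied only once a processor such as an XSLT processor is created); the function name parse_ignoring_external_dtd is hypothetical, and it only uses the set_configuration_property, new_xslt30_processor and parse_xml calls already shown in this thread.

```python
# Property name exactly as used earlier in this thread.
LOAD_EXTERNAL_DTD = (
    "http://saxon.sf.net/feature/parserFeature?uri="
    "http%3A//apache.org/xml/features/nonvalidating/load-external-dtd"
)


def parse_ignoring_external_dtd(proc, xml_file_name):
    """Hypothetical helper: parse a file without loading its external DTD.

    `proc` is expected to be a PySaxonProcessor (or any object offering the
    same three methods).
    """
    proc.set_configuration_property(LOAD_EXTERNAL_DTD, "false")
    # Creating a processor is only done for its side effect: in SaxonC 12.0
    # this appears to be what triggers applyConfigurationProperties.
    proc.new_xslt30_processor()
    return proc.parse_xml(xml_file_name=xml_file_name)


# Usage, assuming saxonche is installed:
#
#   from saxonche import PySaxonProcessor
#   with PySaxonProcessor(license=False) as proc:
#       doc = parse_ignoring_external_dtd(proc, "sample1-no-dtd1.xml")
#       print(doc)
```

The extra processor creation may become unnecessary if the library starts applying configuration properties in parse_xml itself.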
Good spot. I think we need to extend the call on applyConfigurationProperties to the parseXml methods too.
Added the following bug issue to track this: #5885