setting parser features for the XML parser SaxonC uses
Added by Martin Honnen over 1 year ago
With some help on Slack I learned that Apache Xerces has a parser feature, http://apache.org/xml/features/nonvalidating/load-external-dtd, that you can set to false to avoid loading the external DTD.
I have tested that SaxonJ HE 12.0 lets me set it on its Configuration, using Java 11:
Processor processor = new Processor();
Configuration configuration = processor.getUnderlyingConfiguration();
configuration.setParseOptions(configuration.getParseOptions().withParserFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false));
DocumentBuilder docBuilder = processor.newDocumentBuilder();
XdmNode inputDoc = docBuilder.build(new File("sample1.xml"));
I take it that SaxonC currently somehow embeds or runs on Java 11 with a Java XML parser (probably an internalized version of Apache Xerces), so I would like to try whether that feature, or any known Xerces parser feature, can be set with the Python API for Saxon as well.
Browsing through the API I don't see any obvious way; the closest is set_configuration_property, but I think that is for Saxon's predefined configuration properties.
Is there any way to reach deeper API levels/settings, like that parser feature, from Python?
Replies (5)
RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen over 1 year ago
Digging through the Saxon 12 Java API I found e.g. https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/lib/Feature.html#XML_PARSER_FEATURE, which says:
public static final Feature<java.lang.Boolean> XML_PARSER_FEATURE
Sets the value of a parser feature flag. The feature name is any fully-qualified URI.
For example if the parser supports a feature http://xml.org/sax/features/external-parameter-entities then this can be set by setting the value of the Configuration property: http://saxon.sf.net/feature/parserFeature?uri=http%3A//xml.org/sax/features/external-parameter-entities to true.
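The property name described there is the fixed prefix followed by the percent-encoded feature URI. A minimal sketch of building it in Python, assuming the encoding shown in the Javadoc example (colon escaped as %3A, slashes left alone) is what Saxon expects; the helper name parser_feature_property is made up for illustration and is not part of saxonche:

```python
from urllib.parse import quote

# Prefix taken from the Feature.XML_PARSER_FEATURE documentation quoted above.
PARSER_FEATURE_PREFIX = "http://saxon.sf.net/feature/parserFeature?uri="


def parser_feature_property(feature_uri: str) -> str:
    """Return the configuration property name for a parser feature URI.

    Hypothetical helper. quote() leaves '/' untouched by default and
    escapes ':' as %3A, which matches the Javadoc example.
    """
    return PARSER_FEATURE_PREFIX + quote(feature_uri)
```

The returned string is what would be passed as the first argument to set_configuration_property, with 'true' or 'false' as the value.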
Based on that I tried
from saxonche import *

with PySaxonProcessor(license=True) as proc:
    print(proc.version)
    proc.set_cwd('.')
    proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')
    #xdm_doc = proc.parse_xml(xml_file_name='sample1.xml')
    xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')
    if proc.exception_occurred:
        print(proc.error_message)
    else:
        print(xdm_doc)
but it doesn't seem to work. The set_configuration_property call doesn't give an error, and if I parse a file with no DOCTYPE referencing an external DTD, the document is parsed fine. A document referencing an external DTD that doesn't exist, however, gives no exception/error, but xdm_doc is None, i.e. the output is
SaxonC-HE 12.0 from Saxonica
None
Any hints on where/why that approach fails are appreciated.
RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen over 1 year ago
Java code like this seems to do the job in the Java API, so it seems the above Python should also work:
Processor processor = new Processor();
String featureName = Feature.XML_PARSER_FEATURE.name + "http%3A//apache.org/xml/features/nonvalidating/load-external-dtd";
System.out.println(featureName);
processor.setConfigurationProperty(featureName, false);
RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen over 1 year ago
I kind of got it working in a very odd way. Browsing through https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_11_4/entry/src/main/c/Saxon.C.API/SaxonProcessor.cpp, it appears that configuration properties are only applied by a method applyConfigurationProperties, which is called when a new XPath/XQuery/XSLT processor is created. So I create one before calling parse_xml, e.g.
from saxonche import *

with PySaxonProcessor(license=True) as proc:
    print(proc.version)
    proc.set_cwd('.')
    proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')
    xslt_processor = proc.new_xslt30_processor()
    xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')
    if proc.exception_occurred:
        print(proc.error_message)
    else:
        print(xdm_doc)
et voilà, the referenced, non-existent external DTD is indeed ignored.
So somehow it is possible to set a parser feature from the Python API, nice to know.
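The ordering in the workaround above is easy to get wrong, so it could be wrapped in a small helper. This is only a sketch: the function name is made up, proc is assumed to be an open PySaxonProcessor, and it uses nothing beyond the three calls already shown in the snippet above.

```python
# Property name from the posts above: Saxon's parserFeature prefix plus the
# percent-encoded Xerces feature URI.
LOAD_EXTERNAL_DTD = (
    "http://saxon.sf.net/feature/parserFeature?uri="
    "http%3A//apache.org/xml/features/nonvalidating/load-external-dtd"
)


def parse_ignoring_external_dtd(proc, xml_file_name):
    """Parse a file with external DTD loading switched off.

    Hypothetical helper; `proc` is assumed to be an open PySaxonProcessor.
    """
    proc.set_configuration_property(LOAD_EXTERNAL_DTD, "false")
    # Creating a processor appears to be what makes SaxonC apply the
    # configuration properties; the returned object is deliberately unused.
    proc.new_xslt30_processor()
    return proc.parse_xml(xml_file_name=xml_file_name)
```

Inside the with block above it would be called as xdm_doc = parse_ignoring_external_dtd(proc, 'sample1-no-dtd1.xml'), followed by the same proc.exception_occurred check.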
RE: setting parser features for the XML parser SaxonC uses - Added by O'Neil Delpratt over 1 year ago
Good spot. I think we need to extend the call on applyConfigurationProperties to the parseXml methods too.
RE: setting parser features for the XML parser SaxonC uses - Added by O'Neil Delpratt over 1 year ago
Added the following bug issue to track this: #5885