setting parser features for the XML parser SaxonC uses

Added by Martin Honnen about 1 year ago

With some help on Slack I learned that Apache Xerces has a parser feature http://apache.org/xml/features/nonvalidating/load-external-dtd you can set to false to avoid loading the DTD.

I have tested that SaxonJ HE 12.0, running on Java 11, lets me set that feature on its Configuration:

        Processor processor = new Processor();
        Configuration configuration = processor.getUnderlyingConfiguration();
        configuration.setParseOptions(configuration.getParseOptions().withParserFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false));

        DocumentBuilder docBuilder = processor.newDocumentBuilder();

        XdmNode inputDoc = docBuilder.build(new File("sample1.xml"));

I take it that SaxonC currently somehow internalizes or runs on Java 11 with a Java XML parser (probably an internalized version of Apache Xerces), so I would like to find out whether that feature, or any other known Xerces parser feature, can be set through the Python API for Saxon as well.

Browsing through the API, I don't see any obvious way. The closest is set_configuration_property, but I think that is meant for Saxon's predefined configuration properties.

Is there any way to reach deeper API levels or settings, like that parser feature, from Python?


Replies (5)

RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen about 1 year ago

Digging through the Saxon 12 Java API I found e.g. https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/lib/Feature.html#XML_PARSER_FEATURE saying

public static final Feature<java.lang.Boolean> XML_PARSER_FEATURE

Sets the value of a parser feature flag. The feature name is any fully-qualified URI.

For example if the parser supports a feature http://xml.org/sax/features/external-parameter-entities then this can be set by setting the value of the Configuration property: http://saxon.sf.net/feature/parserFeature?uri=http%3A//xml.org/sax/features/external-parameter-entities to true.
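A minimal sketch of how such a property name can be assembled in Python, assuming (as in the documentation example above) that only the colon of the feature URI needs percent-encoding while the slashes stay as they are. The helper name parser_feature_property is hypothetical and not part of any Saxon API:

```python
from urllib.parse import quote

# Prefix taken from the Feature.XML_PARSER_FEATURE documentation quoted above.
PARSER_FEATURE_PREFIX = "http://saxon.sf.net/feature/parserFeature?uri="


def parser_feature_property(feature_uri: str) -> str:
    """Build the configuration property name for a parser feature URI.

    Hypothetical helper: percent-encodes the feature URI but keeps '/'
    unescaped, matching the 'http%3A//...' form shown in the documentation.
    """
    return PARSER_FEATURE_PREFIX + quote(feature_uri, safe="/")


print(parser_feature_property(
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"))
```

The resulting string is what gets passed as the property name, with the feature value passed separately.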

Based on that I tried

from saxonche import *

with PySaxonProcessor(license=True) as proc:
    print(proc.version)

    proc.set_cwd('.')

    proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')

    #xdm_doc = proc.parse_xml(xml_file_name='sample1.xml')
    xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')

    if proc.exception_occurred:
        print(proc.error_message)
    else:
        print(xdm_doc)

but it doesn't seem to work. The set_configuration_property call doesn't give an error, and a file with no DOCTYPE referencing an external DTD is parsed fine. A document referencing an external DTD that doesn't exist, however, gives no exception or error, yet xdm_doc is None, i.e. the output is

SaxonC-HE 12.0 from Saxonica
None

Any hints where/why that approach fails appreciated.

RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen about 1 year ago

Java code like this seems to do the job in the Java API, so it seems the above Python should also work:

        Processor processor = new Processor();

        String featureName = Feature.XML_PARSER_FEATURE.name + "http%3A//apache.org/xml/features/nonvalidating/load-external-dtd";

        System.out.println(featureName);

        processor.setConfigurationProperty(featureName, false);

RE: setting parser features for the XML parser SaxonC uses - Added by Martin Honnen about 1 year ago

I kind of got it working in a very odd way. Browsing through https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_11_4/entry/src/main/c/Saxon.C.API/SaxonProcessor.cpp, it appears that configuration properties are only applied by a method applyConfigurationProperties, which is called when a new XPath/XQuery/XSLT processor is created. So I create one before calling parse_xml, e.g.

from saxonche import *

with PySaxonProcessor(license=True) as proc:
    print(proc.version)

    proc.set_cwd('.')

    proc.set_configuration_property('http://saxon.sf.net/feature/parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd', 'false')

    xslt_processor = proc.new_xslt30_processor()

    xdm_doc = proc.parse_xml(xml_file_name='sample1-no-dtd1.xml')

    if proc.exception_occurred:
        print(proc.error_message)
    else:
        print(xdm_doc)

Et voilà, the referenced, non-existent external DTD is indeed ignored.

So somehow it is possible to set a parser feature from the Python API, nice to know.
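The workaround can be wrapped in a small function, sketched here under the assumption that creating any XSLT processor is enough to trigger applyConfigurationProperties, as observed with SaxonC 12.0 above. The function name parse_with_parser_feature is hypothetical; it only uses the PySaxonProcessor calls already shown in this thread and takes the processor as an argument:

```python
from urllib.parse import quote

PARSER_FEATURE_PREFIX = "http://saxon.sf.net/feature/parserFeature?uri="


def parse_with_parser_feature(proc, xml_file_name, feature_uri, enabled):
    """Parse a file after setting a parser feature (hypothetical helper).

    `proc` is expected to be a PySaxonProcessor (or anything offering the
    same three methods used below).
    """
    prop = PARSER_FEATURE_PREFIX + quote(feature_uri, safe="/")
    # Property values are passed as strings, as in the examples above.
    proc.set_configuration_property(prop, "true" if enabled else "false")
    # Workaround: the properties appear to be applied only when a new
    # XPath/XQuery/XSLT processor is created, so create one and discard it.
    proc.new_xslt30_processor()
    return proc.parse_xml(xml_file_name=xml_file_name)
```

Inside a `with PySaxonProcessor(license=True) as proc:` block this would be called as `parse_with_parser_feature(proc, 'sample1-no-dtd1.xml', 'http://apache.org/xml/features/nonvalidating/load-external-dtd', False)`. Checking proc.exception_occurred afterwards is still advisable, since a failed parse may return None without raising.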

RE: setting parser features for the XML parser SaxonC uses - Added by O'Neil Delpratt about 1 year ago

Good spot. I think we need to extend the call on applyConfigurationProperties to the parseXml methods too.

RE: setting parser features for the XML parser SaxonC uses - Added by O'Neil Delpratt about 1 year ago

Added the following bug issue to track this: #5885
