Whitespace collapsing schema elements and JAXP validating document parses

Added by Chris Dennis almost 4 years ago

I'm currently grappling with a behavioral difference between Xerces (in its guise as the default XML parser in Oracle/OpenJDK) and Saxon-EE when building a DOM with validation.

Consider the schema:

<xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/test">
  <xs:element name="test">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:whiteSpace value="collapse"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

and associated document

<test xmlns='http://www.example.com/test'>    foo    bar    </test>

When parsed using the following Java code:

SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema testSchema = schemaFactory.newSchema(SaxonTest.class.getResource("/test.xsd"));

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setSchema(testSchema);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
documentBuilder.setErrorHandler(new ConfigurationParser.FatalErrorHandler());

Document document = documentBuilder.parse(SaxonTest.class.getResource("/test.xml").toURI().toString());

System.out.println(document.getDocumentElement().getTextContent());

Saxon prints what I understand to be the "correct" string, with no whitespace normalization applied: "    foo    bar    ". Xerces, on the other hand, collapses the whitespace and prints "foo bar".

What I haven't been able to work out is how to get Saxon to normalize the whitespace as per the schema and give me the string "foo bar".

Replies (2)

RE: Whitespace collapsing schema elements and JAXP validating document parses - Added by Michael Kay almost 4 years ago

What implementation classes are you picking up here for SchemaFactory and DocumentBuilderFactory? Saxon has an implementation of SchemaFactory but it doesn't have an implementation for DocumentBuilderFactory. The reason for that, IIRC, is that Saxon's DocumentBuilder constructs a read-only DOM, and that's something you don't want to happen by accident, just because Saxon happens to be around on your classpath.
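A quick way to answer that question is to print the class names of the factories JAXP actually resolves. This is only a minimal sketch using standard JAXP calls; the class name JaxpImplementations and its helper methods are illustrative, and the output depends entirely on what is on the classpath.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.validation.SchemaFactory;

// Illustrative helper (not part of JAXP or Saxon): reports which
// implementation classes the JAXP factory lookup resolves to.
final class JaxpImplementations {

    // Class name of the SchemaFactory chosen for W3C XML Schema.
    static String schemaFactoryImpl() {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .getClass().getName();
    }

    // Class name of the DocumentBuilderFactory chosen by JAXP.
    static String documentBuilderFactoryImpl() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SchemaFactory:          " + schemaFactoryImpl());
        System.out.println("DocumentBuilderFactory: " + documentBuilderFactoryImpl());
    }
}
```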

Generally, DOM is very much a second-class citizen as far as Saxon is concerned: XSLT and XQuery play much more nicely with immutable tree models, and with a closer conformance to XDM.

Related to this, the idea in XSD is that validation should produce a PSVI (post schema validation infoset). But the standard DOM doesn't provide a PSVI, so the effect of validating while constructing a DOM isn't really defined in any standard. Saxon doesn't implement the full PSVI, it only implements what's needed for schema-aware XSLT and XQuery processing, which is to attach type annotations to nodes that have been validated; and Saxon only does this for its own tree models (based on XDM) rather than for the DOM.

RE: Whitespace collapsing schema elements and JAXP validating document parses - Added by Chris Dennis almost 4 years ago

SchemaFactory is picking up com.saxonica.ee.jaxp.SchemaFactoryImpl (from Saxon-EE on the classpath). DocumentBuilderFactory is picking up com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl from the JRE.

It appears that Xerces started collapsing the whitespace with this changeset: http://svn.apache.org/viewvc?view=revision&revision=318145. Their ValidatorHandler implementation normalizes the string on the way through. This does seem to be at odds with the spec for org.w3c.dom.Node#getTextContent(): "No whitespace normalization is performed and the returned string does not contain the white spaces in element content (see the attribute Text.isElementContentWhitespace)."

I'm the developer of a library that needs to play nicely with as many environments as possible. This library uses XML for configuration and has a configuration extension system based around DOM objects. I'm therefore trying to see if it's possible to get a JAXP -> DOM parsing setup that normalizes whitespace based on XSD type information when Saxon is the user's default SchemaFactory implementation.

It sounds like this would only be possible if I were to switch to using XDM as an intermediate representation to get access to the PSVI info, perform the whitespace normalization, and then finally convert to DOM. Is that correct?
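For reference, the normalization in question is small enough to apply by hand once the affected elements are known. The following is a minimal sketch of the XSD whiteSpace="collapse" rule (tab, line feed and carriage return become spaces, runs of spaces become one space, leading and trailing spaces are dropped). It assumes the caller already knows which elements have a collapsing type, since a plain DOM carries no type information; the class and method names are hypothetical and not part of JAXP or Saxon.

```java
// Illustrative helper: applies the XSD whiteSpace="collapse" rule to a string.
final class WhitespaceCollapse {

    static String collapse(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean pendingSpace = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // Remember a separator only if some content precedes it,
                // which drops leading whitespace.
                pendingSpace = out.length() > 0;
            } else {
                // Emit a single space for any whitespace run between tokens.
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        // A trailing whitespace run is never emitted.
        return out.toString();
    }

    public static void main(String[] args) {
        // prints [foo bar]
        System.out.println("[" + collapse("    foo    bar    ") + "]");
    }
}
```

Applied to the result of getTextContent() on the test element above, this yields the same "foo bar" string that the Xerces-only setup produces.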
