Whitespace collapsing schema elements and JAXP validating document parses
Added by Chris Dennis almost 4 years ago
I'm currently grappling with a behavioral difference between Xerces (in its guise as the default XML parser in Oracle/OpenJDK) and Saxon-EE when building a DOM with validation.
Consider the schema:
<xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/test">
<xs:element name="test">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:whiteSpace value="collapse"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:schema>
and associated document
<test xmlns='http://www.example.com/test'> foo bar </test>
When parsed using the following Java code:
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema testSchema = schemaFactory.newSchema(SaxonTest.class.getResource("/test.xsd"));
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setSchema(testSchema);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
documentBuilder.setErrorHandler(new ConfigurationParser.FatalErrorHandler());
Document document = documentBuilder.parse(SaxonTest.class.getResource("/test.xml").toURI().toString());
System.out.println(document.getDocumentElement().getTextContent());
Saxon prints what I understand to be the "correct" string, with no whitespace normalization applied: " foo bar ". Xerces, on the other hand, collapses the whitespace and prints "foo bar".
What I haven't been able to work out is how to get Saxon to normalize the whitespace as per the schema and give me the "foo bar" string.
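For reference, the XSD "collapse" rule replaces each tab, line feed, and carriage return with a space, then reduces runs of spaces to a single space and removes leading and trailing spaces. A minimal sketch of that rule in plain Java follows; the class and method names are hypothetical and not part of Xerces, Saxon, or JAXP.

```java
public class WhitespaceCollapse {

    // Applies the XSD whiteSpace="collapse" rule to a lexical value.
    // Only #x9, #xA, #xD and #x20 count as whitespace here, which is
    // narrower than what String.trim() removes.
    static String collapse(String value) {
        // Step 1 ("replace"): tab, LF and CR each become a single space.
        String replaced = value.replaceAll("[\\t\\n\\r]", " ");
        // Step 2: contiguous runs of spaces shrink to one space.
        String collapsed = replaced.replaceAll(" {2,}", " ");
        // Step 3: drop the (at most one) leading and trailing space.
        int start = collapsed.startsWith(" ") ? 1 : 0;
        int end = collapsed.length();
        if (end > start && collapsed.endsWith(" ")) {
            end--;
        }
        return collapsed.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse(" foo bar ") + "]");
    }
}
```

Applied to the text content of the sample document, this gives the "foo bar" string that Xerces reports.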
Replies (2)
RE: Whitespace collapsing schema elements and JAXP validating document parses - Added by Michael Kay almost 4 years ago
What implementation classes are you picking up here for SchemaFactory and DocumentBuilderFactory? Saxon has an implementation of SchemaFactory but it doesn't have an implementation for DocumentBuilderFactory. The reason for that, IIRC, is that Saxon's DocumentBuilder constructs a read-only DOM, and that's something you don't want to happen by accident, just because Saxon happens to be around on your classpath.
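One way to check which implementations JAXP resolves is to print the runtime class of each factory. The sketch below uses only standard JAXP calls; the class and method names are hypothetical, and the output depends entirely on what is on the classpath.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.validation.SchemaFactory;

public class JaxpImplementations {

    // Name of the SchemaFactory implementation JAXP selects for W3C XML Schema.
    static String schemaFactoryClass() {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .getClass().getName();
    }

    // Name of the DocumentBuilderFactory implementation JAXP selects.
    static String documentBuilderFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SchemaFactory:          " + schemaFactoryClass());
        System.out.println("DocumentBuilderFactory: " + documentBuilderFactoryClass());
    }
}
```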
Generally, DOM is very much a second-class citizen as far as Saxon is concerned: XSLT and XQuery play much more nicely with immutable tree models, and with a closer conformance to XDM.
Related to this, the idea in XSD is that validation should produce a PSVI (post schema validation infoset). But the standard DOM doesn't provide a PSVI, so the effect of validating while constructing a DOM isn't really defined in any standard. Saxon doesn't implement the full PSVI, it only implements what's needed for schema-aware XSLT and XQuery processing, which is to attach type annotations to nodes that have been validated; and Saxon only does this for its own tree models (based on XDM) rather than for the DOM.
RE: Whitespace collapsing schema elements and JAXP validating document parses - Added by Chris Dennis almost 4 years ago
SchemaFactory is picking up com.saxonica.ee.jaxp.SchemaFactoryImpl (from Saxon-EE on the classpath). DocumentBuilderFactory is picking up com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl from the JRE. It appears that Xerces started doing this with this changeset: http://svn.apache.org/viewvc?view=revision&revision=318145. Their ValidatorHandler implementation normalizes the string on the way through. This does seem to be at odds with the spec for org.w3c.dom.Node#getTextContent(): "No whitespace normalization is performed and the returned string does not contain the white spaces in element content (see the attribute Text.isElementContentWhitespace)."
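As a baseline for the getTextContent() contract quoted above, here is a minimal sketch that parses the same document with no schema set on the factory. With no validation involved, the leading and trailing spaces should come back unchanged. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class UnvalidatedTextContent {

    // Parses the XML string without any schema and returns the text
    // content of the document element exactly as the DOM reports it.
    static String textContentOf(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<test xmlns='http://www.example.com/test'> foo bar </test>";
        // Brackets make any leading or trailing whitespace visible.
        System.out.println("[" + textContentOf(xml) + "]");
    }
}
```");
if (!result.equals(" foo bar ")) throw new AssertionError("expected unnormalized text but got [" + result + "]");
</test>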
I'm the developer of a library that needs to play nicely with as many environments as possible. This library uses XML for configuration and has a configuration extension system based around DOM objects. I'm therefore trying to see if it's possible to get a JAXP -> DOM parsing setup that normalizes whitespace based on XSD type information when Saxon is the user's default SchemaFactory implementation.
It sounds like this would only be possible if I were to switch to using XDM as an intermediate representation to get access to the PSVI info, perform the whitespace normalization, and then finally convert to DOM. Is that correct?