Bug #1813
Closed
Absent XHTML DTD entities
100%
Description
From Ricki Brown ricki.w.brown@gmail.com in direct email:
I've been trying out Saxon as a replacement for the standard Java transformer and I was wanting to transform XHTML documents (possibly my first mistake) to obtain (say) a list of image src attributes. I'm not sure if this is a good idea exactly; I did consider using alternatives like JSoup but the rest of my code uses XSLT in some form.
So my documents look like
<title>Hello World</title>Some text
with
<xsl:stylesheet
    version="1.0"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xhtml">
  <xsl:output method="xml"
      indent="yes"
      encoding="UTF-8"
      standalone="yes"/>
  <xsl:template match="/">
    <images>
      <xsl:apply-templates/>
    </images>
  </xsl:template>
  <xsl:template match="text()"/>
  <xsl:template match="//xhtml:img">
    <image><xsl:value-of select="@src"/></image>
  </xsl:template>
</xsl:stylesheet>
and they were taking a long time to transform. After reading around on the subject, I understand that Saxon uses its own entity resolver to fetch common entities from within the Saxon jar file, but when I paused the process there were HTTP connections active.
When I disabled my internet connection and ran something like
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(myStylesheet);
Source src = new StreamSource(myFile);
Result res = new StreamResult(System.out);
transformer.transform(src, res);
I got
Exception in thread "main" net.sf.saxon.trans.XPathException: I/O error reported by XML parser processing file:/: www.w3.org
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:427)
at net.sf.saxon.event.Sender.send(Sender.java:169)
at net.sf.saxon.Controller.transform(Controller.java:1890)
Caused by: java.net.UnknownHostException: www.w3.org
I downloaded the source and attached a breakpoint to the last line of StandardEntityResolver's resolveEntity method and found that the following entities aren't mapped
-//W3C//ELEMENTS XHTML Inline Style 1.0//EN
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-inlstyle-1.mod
-//W3C//ELEMENTS XHTML Editing Elements 1.0//EN
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-edit-1.mod
-//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-bdo-1.mod
-//W3C//ELEMENTS XHTML Style Sheets 1.0//EN
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-style-1.mod
I can register these entities myself by calling StandardEntityResolver.register with the appropriate arguments and then everything works without an internet connection.
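For readers who want a similar offline workaround without relying on Saxon internals, the following is a minimal sketch using only the JDK's SAX interfaces. The class `OfflineEntityResolver` and its `register` method are hypothetical names chosen for illustration; they are not Saxon's `StandardEntityResolver` API, and the local file locations are assumed to be supplied by the caller.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of Saxon: redirects known public IDs to local files. */
public class OfflineEntityResolver implements EntityResolver {
    private final Map<String, Path> localCopies = new HashMap<>();

    /** Associates a DTD public identifier with a local copy of the file. */
    public void register(String publicId, Path localCopy) {
        localCopies.put(publicId, localCopy);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        Path local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // Unknown entity: returning null lets the parser fall back to the system ID.
            return null;
        }
        // Point the parser at the local file instead of the remote URL.
        InputSource source = new InputSource(local.toUri().toString());
        source.setPublicId(publicId);
        return source;
    }
}
```

Such a resolver could be attached with `XMLReader.setEntityResolver` and the reader handed to the transformer inside a `SAXSource`, so that the four module files above are read from disk rather than from www.w3.org.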
Updated by Michael Kay over 11 years ago
- Status changed from New to In Progress
The first thing I notice is that you are using the DTD at
http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
whereas the one that Saxon has embedded is
http://www.w3.org/MarkUp/DTD/xhtml11.dtd
The public identifier is the same in both cases.
The content of both locations is very similar but the /TR/ version seems to have a 2010 date while the /MarkUp/ version has a 2009 date.
More confusingly still, the content of the /DTD/ version instructs you to invoke it using
However, Saxon uses the public ID in preference, so -//W3C//DTD XHTML 1.1//EN should be resolved to Saxon's internal copy. The fact that you used a different system ID should therefore be irrelevant.
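The "public ID in preference" rule can be illustrated with a small lookup table. This is only a sketch of the idea, not Saxon's actual code: the class `PublicIdFirstCatalog`, its methods, and the file name used in the usage note are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative lookup table; class, method and file names are assumptions. */
public class PublicIdFirstCatalog {
    private final Map<String, String> byPublicId = new HashMap<>();
    private final Map<String, String> bySystemId = new HashMap<>();

    /** Records a local file name under the given public and system identifiers. */
    public void register(String publicId, String systemId, String fileName) {
        if (publicId != null) {
            byPublicId.put(publicId, fileName);
        }
        if (systemId != null) {
            bySystemId.put(systemId, fileName);
        }
    }

    /** Returns the local file name, or null if neither identifier is known. */
    public String find(String publicId, String systemId) {
        // The public ID is consulted first; the system ID is only a fallback.
        String fileName = (publicId == null) ? null : byPublicId.get(publicId);
        if (fileName == null && systemId != null) {
            fileName = bySystemId.get(systemId);
        }
        return fileName;
    }
}
```

With a copy registered under `-//W3C//DTD XHTML 1.1//EN` and the /MarkUp/ URL, a request carrying the same public ID and the /TR/ URL would still find the local file, which is why the differing system ID should not matter here.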
I don't know if parsers are selective in downloading parts of the DTD (downloading a component only if it is needed). If they are, this could explain how you get a failure that we didn't get in our testing. In turn this suggests that testing probably isn't a good way of ensuring our list is complete.
Updated by Michael Kay over 11 years ago
- Status changed from In Progress to Resolved
What's happening is that there are various versions of the XHTML DTDs using different system IDs and different public IDs. Saxon should retrieve these correctly based on the public ID, but in some cases (specifically the four cases identified) the links from one DTD to another are using the incorrect public ID. The four inconsistencies are:
-//W3C//DTD XHTML Style Sheets 1.0//EN
-//W3C//ELEMENTS XHTML Style Sheets 1.0//EN
-//W3C//ELEMENTS XHTML BDO Element 1.0//EN
-//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN
-//W3C//ELEMENTS XHTML Editing Markup 1.0//EN
-//W3C//ELEMENTS XHTML Editing Elements 1.0//EN
-//W3C//ENTITIES XHTML Inline Style 1.0//EN
-//W3C//ELEMENTS XHTML Inline Style 1.0//EN
In each case Saxon already has a copy of the file, but is not finding it because the link to it uses a different system ID and a different public ID.
I'm tempted to change the actual DTD contents to use the correct public IDs, but instead I have chosen to register the file in Saxon's internal catalog under both the correct and incorrect public identifiers.
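The dual registration can be pictured roughly as follows. This is a sketch under assumptions, not the actual change made to Saxon: the class and method names are invented, the pairs are taken in the order they appear in the list above, and the file names are the last path segments of the system IDs quoted in the original report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: one local file registered under two public identifiers. */
public class DualRegistration {

    /** Registers the same file under both public IDs of a pair. */
    public static void registerBoth(Map<String, String> catalog,
                                    String firstId, String secondId, String fileName) {
        catalog.put(firstId, fileName);
        catalog.put(secondId, fileName);
    }

    /** Builds a catalog holding the four pairs listed in the comment above. */
    public static Map<String, String> buildCatalog() {
        Map<String, String> catalog = new LinkedHashMap<>();
        registerBoth(catalog,
                "-//W3C//DTD XHTML Style Sheets 1.0//EN",
                "-//W3C//ELEMENTS XHTML Style Sheets 1.0//EN",
                "xhtml-style-1.mod");
        registerBoth(catalog,
                "-//W3C//ELEMENTS XHTML BDO Element 1.0//EN",
                "-//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN",
                "xhtml-bdo-1.mod");
        registerBoth(catalog,
                "-//W3C//ELEMENTS XHTML Editing Markup 1.0//EN",
                "-//W3C//ELEMENTS XHTML Editing Elements 1.0//EN",
                "xhtml-edit-1.mod");
        registerBoth(catalog,
                "-//W3C//ENTITIES XHTML Inline Style 1.0//EN",
                "-//W3C//ELEMENTS XHTML Inline Style 1.0//EN",
                "xhtml-inlstyle-1.mod");
        return catalog;
    }
}
```

Registering both identifiers leaves the DTD files untouched, so whichever public ID a document or a linking DTD happens to use, the lookup lands on the same local copy.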
I will also change the standard entity resolver so that in cases where the -t option is set and a W3C URL is not resolved locally, a message to that effect is output.
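A diagnostic of this kind might look something like the sketch below. It is not the code that went into Saxon: the class name, the `tracing` flag standing in for the -t option, and the wording of the message are all assumptions made for illustration.

```java
import java.io.PrintStream;
import java.net.URI;

/** Illustrative only: reports W3C URLs that had to be fetched remotely. */
public class UnresolvedW3cTrace {

    /** Returns true if the system ID is an absolute URL on www.w3.org. */
    public static boolean isW3cUrl(String systemId) {
        if (systemId == null) {
            return false;
        }
        try {
            return "www.w3.org".equalsIgnoreCase(URI.create(systemId).getHost());
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
    }

    /** Writes a message only when tracing is on and no local copy was found. */
    public static void report(boolean tracing, boolean resolvedLocally,
                              String publicId, String systemId, PrintStream out) {
        if (tracing && !resolvedLocally && isW3cUrl(systemId)) {
            // Hypothetical message text; the real wording is not given in this issue.
            out.println("No local copy found for " + systemId
                    + " (public ID: " + publicId + ")");
        }
    }
}
```

A message like this would have pointed straight at the four unmapped module files in the original report, without the need to attach a debugger to the resolver.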
Updated by O'Neil Delpratt about 11 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in version set to 9.5.1.2
Bug fix applied in the Saxon 9.5.1.2 maintenance release