Bug #2534: For some unicode characters, Saxon produces incorrect output when they are defined as XML entities in the source document - Saxon - Saxonica Developer Community


Bug #2534

closed

For some unicode characters, Saxon produces incorrect output when they are defined as XML entities in the source document

Added by Peter Ross over 8 years ago. Updated over 8 years ago.

Status: Closed

Priority: Normal

Assignee: Michael Kay

Category: Third-party product

Start date: 2015-12-10

Description

The affected unicode characters are rare Chinese characters.

If an affected character is defined as an XML entity in the source document, and is used inside an attribute, Saxon produces garbage output.

e.g. ...

Whereas if an affected character is defined inline using ampersand notation in the source document, Saxon produces correct output.

e.g.

Please use the attached files to reproduce the problem. The XSLT performs a simple transformation of the source document.

% java -cp saxon9he.jar net.sf.saxon.Transform -s:inline.xml -xsl:test.xsl -o:out-inline.xml

% java -cp saxon9he.jar net.sf.saxon.Transform -s:entity.xml -xsl:test.xsl -o:out-entity.xml

% diff -u out-escape.xml out-entity.xml

I expect the two output files to be identical. And if I use a different XSLT processor, such as libxsltproc, I do get identical output.

Below is a snippet of the output to give you an idea of what is going on. It seems that the affected characters are duplicated in the output stage.

out-inline.xml

==============

...

out-entity.xml

==========

...
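The comparison described above can also be made at the parser level, without Saxon in the loop. The following is a minimal sketch, not a confirmed reproduction: the two documents, the element and attribute names, and the choice of U+20000 (a CJK Extension B character) are all assumptions, since the report does not say which characters are affected. It parses both forms with whatever SAX parser JAXP selects by default and prints the code points of the attribute value each one delivers. Whether the two lines differ depends on the JDK version in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityAttributeCheck {
    // Assumption: U+20000 stands in for the "rare Chinese characters" of the report.
    static final String EXPECTED = new String(Character.toChars(0x20000));

    // Hypothetical documents: the same attribute value written two ways.
    static final String INLINE_XML = "<doc><item name=\"&#x20000;\"/></doc>";
    static final String ENTITY_XML =
        "<!DOCTYPE doc [<!ENTITY rare \"&#x20000;\">]>"
        + "<doc><item name=\"&rare;\"/></doc>";

    /** Parses xml with the default JAXP SAX parser and returns the name attribute of item. */
    static String nameAttribute(String xml) {
        final StringBuilder found = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        if ("item".equals(qName)) {
                            found.append(atts.getValue("name"));
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found.toString();
    }

    /** Formats a string as a space-separated list of code points, e.g. U+20000. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("inline: " + codePoints(nameAttribute(INLINE_XML)));
        System.out.println("entity: " + codePoints(nameAttribute(ENTITY_XML)));
    }
}
```

If the two printed lines differ, the corruption is already present in what the parser hands to the XSLT processor, which would point at the parser rather than at the serializer.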

Files

test.xsl (438 Bytes), Peter Ross, 2015-12-10 06:48
inline.xml (374 Bytes), Peter Ross, 2015-12-10 06:48
entity.xml (743 Bytes), Peter Ross, 2015-12-10 06:48


Updated by Peter Ross over 8 years ago

Typo: diff command should be

% diff out-inline.xml out-entity.xml


Updated by Michael Kay over 8 years ago

Found in version changed from Saxon-HE 9.7.0.1J from Saxonica to 9.7

This is almost certainly the same problem as #2533

https://saxonica.plan.io/issues/2533

which is a long-standing bug in the XML parser embedded in the JDK.

Please check whether the problem occurs when you use Apache Xerces. We recommend always using the version of Xerces from Apache rather than the one in the JDK, because the JDK parser has delivered corrupted attribute values for years and Oracle seem quite uninterested in fixing the problem.
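For code that creates its own parser through JAXP, one way to follow this recommendation is to ask for the Apache Xerces factory by name and fall back to the platform default when it is not on the classpath. This is a sketch under assumptions: the class name `ParserSelection` and the method `preferXerces` are made up for illustration, and the Xerces factory class name is the one shipped in Apache Xerces-J (xercesImpl.jar).

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

public class ParserSelection {
    // JAXP factory class from Apache Xerces-J; only usable if xercesImpl.jar is on the classpath.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Returns the Apache Xerces factory when it can be loaded, otherwise the JAXP default. */
    static SAXParserFactory preferXerces() {
        try {
            Class.forName(XERCES_FACTORY);
            return SAXParserFactory.newInstance(XERCES_FACTORY, null);
        } catch (ClassNotFoundException | FactoryConfigurationError e) {
            // Xerces is not available: fall back to whatever the platform provides.
            return SAXParserFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        System.out.println("Using factory: " + preferXerces().getClass().getName());
    }
}
```

When running from the command line, as in the original report, Saxon documents a `-x:classname` option for naming the parser class used for source documents (for Apache Xerces, `org.apache.xerces.parsers.SAXParser`), with the Xerces jar added to the classpath. Check the documentation for the release in use.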


Updated by Peter Ross over 8 years ago

Michael Kay wrote:

Please check whether the problem occurs when you use Apache Xerces. We recommend always using the version of Xerces from Apache rather than the one in the JDK, because the JDK parser has delivered corrupted attribute values for years and Oracle seem quite uninterested in fixing the problem.

Thanks. When I switch to Apache Xerces the problem goes away. Please close this issue.

[Given that this problem is reproducible, and others are suffering from it, have you considered detecting the broken JDK parser on Saxon startup and emitting a warning?]
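To make the suggestion concrete, here is a hypothetical sketch of what such a startup check could look like. It is not something Saxon does. It assumes the JDK's bundled parser can be recognized by its internal package prefix, `com.sun.org.apache.xerces.internal`, and all class and method names are invented for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

public class ParserWarning {
    // Package under which the JDK ships its internal copy of Xerces.
    static final String JDK_INTERNAL_PREFIX = "com.sun.org.apache.xerces.internal.";

    /** True if the class name belongs to the parser bundled with the JDK. */
    static boolean isJdkBundled(String className) {
        return className != null && className.startsWith(JDK_INTERNAL_PREFIX);
    }

    /** Class name of the XMLReader that JAXP supplies by default. */
    static String defaultParserClassName() {
        try {
            return SAXParserFactory.newInstance().newSAXParser()
                    .getXMLReader().getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns a warning message for the bundled parser, or an empty string otherwise. */
    static String warningFor(String className) {
        if (!isJdkBundled(className)) {
            return "";
        }
        return "Warning: using the JDK built-in XML parser (" + className
             + "); consider Apache Xerces instead.";
    }

    public static void main(String[] args) {
        String warning = warningFor(defaultParserClassName());
        if (!warning.isEmpty()) {
            System.err.println(warning);
        }
    }
}
```

A check like this only identifies which parser is in use. It cannot tell whether a particular document will actually trigger the corruption.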


Updated by Michael Kay over 8 years ago

Category set to Third-party product
Status changed from New to Closed
Assignee set to Michael Kay
Priority changed from Low to Normal

Let's see if Oracle are more responsive to the latest bug report than they were last time.

I'm reluctant to start issuing warnings about using the JDK parser because people hate warnings; the difficulty is that although this problem is reproducible, it only affects a very small number of applications (or only affects them a very small proportion of the time), and under such conditions many people would regard the warning as alarmist.
