Bug #2533: Character Duplication during Serialization - Saxon - Saxonica Developer Community

Added by Nick Nunes over 8 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assignee: Michael Kay
Category: JAXP Java API
Start date: 2015-12-07

Description

Hi,

We've seen this bug a few times over the years and were finally able to isolate it. When certain Unicode characters from the supplementary planes (above U+FFFF) appear in attribute values, they are duplicated during serialization. Because our pipeline serializes multiple times, the result is file sizes that balloon exponentially. In the attached example the character is U+1D6A4 "MATHEMATICAL ITALIC SMALL DOTLESS I". It appears twice in input.xml; after a basic identity transform, the output contains it three times.
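For readers unfamiliar with supplementary-plane characters, here is a minimal Java sketch of how U+1D6A4 is represented. The class and method names (SupplementaryCharDemo, utf8Hex) are made up for illustration and are not part of Saxon or the attachment. It shows that the character needs two UTF-16 code units (a surrogate pair) in a Java String and four bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {

    /** Returns the UTF-8 encoding of a single code point as space-separated lowercase hex. */
    public static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x1D6A4; // MATHEMATICAL ITALIC SMALL DOTLESS I
        String s = new String(Character.toChars(cp));
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual character)
        System.out.println(utf8Hex(cp));                     // f0 9d 9a a4
    }
}
```

Any code that handles such text one char at a time, rather than one code point at a time, has to treat the two surrogates as a unit. That is where bugs of this kind tend to arise.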

I've been able to reproduce this in multiple versions of Saxon, as far back as 8.9 EE and as recent as 9.7.0.1J PE. Interestingly, I am not able to reproduce it in Oxygen 16.1.

If and when this is fixed, a maintenance release of 9.5 (the version we use in our processing pipeline) would be very helpful.

Thank you for your assistance.

Files

CharacterDuplicationBug.zip (845 Bytes)

Nick Nunes, 2015-12-07 20:17

Updated by Nick Nunes over 8 years ago

Correction: I just added some introspection, and the duplication happens during parsing, not serialization.

Updated by Michael Kay over 8 years ago

Which XML parser are you using? Corruption of attribute values is a common problem with the JDK parser, which is why I always recommend using Apache Xerces in preference. It's possible it's been fixed in JDK 8, but I'll reserve judgement on that.
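One way to answer the "which parser" question from code is to ask JAXP what it hands out by default. The following is a small sketch; the class name WhichParser and its method are illustrative only. It prints the class of the XMLReader behind the default SAXParserFactory, which should correspond to the name shown on the "Using parser" lines in the Saxon output quoted in this thread.

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {

    /** Returns the class name of the XMLReader that the default JAXP factory supplies. */
    public static String defaultSaxReaderClass() {
        try {
            return SAXParserFactory.newInstance()
                    .newSAXParser()
                    .getXMLReader()
                    .getClass()
                    .getName();
        } catch (Exception e) {
            // ParserConfigurationException or SAXException: no usable parser could be created
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        // A name starting with com.sun.org.apache.xerces.internal indicates the JDK's built-in parser;
        // a name starting with org.apache.xerces indicates Apache Xerces on the classpath.
        System.out.println(defaultSaxReaderClass());
    }
}
```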

Updated by Michael Kay over 8 years ago

Here's what the input file contains in hex, according to net.sf.saxon.functions.UnparsedText:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð ¤ ð ¤ " / >

So yes, the sequence (f0 9d 9a a4) appears twice in the value of the attribute.

When I transform this with:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.6.0_27

Generating byte code...

Stylesheet compilation time: 355.856ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser

I get a file whose content is identical:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð ¤ ð ¤ " / >

If I remove the Apache parser from the classpath and use the JDK parser:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.6.0_27

Stylesheet compilation time: 300.808ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser

I get this output file:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð ¤ ð ¤ ð ¤ " / >

which contains the character 3 times.

So yes, it's the old JDK parser bug, I'm afraid.
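The comparison above can also be made without Saxon in the loop. Below is a rough Java sketch, with hypothetical names (AttributeDuplicationCheck, attributeCodePoints, SAMPLE), that parses a document equivalent to the input.xml shown in the hex dump using whatever SAX parser JAXP selects, and counts the code points the parser reports for the sortable attribute. A correct parser should report 2; per the runs in this thread, the JDK parser on Java 6 and Java 8 would be expected to report 3. The result on other JDK versions is not something this thread establishes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeDuplicationCheck {

    /** Same content as input.xml in the thread: U+1D6A4 twice, written as surrogate pairs. */
    public static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<name sortable=\"\uD835\uDEA4\uD835\uDEA4\"/>";

    /**
     * Parses the XML with the default JAXP SAX parser and returns the number of
     * code points in the first occurrence of the named attribute, or -1 if absent.
     */
    public static int attributeCodePoints(String xml, final String attrName) {
        final int[] count = {-1};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String value = atts.getValue(attrName);
                if (value != null && count[0] < 0) {
                    count[0] = value.codePointCount(0, value.length());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Expected value is 2; anything larger means the parser duplicated a character.
        System.out.println(attributeCodePoints(SAMPLE, "sortable"));
    }
}
```

Running this once with only the JDK on the classpath and once with Apache Xerces added isolates the parser as the variable, independent of the stylesheet and the serializer.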

I ran it with Java 8:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.8.0_25

Stylesheet compilation time: 361.745129ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser

and it seems the bug is still there:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð ¤ ð ¤ ð ¤ " / >

Just use Apache Xerces!
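For applications that create their own SAX parsers, one possible way to follow this advice is to request the Xerces factory by name instead of relying on classpath order. This is a sketch, not a Saxon-specific recipe: the class PreferXerces is hypothetical, the factory class name is the one shipped in Apache Xerces-J, and the fallback to the JDK default is an assumption about what the caller wants when Xerces is missing.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

public class PreferXerces {

    /** Factory class provided by the Apache Xerces-J jar (xercesImpl). */
    public static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /**
     * Returns an Apache Xerces SAXParserFactory if Xerces is on the classpath,
     * otherwise falls back to the JAXP default (normally the JDK's built-in parser).
     */
    public static SAXParserFactory newFactory() {
        try {
            // A null class loader means the thread context class loader is used.
            return SAXParserFactory.newInstance(XERCES_FACTORY, null);
        } catch (FactoryConfigurationError e) {
            // Xerces not available; the caller may prefer to fail here instead of falling back.
            return SAXParserFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        System.out.println(newFactory().getClass().getName());
    }
}
```

In a pipeline affected by this bug, failing fast instead of silently falling back may be the safer choice, since the fallback reintroduces the parser that corrupts the attribute values.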

Updated by Michael Kay over 8 years ago

Category set to JAXP Java API
Priority changed from Low to Normal

I have reported a bug to Oracle. I've done this in the past with no discernible effect, but who knows, they might be more interested this time. Here is their acknowledgement:

Dear Java Developer,

Thank you for reporting this issue.

We are evaluating this report and have assigned it a Review ID: JI-9027283. In the event this report is determined to be a defect or enhancement request, it will be referenced with a new Bug ID and will be listed on Bugs.java.com. For other related issues, please visit our Bug Database at http://bugs.java.com.

We try to process all newly posted bugs in a timely manner, but make no promises about the amount of time in which a bug might be fixed. If the issue just reported could have a major impact on your project, consider using one of the technical support offerings available at Oracle Support.

Regards,

Java Community Developer Support
