Bug #2533

closed

Character Duplication during Serialization

Added by Nick Nunes over 8 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: JAXP Java API
Sprint/Milestone: -
Start date: 2015-12-07
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Hi,

We've seen this bug a few times over the years and were finally able to isolate it. When certain Unicode characters from the supplementary planes (above U+FFFF) appear in attribute values, they are duplicated during serialization. Because our pipeline serializes multiple times, we see this as exponentially ballooning file sizes. In the attached example the character is U+1D6A4 "MATHEMATICAL ITALIC SMALL DOTLESS I". In input.xml it appears twice; after a basic identity transform, the output contains it three times.

I've been able to replicate this in multiple versions of Saxon, as far back as 8.9 EE and as recent as 9.7.0.1J PE. Interestingly, I am not able to duplicate it in Oxygen 16.1.

If and when this is fixed, if we could get a maintenance release of 9.5 (the version we use in our processing pipeline) it would be very helpful.

Thank you for your assistance.


Files

CharacterDuplicationBug.zip (845 Bytes), Nick Nunes, 2015-12-07 20:17
Actions #1

Updated by Nick Nunes over 8 years ago

Correction: after adding some introspection, I found that the duplication happens during parsing, not serialization.
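One way to check the parsing stage in isolation, without Saxon or a serializer involved, is to hand the same bytes to the default JAXP SAX parser and count the code points it reports for the attribute. The following is only a minimal sketch under that assumption: the class and method names (AttributeParseCheck, codePointsInFirstAttribute) are hypothetical, and the document is rebuilt in memory rather than read from the attached input.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Saxon or of the attached test case.
final class AttributeParseCheck {

    /**
     * Parses the document with the default JAXP SAX parser and returns the
     * number of Unicode code points in the first attribute of the first
     * element that has attributes (-1 if no attribute is found).
     */
    static int codePointsInFirstAttribute(byte[] xml) {
        final int[] count = {-1};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (count[0] < 0 && atts.getLength() > 0) {
                    String value = atts.getValue(0);
                    // Count code points, not UTF-16 chars: U+1D6A4 is two chars.
                    count[0] = value.codePointCount(0, value.length());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String ch = new String(Character.toChars(0x1D6A4));
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<name sortable=\"" + ch + ch + "\"/>";
        int n = codePointsInFirstAttribute(doc.getBytes(StandardCharsets.UTF_8));
        // 2 is the correct value; 3 would indicate the duplication described here.
        System.out.println("code points reported by parser: " + n);
    }
}
```

If the count differs from 2 here, the extra character is already present in the SAX events, before any transformation or serialization takes place.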

Actions #2

Updated by Michael Kay over 8 years ago

Which XML parser are you using? Corruption of attribute values is a common problem with the JDK parser, which is why I always recommend use of Apache Xerces in preference. It's possible it's been fixed in JDK 8, but I'll reserve judgement on that.
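For anyone unsure which parser their environment picks up, a short sketch like the following prints the class that JAXP resolves by default. The class and method names (WhichParser, defaultXmlReaderClass) are hypothetical; the JAXP calls themselves are standard.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper for illustration only.
final class WhichParser {

    /** Returns the class name of the XMLReader that JAXP supplies by default. */
    static String defaultXmlReaderClass() {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return reader.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Using parser " + defaultXmlReaderClass());
    }
}
```

A name beginning with com.sun.org.apache.xerces.internal indicates the parser bundled with the JDK; a name beginning with org.apache.xerces indicates Apache Xerces found on the classpath.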

Actions #3

Updated by Michael Kay over 8 years ago

Here's what the input file contains in hex, according to net.sf.saxon.functions.UnparsedText:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð  š ¤ ð  š ¤ " / >

So yes, the sequence (f0 9d 9a a4) appears twice in the value of the attribute.
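As a cross-check on the dump, this small sketch encodes U+1D6A4 and prints its UTF-8 bytes and its UTF-16 code units. The class and method names (AstralCharBytes, utf8Hex, utf16Hex) are hypothetical and used only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class AstralCharBytes {

    /** Space-separated lowercase hex of the string's UTF-8 encoding. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Space-separated lowercase hex of the string's UTF-16 code units. */
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ch = new String(Character.toChars(0x1D6A4));
        System.out.println(utf8Hex(ch + ch)); // prints "f0 9d 9a a4 f0 9d 9a a4"
        System.out.println(utf16Hex(ch));     // prints "d835 dea4"
    }
}
```

The second line of output shows that, inside a Java String, the character is a surrogate pair: one code point held in two char values.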

When I transform this with:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.6.0_27

Generating byte code...

Stylesheet compilation time: 355.856ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser

I get a file whose content is identical:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð  š ¤ ð  š ¤ " / >

If I remove the Apache parser from the classpath and use the JDK parser:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.6.0_27

Stylesheet compilation time: 300.808ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser

I get this output file:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð  š ¤ ð  š ¤ ð  š ¤ " / >

which contains the character 3 times.

So yes, it's the old JDK parser bug, I'm afraid.

I ran it with Java 8:

Saxon-EE 9.7.0.1J from Saxonica

Java version 1.8.0_25

Stylesheet compilation time: 361.745129ms

Processing file:/Users/mike/bugs/2015/nunes/input.xml

Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser

and it seems the bug is still there:

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n

67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22

g = " U T F - 8 " ? > < n a m e s o r t a b l e = "

f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e

ð  š ¤ ð  š ¤ ð  š ¤ " / >

Just use Apache Xerces!
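For code that creates its own parser, one possible way to apply this is to ask JAXP for the Xerces factory by name. This is a sketch only: it assumes the Apache Xerces jar is on the classpath, the class name PreferXerces is hypothetical, and the fallback branch is just there so the example still runs without Xerces (in which case the JDK parser, and the bug, remain in use).

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper for illustration only.
final class PreferXerces {

    // Factory class shipped in the Apache Xerces-J jar.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Returns a namespace-aware factory, using Apache Xerces when it can be loaded. */
    static SAXParserFactory newFactory() {
        SAXParserFactory factory;
        try {
            factory = SAXParserFactory.newInstance(
                    XERCES_FACTORY, PreferXerces.class.getClassLoader());
        } catch (FactoryConfigurationError e) {
            // Xerces is not on the classpath: fall back to the JDK default.
            System.err.println("Apache Xerces not found; using the JDK parser");
            factory = SAXParserFactory.newInstance();
        }
        factory.setNamespaceAware(true);
        return factory;
    }

    public static void main(String[] args) {
        System.out.println("Factory in use: " + newFactory().getClass().getName());
    }
}
```

The same choice can be made without code changes by setting the standard JAXP system property, for example -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl on the java command line.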

Actions #4

Updated by Michael Kay over 8 years ago

  • Category set to JAXP Java API
  • Priority changed from Low to Normal

I have reported a bug to Oracle. I've done this in the past with no discernible effect, but who knows, they might be more interested this time. Here is their acknowledgement:

Dear Java Developer,

Thank you for reporting this issue.

We are evaluating this report and have assigned it a Review ID: JI-9027283. In the event this report is determined to be a defect or enhancement request, it will be referenced with a new Bug ID and will be listed on Bugs.java.com. For other related issues, please visit our Bug Database at http://bugs.java.com.

We try to process all newly posted bugs in a timely manner, but make no promises about the amount of time in which a bug might be fixed. If the issue just reported could have a major impact on your project, consider using one of the technical support offerings available at Oracle Support.

Regards,

Java Community Developer Support

Actions #5

Updated by Michael Kay over 8 years ago

  • Status changed from New to Closed
  • Assignee set to Michael Kay

Marking this as Closed: it is a JDK bug, with a workaround that involves using the Apache Xerces parser.

Actions #6

Updated by Michael Kay almost 6 years ago

Update: this bug appears to be fixed in Java 9. See https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8145969
