Bug #2533
closed
Character Duplication during Serialization
Description
Hi,
We've seen this bug a few times over the years and were finally able to isolate it. When certain Unicode characters from the supplementary planes (above U+FFFF) appear in attribute values, they are duplicated during serialization. Because our pipeline serializes multiple times, we always encounter this as exponentially ballooning file sizes. In the attached example the specific character is U+1D6A4 "MATHEMATICAL ITALIC SMALL DOTLESS I". In the file input.xml, it appears twice. When the file is run through a basic identity transform, the output contains the character three times.
I've been able to reproduce this in multiple versions of Saxon, as far back as 8.9 EE and as recent as 9.7.0.1J PE. Interestingly, I am not able to reproduce it in Oxygen 16.1.
If and when this is fixed, a maintenance release of 9.5 (the version we use in our processing pipeline) would be very helpful.
Thank you for your assistance.
Updated by Nick Nunes about 9 years ago
Correction: I just added some introspection, and the duplication happens during parsing, not serialization.
Updated by Michael Kay about 9 years ago
Which XML parser are you using? Corruption of attribute values is a common problem with the JDK parser, which is why I always recommend use of Apache Xerces in preference. It's possible it's been fixed in JDK 8, but I'll hold my judgement on that.
Updated by Michael Kay about 9 years ago
Here's what the input file contains in hex, according to net.sf.saxon.functions.UnparsedText:
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e
< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n
67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22
g = " U T F - 8 " ? > < n a m e s o r t a b l e = "
f0 9d 9a a4 f0 9d 9a a4 22 2f 3e
ð ¤ ð ¤ " / >
So yes, the sequence (f0 9d 9a a4) appears twice in the value of the attribute.
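For anyone wanting to check the byte sequence independently, here is a minimal Java sketch showing that U+1D6A4 encodes to f0 9d 9a a4 in UTF-8 and occupies two UTF-16 code units (a surrogate pair) in a Java String. The class name DotlessIEncoding and the helper hex are made up for illustration; they are not part of Saxon or the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class DotlessIEncoding {

    /** Formats bytes as space-separated lowercase hex, in the style of the dump above. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D6A4 MATHEMATICAL ITALIC SMALL DOTLESS I lies outside the Basic Multilingual Plane.
        String ch = new String(Character.toChars(0x1D6A4));

        System.out.println(hex(ch.getBytes(StandardCharsets.UTF_8))); // prints "f0 9d 9a a4"
        System.out.println(ch.length());                              // prints 2 (UTF-16 code units)
        System.out.println(ch.codePointCount(0, ch.length()));        // prints 1 (one character)
        System.out.println(Integer.toHexString(ch.charAt(0)));        // prints "d835" (high surrogate)
        System.out.println(Integer.toHexString(ch.charAt(1)));        // prints "dea4" (low surrogate)
    }
}
```

Because the character is a surrogate pair in Java, counting code points rather than char values is the reliable way to tell two occurrences from three.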
When I transform this with:
Saxon-EE 9.7.0.1J from Saxonica
Java version 1.6.0_27
Generating byte code...
Stylesheet compilation time: 355.856ms
Processing file:/Users/mike/bugs/2015/nunes/input.xml
Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
I get a file whose content is identical:
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e
< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n
67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22
g = " U T F - 8 " ? > < n a m e s o r t a b l e = "
f0 9d 9a a4 f0 9d 9a a4 22 2f 3e
ð ¤ ð ¤ " / >
If I remove the Apache parser from the classpath and use the JDK parser:
Saxon-EE 9.7.0.1J from Saxonica
Java version 1.6.0_27
Stylesheet compilation time: 300.808ms
Processing file:/Users/mike/bugs/2015/nunes/input.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
I get this output file:
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e
< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n
67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22
g = " U T F - 8 " ? > < n a m e s o r t a b l e = "
f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e
ð ¤ ð ¤ ð ¤ " / >
which contains the character 3 times.
So yes, it's the old JDK parser bug, I'm afraid.
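The check can also be done without Saxon in the loop. The following is a rough sketch, not the test actually used here: it rebuilds the 65-byte document from the hex dump, parses it with whatever SAX parser JAXP selects by default, and counts the code points in the sortable attribute. The class and method names are invented for illustration. A result of 2 means the parser is behaving; a result of 3 would match the duplication shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check class, for illustration only.
final class AttributeDuplicationCheck {

    /** Rebuilds the input document from the hex dump: U+1D6A4 twice in the attribute. */
    static byte[] sampleDocument() {
        String ch = new String(Character.toChars(0x1D6A4));
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name sortable=\""
                + ch + ch + "\"/>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the document and returns the code point count of @sortable, or -1 if absent. */
    static int parsedCodePoints(SAXParserFactory factory, byte[] xml) {
        final int[] count = {-1};
        try {
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    String value = atts.getValue("sortable");
                    if (value != null) {
                        count[0] = value.codePointCount(0, value.length());
                    }
                }
            });
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        int n = parsedCodePoints(factory, sampleDocument());
        System.out.println(factory.getClass().getName()
                + ": " + n + " code points in @sortable (2 expected)");
    }
}
```

Printing the factory class name alongside the count makes it clear which parser produced the result, in the same way as the "Using parser" line in the Saxon output.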
I ran it with Java 8:
Saxon-EE 9.7.0.1J from Saxonica
Java version 1.8.0_25
Stylesheet compilation time: 361.745129ms
Processing file:/Users/mike/bugs/2015/nunes/input.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
and it seems the bug is still there:
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e
< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n
67 3d 22 55 54 46 2d 38 22 3f 3e 3c 6e 61 6d 65 20 73 6f 72 74 61 62 6c 65 3d 22
g = " U T F - 8 " ? > < n a m e s o r t a b l e = "
f0 9d 9a a4 f0 9d 9a a4 f0 9d 9a a4 22 2f 3e
ð ¤ ð ¤ ð ¤ " / >
Just use Apache Xerces!
Updated by Michael Kay about 9 years ago
- Category set to JAXP Java API
- Priority changed from Low to Normal
I have reported a bug to Oracle. I've done this in the past with no discernible effect, but who knows, they might be more interested this time. Here is their acknowledgement:
Dear Java Developer,
Thank you for reporting this issue.
We are evaluating this report and have assigned it a Review ID: JI-9027283. In the event this report is determined to be a defect or enhancement request, it will be referenced with a new Bug ID and will be listed on Bugs.java.com. For other related issues, please visit our Bug Database at http://bugs.java.com.
We try to process all newly posted bugs in a timely manner, but make no promises about the amount of time in which a bug might be fixed. If the issue just reported could have a major impact on your project, consider using one of the technical support offerings available at Oracle Support.
Regards,
Java Community Developer Support
Updated by Michael Kay about 9 years ago
- Status changed from New to Closed
- Assignee set to Michael Kay
Marking this as Closed: it is a JDK bug, with a workaround that involves using the Apache Xerces parser.
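For applications that create the parser themselves through JAXP, one way to apply the workaround is to ask for the Apache Xerces factory by name rather than relying on the default lookup. This is only a sketch under assumptions: it assumes xercesImpl.jar is on the classpath and that its factory class is org.apache.xerces.jaxp.SAXParserFactoryImpl; the class name ParserSelection and the method preferXerces are invented for illustration. If Xerces is not found, the sketch falls back to the JDK default, which on affected Java versions still has the bug.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, for illustration only.
final class ParserSelection {

    // Assumption: the JAXP factory class provided by Apache Xerces-J.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Returns the Apache Xerces factory if it can be loaded, else the JAXP default. */
    static SAXParserFactory preferXerces() {
        try {
            // A null class loader means the context class loader is used.
            return SAXParserFactory.newInstance(XERCES_FACTORY, null);
        } catch (FactoryConfigurationError e) {
            // Xerces is not on the classpath: fall back to the built-in parser.
            return SAXParserFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        SAXParserFactory factory = preferXerces();
        System.out.println("Using factory " + factory.getClass().getName());
    }
}
```

A stricter variant would let the FactoryConfigurationError propagate, so that a missing Xerces jar fails loudly instead of silently reverting to the parser that corrupts attribute values.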
Updated by Michael Kay over 6 years ago
Update: this bug appears to be fixed in Java 9. See
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8145969