upper unicode characters in entity references
Added by Anonymous almost 18 years ago
Legacy ID: #4152651 Legacy Poster: Christian Wittern (cwittern)
I just noticed that my saxon (8.71J) has a problem with upper unicode characters (by this, I mean characters that have to be expressed using surrogates in UTF-16), as can be demonstrated by these files: test.xml: <?xml version="1.0" encoding="utf-8"?> <!DOCTYPE div [ <!ELEMENT div ANY > <!ENTITY test1 '
Replies (7)
RE: upper unicode characters in entity references - Added by Anonymous almost 18 years ago
Legacy ID: #4156744 Legacy Poster: Michael Kay (mhkay)
I'm afraid the obvious inference from my point of view is that test1 (whose content I can't actually see in either the forum or the email notification) is incorrectly encoded. Either that, or the XML parser has processed it incorrectly and passed the wrong thing to Saxon, which seems unlikely, but you don't actually say which XML parser you are using. It would be clearer if you used, say, us-ascii for the output encoding, so that we could see exactly what the characters are. But I think you will find that Saxon is simply passing on what it is given by the XML parser. Michael Kay Saxonica
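A minimal sketch of the us-ascii suggestion, for readers who want to try it from Java. It assumes a plain JAXP identity transformation rather than the poster's actual stylesheet, and the class and method names (`AsciiOutputDemo`, `serializeAsAscii`) are invented for illustration. With an ASCII output encoding, any character the encoding cannot represent has to be written as a numeric character reference, which makes its code point visible. The exact form of the reference (decimal or hexadecimal), and how a supplementary character is rendered, depends on which serializer is on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiOutputDemo {

    // Hypothetical helper: parses the XML and re-serializes it as US-ASCII
    // using whatever JAXP TransformerFactory is found on the classpath.
    public static String serializeAsAscii(String xml) throws Exception {
        // newTransformer() with no stylesheet gives an identity transformation.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // U+00E9 is not representable in US-ASCII, so it should come out
        // as a numeric character reference instead of a literal character.
        System.out.println(serializeAsAscii("<div>\u00e9</div>"));
    }
}
```

In an XSLT stylesheet the equivalent setting would be the `encoding` attribute of `xsl:output`.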
RE: upper unicode characters in entity references - Added by Anonymous almost 18 years ago
Legacy ID: #4157499 Legacy Poster: Christian Wittern (cwittern)
Thanks for responding to this.

>I'm afraid the obvious inference from my point of view is that test1 (whose
>content I can't actually see in either the forum or the email notification)

Should I send it to you as email attachment? That might get through without problems, although you still would need to have the correct font.

>is incorrectly encoded. Either that, or the XML parser has processed it incorrectly
>and passed the wrong thing to Saxon, which seems unlikely, but you don't actually
>say which XML parser you are using.

I don't (knowingly) use any XML parser, I am just calling saxon on the commandline. I get the same result however, if I do the transformation in Oxygen.

>It would be clearer if you used say us-ascii for the output encoding so we could
>see clearly what the characters actually are. But I think you will find that
>Saxon is simply passing on what it is given by the XML parser.

Using us-ascii as the output encoding (and the file slightly changed for readability), I get the following result:

Source: <?xml version="1.0" encoding="utf-16"?> <!DOCTYPE div [ <!ELEMENT div ANY > <!ENTITY test1 '
RE: upper unicode characters in entity references - Added by Anonymous almost 18 years ago
Legacy ID: #4157561 Legacy Poster: Michael Kay (mhkay)
You'd better send me a copy of the file (or post it as an attachment in the support-requests tracker)
RE: upper unicode characters in entity references - Added by Anonymous almost 18 years ago
Legacy ID: #4160960 Legacy Poster: Christian Wittern (cwittern)
I have done so. Not knowing the dependencies, do you think it necessary to report to the Xerces team as well? The bug itself is pretty severe I think, since it leads to silent corruption of data, although the impact in terms of users involved might still be limited due to the special characters involved. Christian
RE: upper unicode characters in entity references - Added by Anonymous almost 18 years ago
Legacy ID: #4161166 Legacy Poster: Michael Kay (mhkay)
I think it's a pretty clear bug and therefore I would recommend reporting it. That's the contribution you make in return for getting free open source software.
RE: upper unicode characters in entity references - Added by Anonymous over 17 years ago
Legacy ID: #4183257 Legacy Poster: Santiago Pericas-Geertsen (spericas)
I have looked at this issue in JAXP 1.4 / JDK 6.0. As far as I can tell, Xerces SAX is reporting these characters correctly. The character U+2643D is reported as the UTF-16 surrogate pair \ud859\udc3d, which is correct, and the same surrogate pair is reported regardless of whether an entity is used or not. The code that I used to test this is shown below (I hope I haven't simplified the problem too much). Is it possible that the issue is in Saxon, in particular in the way it serializes UTF-16 surrogate pairs?

--
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE div [
<!ELEMENT div ANY>
<!ENTITY test2 '𦐽'>
]>
<div>&test2;𦐽</div>
--
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import junit.textui.TestRunner;
import junit.framework.TestCase;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Unit test for CR 6526547.
 *
 * @author Santiago.PericasGeertsen@sun.com
 */
public class Test extends TestCase {

    static class MyHandler extends DefaultHandler {
        public void characters(char ch[], int start, int length)
            throws SAXException
        {
            // The reported text begins at ch[start], not necessarily at ch[0]
            assertTrue(Character.isSurrogatePair(ch[start], ch[start + 1]));
            int c = Character.toCodePoint(ch[start], ch[start + 1]);
            System.out.println("Codepoint = " + Integer.toString(c, 16));
        }
    }

    public static void main(String[] args) {
        TestRunner.run(Test.class);
    }

    public Test() {
    }

    public void test() {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser parser = spf.newSAXParser();
            // test.xml must be on the classpath next to this class
            parser.parse(getClass().getResource("test.xml").getFile(),
                         new MyHandler());
        }
        catch (Exception e) {
            fail("Unable to configure parser");
        }
    }
}
--
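For reference, the surrogate pair quoted in this post can be checked without involving any XML parser. The following is a minimal sketch of the standard UTF-16 encoding arithmetic; the class name `SurrogateCheck` and the helper `toSurrogates` are invented for illustration, and the result is compared against `Character.toChars` from the JDK.

```java
public class SurrogateCheck {

    // Hypothetical helper: manual UTF-16 encoding of a supplementary
    // code point (one above U+FFFF) into a high/low surrogate pair.
    public static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x2643D;
        char[] manual = toSurrogates(codePoint);
        char[] library = Character.toChars(codePoint);

        System.out.println(String.format("manual:  %04x %04x",
                (int) manual[0], (int) manual[1]));
        System.out.println(String.format("library: %04x %04x",
                (int) library[0], (int) library[1]));

        // Round trip back to the original code point.
        System.out.println(Integer.toHexString(
                Character.toCodePoint(manual[0], manual[1])));
    }
}
```

Both lines should show d859 dc3d, matching the pair reported by the SAX handler.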
RE: upper unicode characters in entity references - Added by Anonymous over 17 years ago
Legacy ID: #4183411 Legacy Poster: Michael Kay (mhkay)
As I reported on the support-requests forum https://sourceforge.net/tracker/index.php?func=detail&aid=1660205&group_id=29872&atid=397618 Saxon is delivering the correct results on JDK 1.4 and on .NET. The error occurs only on JDK 1.5, that is, with the Xerces parser. Moreover, it occurs only with characters in entity references. Since the Saxon code is identical on all platforms, and since Saxon has no way of distinguishing whether the characters were part of an entity reference or not (they are reported in exactly the same way), I think this is pretty convincing evidence of an XML parser bug. Of course, the bug may well be present in some versions of Xerces and not others. Michael Kay http://www.saxonica.com/