Recover from Illegal character error?

Added by Anonymous over 14 years ago

Legacy ID: #8406439 Legacy Poster: Meikel Bisping (mbisping)

Even though everything is declared as UTF-8, I sometimes get a net.sf.saxon.trans.DynamicError: Illegal HTML character caused by bad user input. Is there a way to tell Saxon to output question marks for such characters instead of giving up on the whole XSL transformation? Any help appreciated.

Replies (5)

RE: Recover from Illegal character error? - Added by Anonymous over 14 years ago

Legacy ID: #8406589 Legacy Poster: Michael Kay (mhkay)

No, I'm sorry, there was an intense debate on this in the W3C before the specification was finalized, and the outcome was that a conformant processor MUST report this error. Your only way around this (other than correcting the input at source, which is of course the preferred remedy) is to write your own (non-conformant) serializer by modifying or subclassing the Saxon serializer. The error usually occurs because the input XML declares its encoding as UTF-8 (explicitly or implicitly) when it is actually Windows cp1252.
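
The encoding mismatch described above can sometimes be repaired in the input before the transformation runs, rather than in the serializer. The following is a minimal sketch, not part of Saxon. It assumes the bad input shows up as C1 control characters (U+0080 to U+009F) because cp1252 bytes were decoded as ISO-8859-1 somewhere upstream, and that the JVM provides the windows-1252 charset (common, but not guaranteed by the Java specification). The class name C1Repair and the method repair are hypothetical.

```java
import java.nio.charset.Charset;

public class C1Repair {
    // Assumption: this JVM ships the windows-1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Replaces each C1 control character (U+0080..U+009F) with the character
     * that the same byte value means in cp1252, or with '?' if cp1252 leaves
     * that byte undefined. All other characters pass through unchanged.
     */
    public static String repair(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Reinterpret the code point as a single cp1252 byte.
                char d = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // Undefined cp1252 bytes may decode to U+FFFD or stay a control character.
                boolean undefined = d == '\uFFFD' || (d >= 0x80 && d <= 0x9F);
                out.append(undefined ? '?' : d);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, C1Repair.repair("\u0095 item") returns a string that starts with a bullet (U+2022) instead of the control character U+0095. This covers only one failure mode; correcting the encoding where the data enters the system remains the preferred remedy.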

RE: Recover from Illegal character error? - Added by Anonymous over 14 years ago

Legacy ID: #8413353 Legacy Poster: Meikel Bisping (mbisping)

We have Windows users copy text (with bullet signs and the like) into a web form. It is then saved in a Postgres database, and later reports are created with Saxon XSLT. I had hoped that "illegal character" errors would end when changing the entire system from ISO-8859-1 to UTF-8 (including a CharacterEncodingFilter for the web form - http://www.javaworld.com/javaworld/jw-09-2004/jw-0906-unicode.html?page=3), but problems remain. Do you have a suggestion how a system with tomcat, dbforms, postgres and saxon should be set up for European languages to avoid illegal character problems?

RE: Recover from Illegal character error? - Added by Anonymous over 14 years ago

Legacy ID: #8413366 Legacy Poster: Michael Kay (mhkay)

"Do you have a suggestion how a system with tomcat, dbforms, postgres and saxon should be set up for European languages to avoid illegal character problems?"

I'm afraid that's way too big a question for this forum. However, the short answer is that to get this right, you need a full picture of all the data flows in the system, including the points at which character data enters the system, and you need to audit all these data flows to ensure that the sender and the recipient are in agreement about the character encoding that is used at each interface between system components. You also need to look at any place where persistent data is stored to ensure that the encoding of character data held in that data store is reliably known. Not easy to achieve.

RE: Recover from Illegal character error? - Added by Anonymous over 14 years ago

Legacy ID: #8413458 Legacy Poster: David Lee (daldei)

To add my .02 Europenny: not only do you need to look for, and control, every input and output flow of data end to end, including databases, files, HTML forms etc., but you also mentioned tomcat, which implies Java. In Java it is extremely difficult to avoid character mis-encoding problems because of the default constructors and methods which convert from bytes to chars (or byte[] to String). These default methods will use the system character encoding, which is almost certainly not what you want. Every single call to a constructor or method that takes a byte/byte[] and returns a String/char/char[] or vice versa has to be inspected and the correct character encoding used consistently, including constructors of objects which later on do the transcoding (like Reader or Writer objects). This can be extremely difficult or impossible if using ANY third-party libraries that have this problem internally. To properly handle Unicode data end to end is a vast task and often requires expertise and cooperation across all branches of an organization.
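
The point about default conversions can be illustrated with a short sketch. It shows the charset-taking overloads of String.getBytes, the String constructor and InputStreamReader, all of which are standard Java APIs. The class name ExplicitCharsets and its methods are hypothetical, and UTF-8 is assumed to be the encoding agreed on for the whole system.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    /** Encodes with an explicit charset; text.getBytes() would use the platform default. */
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with an explicit charset; new String(bytes) would use the platform default. */
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reads one line through a Reader whose charset is stated, not inherited from the platform. */
    public static String readFirstLine(byte[] utf8Bytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With these helpers a round trip such as ExplicitCharsets.decode(ExplicitCharsets.encode(text)) no longer depends on the platform default encoding. The same inspection has to be applied to every library that converts between bytes and characters on the application's behalf.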

RE: Recover from Illegal character error? - Added by Anonymous over 14 years ago

Legacy ID: #8413500 Legacy Poster: Meikel Bisping (mbisping)

We used Tomcat (Java servlets), dbforms, postgres and saxon with ISO-8859-1. It worked fine, including with German umlauts, until we introduced a new function where people copy text from Word into the browser form with the EUR sign, bullets and the like. I thought changing all encodings to UTF-8 would help. It works if you enter text into the form manually, but inserting text from the clipboard (originating from Word) still causes problems. I've read about an attribute accept-charset='UTF-8' in the <form> tag; alternatively, I'll try to set the CharacterEncodingFilter in the web.xml to cp1252 to see if that helps.
