Bug #1852
closedArrayIndexOutOfBounds doing Unicode normalization
100%
Description
Reported on saxon-help list by High Cayless:
Hi, I'm running into a sporadic exception when converting NFC strings to NFD, using Saxon 9.5.1.1 HE:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at net.sf.saxon.tree.util.FastStringBuffer.insertWideChar(FastStringBuffer.java:407)
at net.sf.saxon.serialize.codenorm.Normalizer.internalDecompose(Normalizer.java:152)
at net.sf.saxon.serialize.codenorm.Normalizer.normalize(Normalizer.java:83)
at net.sf.saxon.functions.NormalizeUnicode.normalize(NormalizeUnicode.java:99)
at net.sf.saxon.functions.NormalizeUnicode.evaluateItem(NormalizeUnicode.java:35)
at net.sf.saxon.functions.NormalizeUnicode.evaluateItem(NormalizeUnicode.java:23)
I've played around a bit with varying strings, and it seems only to trigger in certain circumstances—maybe certain strings are sneaking by net.sf.saxon.tree.util.FastStringBuffer's ensureCapacity method?
Attached is a string that will consistently trigger the bug when passed to net.sf.saxon.serialize.codenorm.Normalizer's normalize method. Normalizer is set up to convert to NFD.
This seems like a regression, as I've been using Saxon on these files for a few years now, and I've only seen it since upgrading.
Thanks,
Hugh
The string in question is:
ἱερου ιβ´ ἔτουσ λγ Παῦνι κα μεμέτρηκεν εἰσ τὸν ἐν Διὸσ πόλει τῆι μεγάληι θησαυρὸν εἰσ τὴν ἐπιγραφὴν τοῦ τρίτου καὶ λ ἔτουσ ὑπὲρ τοῦ τόπου Σῶσασ Ἀλεξάνδρου κριθῆσ πέντε
Updated by Michael Kay over 10 years ago
Initial MHK response:
Thanks for reporting it and for providing a simple repro.
The code has indeed changed here since 9.4; it used to read
for (int i=used; i>index; i--) {
array[i+1] = array[i-1];
}
and this was changed to
System.arraycopy(array, index, array, index + 1 + 1, used - index);
by an automatic IntelliJ refactoring. I haven't yet worked out why this rewrite is incorrect. Debugging is complicated by the fact that it's a Heisenbug; it doesn't occur in the debugger because FastStringBuffer.toString() calls condense() which changes the value.
The code appears to be unnecessary in the case where (as here) the "insertion" is at the end of the string,and the simplest fix may be to avoid doing it in that situation.
Michael Kay
Updated by Michael Kay over 10 years ago
Switching the code back to the 9.4 version does not cure the problem: the rewritten code does appear to be equivalent.
Running under 9.4, the Unicode decomposition is not generating any astral characters. 9.4 generates 55296 where 9.5 generates 65909. So it seems that as well as a bug in FastStringBuffer.insertWideChar() (present in both releases) we might have a new bug in Unicode denormalization. This could be related to bug 1842.
No, Unicode normalization is generating the same characters. What has changed is that in 9.4 it used a StringBuffer, in 9.5 it is using a FastStringBuffer. So the incorrect code in FastStringBuffer was always there, it just wasn't being used during Unicode normalization. So this explains the regression.
Updated by Michael Kay over 10 years ago
- Status changed from In Progress to Resolved
Fixed FastStringBuffer.insertWideChar() by delegating to FastStringBuffer.appendWideChar() in the case where the insertion position is at the end of the string.
Patch committed on the 9.5 and 9.6 branches (but not 9.4, though the problem is also latent there and in earlier releases).
Updated by Michael Kay over 10 years ago
Changed the fix - it addressed the symptoms but not the cause. The cause is that too many chars are being shifted right to make room for the new char, which is caused by incrementing "used" before doing the shift. So the problem could also occur when inserting into the middle of the array. Changed the fix to increment "used" after doing the arraycopy, as done by the insert() method.
Updated by O'Neil Delpratt over 10 years ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in version set to 9.5.1.2
Bug fix applied in the Saxon maintenance release 9.5.1.2
Please register to edit this issue