Project

Profile

Help

Bug #1852

closed

ArrayIndexOutOfBounds doing Unicode normalization

Added by Michael Kay over 10 years ago. Updated over 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
XPath conformance
Sprint/Milestone:
-
Start date:
2013-07-23
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms:

Description

Reported on saxon-help list by High Cayless:

Hi, I'm running into a sporadic exception when converting NFC strings to NFD, using Saxon 9.5.1.1 HE:

java.lang.ArrayIndexOutOfBoundsException

at java.lang.System.arraycopy(Native Method)

at net.sf.saxon.tree.util.FastStringBuffer.insertWideChar(FastStringBuffer.java:407)

at net.sf.saxon.serialize.codenorm.Normalizer.internalDecompose(Normalizer.java:152)

at net.sf.saxon.serialize.codenorm.Normalizer.normalize(Normalizer.java:83)

at net.sf.saxon.functions.NormalizeUnicode.normalize(NormalizeUnicode.java:99)

at net.sf.saxon.functions.NormalizeUnicode.evaluateItem(NormalizeUnicode.java:35)

at net.sf.saxon.functions.NormalizeUnicode.evaluateItem(NormalizeUnicode.java:23)

I've played around a bit with varying strings, and it seems only to trigger in certain circumstances—maybe certain strings are sneaking by net.sf.saxon.tree.util.FastStringBuffer's ensureCapacity method?

Attached is a string that will consistently trigger the bug when passed to net.sf.saxon.serialize.codenorm.Normalizer's normalize method. Normalizer is set up to convert to NFD.

This seems like a regression, as I've been using Saxon on these files for a few years now, and I've only seen it since upgrading.

Thanks,

Hugh

The string in question is:

ἱερου ιβ´ ἔτουσ λγ Παῦνι κα μεμέτρηκεν εἰσ τὸν ἐν Διὸσ πόλει τῆι μεγάληι θησαυρὸν εἰσ τὴν ἐπιγραφὴν τοῦ τρίτου καὶ λ ἔτουσ ὑπὲρ τοῦ τόπου Σῶσασ Ἀλεξάνδρου κριθῆσ πέντε

Actions #1

Updated by Michael Kay over 10 years ago

Initial MHK response:

Thanks for reporting it and for providing a simple repro.

The code has indeed changed here since 9.4; it used to read

       for (int i=used; i>index; i--) {

           array[i+1] = array[i-1];

       }

and this was changed to

       System.arraycopy(array, index, array, index + 1 + 1, used - index);

by an automatic IntelliJ refactoring. I haven't yet worked out why this rewrite is incorrect. Debugging is complicated by the fact that it's a Heisenbug; it doesn't occur in the debugger because FastStringBuffer.toString() calls condense() which changes the value.

The code appears to be unnecessary in the case where (as here) the "insertion" is at the end of the string,and the simplest fix may be to avoid doing it in that situation.

Michael Kay

Actions #2

Updated by Michael Kay over 10 years ago

Switching the code back to the 9.4 version does not cure the problem: the rewritten code does appear to be equivalent.

Running under 9.4, the Unicode decomposition is not generating any astral characters. 9.4 generates 55296 where 9.5 generates 65909. So it seems that as well as a bug in FastStringBuffer.insertWideChar() (present in both releases) we might have a new bug in Unicode denormalization. This could be related to bug 1842.

No, Unicode normalization is generating the same characters. What has changed is that in 9.4 it used a StringBuffer, in 9.5 it is using a FastStringBuffer. So the incorrect code in FastStringBuffer was always there, it just wasn't being used during Unicode normalization. So this explains the regression.

Actions #3

Updated by Michael Kay over 10 years ago

  • Status changed from In Progress to Resolved

Fixed FastStringBuffer.insertWideChar() by delegating to FastStringBuffer.appendWideChar() in the case where the insertion position is at the end of the string.

Patch committed on the 9.5 and 9.6 branches (but not 9.4, though the problem is also latent there and in earlier releases).

Actions #4

Updated by Michael Kay over 10 years ago

Changed the fix - it addressed the symptoms but not the cause. The cause is that too many chars are being shifted right to make room for the new char, which is caused by incrementing "used" before doing the shift. So the problem could also occur when inserting into the middle of the array. Changed the fix to increment "used" after doing the arraycopy, as done by the insert() method.

Actions #5

Updated by O'Neil Delpratt over 10 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in version set to 9.5.1.2

Bug fix applied in the Saxon maintenance release 9.5.1.2

Please register to edit this issue

Also available in: Atom PDF