Bug #5838


ArrayIndexOutOfBounds with whitespace coming from StaxSource

Added by Michael Kay 11 months ago. Updated 9 months ago.

Start date:
Due date:
% Done:


Estimated time:
Legacy ID:
Applies to branch:
11, 12, trunk
Fix Committed on Branch:
11, 12, trunk
Fixed in Maintenance Release:
.NET, Java



The stacktrace starts

java.lang.ArrayIndexOutOfBoundsException: Index 8192 out of bounds for length 8192
    at net.sf.saxon.str.StringTool.compress( ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.str.CompressedWhitespace.compressWS( ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.str.StringTool.compress( ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.pull.StaxBridge.getStringValue( ~[Saxon-HE-11.4.jar:na]
Actions #1

Updated by Michael Kay 11 months ago

The cause looks pretty clear from the stack trace: StringTool#277 is calling CompressedWhitespace.compressWS(in, offset, end); but the method declaration is public static UnicodeString compressWS(char[] in, int start, int len) so we're clearly passing an "end" offset where a length is expected.

Which obviously raises questions like (a) why is it only happening on this path, and (b) how can we write a regression test to catch the failure and make sure it's fixed.

(This confusion between end offset and length is a real elephant trap. Both conventions are widely used in Java APIs).

Actions #2

Updated by Michael Kay 11 months ago

The reason the problem doesn't arise when a push parser is used is that on that path we only compress whitespace after assembling multiple text fragments in a buffer, and as a result it always calls compress() with a start offset of 0, in which case end and length are interchangeable.

As for testing, we only run a rather small number of tests with a pull parser. I'll take a close look at some of those tests to see what path they are taking.

Actions #3

Updated by Michael Kay 11 months ago

The tests we are running use Woodstox, which does indeed supply text in a 4000-character buffer -- but like the SAX parser, it's always returning 0 as the result of reader.getTextStart(), which means we're again setting start=0 and therefore not seeing any distinction between end and length.

I think the best way of tackling this is going to be to write a unit test that invokes the compression methods directly, rather than trying to trigger an XML parser to set the right input conditions.

Actions #4

Updated by Michael Kay 11 months ago

  • Status changed from New to Resolved
  • Fix Committed on Branch 11, 12, trunk added
  • Platforms .NET added

Added unit tests in internaltests/CompressionTests.

Bug fixed.

Actions #5

Updated by O'Neil Delpratt 10 months ago

  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 11.5 added

Bug fix applied in the Saxon 11.5 maintenance release.

Actions #6

Updated by O'Neil Delpratt 9 months ago

  • Status changed from Resolved to Closed
  • Fixed in Maintenance Release 12.1 added

Bug fix applied in the Saxon 12.1 maintenance release.

Please register to edit this issue

Also available in: Atom PDF