Bug #5838: ArrayIndexOutOfBounds with whitespace coming from StaxSource - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #5838

closed

ArrayIndexOutOfBounds with whitespace coming from StaxSource

Added by Michael Kay about 2 years ago. Updated almost 2 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Michael Kay

Category:

Internals

Sprint/Milestone:

Start date:

2023-01-19

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

11, 12, trunk

Fix Committed on Branch:

11, 12, trunk

Fixed in Maintenance Release:

11.5, 12.1

Platforms:

.NET, Java

Description

See https://saxonica.plan.io/boards/3/topics/9228?pn=1

The stacktrace starts

java.lang.ArrayIndexOutOfBoundsException: Index 8192 out of bounds for length 8192
    at net.sf.saxon.str.StringTool.compress(StringTool.java:267) ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.str.CompressedWhitespace.compressWS(CompressedWhitespace.java:52) ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.str.StringTool.compress(StringTool.java:277) ~[Saxon-HE-11.4.jar:na]
    at net.sf.saxon.pull.StaxBridge.getStringValue(StaxBridge.java:456) ~[Saxon-HE-11.4.jar:na]

Actions

Copy link

Updated by Michael Kay about 2 years ago

The cause looks pretty clear from the stack trace: StringTool#277 is calling CompressedWhitespace.compressWS(in, offset, end); but the method declaration is public static UnicodeString compressWS(char[] in, int start, int len) so we're clearly passing an "end" offset where a length is expected.

Which obviously raises questions like (a) why is it only happening on this path, and (b) how can we write a regression test to catch the failure and make sure it's fixed.

(This confusion between end offset and length is a real elephant trap. Both conventions are widely used in Java APIs).

Actions

Copy link

Updated by Michael Kay about 2 years ago

The reason the problem doesn't arise when a push parser is used is that on that path we only compress whitespace after assembling multiple text fragments in a buffer, and as a result it always calls compress() with a start offset of 0, in which case end and length are interchangeable.

As for testing, we only run a rather small number of tests with a pull parser. I'll take a close look at some of those tests to see what path they are taking.

Actions

Copy link

Updated by Michael Kay about 2 years ago

The tests we are running use Woodstox, which does indeed supply text in a 4000-character buffer -- but like the SAX parser, it's always returning 0 as the result of reader.getTextStart(), which means we're again setting start=0 and therefore not seeing any distinction between end and length.

I think the best way of tackling this is going to be to write a unit test that invokes the compression methods directly, rather than trying to trigger an XML parser to set the right input conditions.

Actions

Copy link