Project

Profile

Help

Bug #5297

closed

String handling: performance

Added by Michael Kay about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Performance
Sprint/Milestone:
-
Start date:
2022-02-09
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
11, trunk
Fix Committed on Branch:
11, trunk
Fixed in Maintenance Release:
Platforms:

Description

The forum post at https://saxonica.plan.io/boards/3/topics/8512 reveals some specific performance issues with string handling.

The escaping of HTML URIs is identified as a specific issue. Looking at the detail:

(a) The UnicodeBuilder class isn't designed to optimize character-by-character construction of a string, but it's often being used that way

(b) the specific case of HTML URI escaping would probably be better done using standard Java strings / stringBuilders anyway

(c) ZenoString.consolidate0() contains a call to verifySegmentLengths(), which is a diagnostic method to check the integrity of the data structure and should have been disabled in production code.

Actions #1

Updated by Michael Kay about 2 years ago

I've written a test in junit/performance/playground that calls the HTML URI escaper on a few thousand URI-like strings, each containing about 20 ASCII characters followed by one non-ASCII (but BMP) character. Baseline timing is 245.9ms.

Switching off the call to ZenoString.verifySegmentLengths() brings it down to 227.9ms.

Stepping through the execution in the debugger, we're iterating over the code points in the input; most of them are ASCII. We call UnicodeBuilder.append(int) on each one. This wraps the int in a UnicodeCodepoint object (not good news) and calls ZenoString.concat().

UnicodeBuilder is using an underlying ZenoString for convenience and reuse but it's intrinsically not ideal for the task because we don't need an immutable data structure here (we don't need each intermediate string value to be retained and accessible).

At one point UnicodeBuilder "did its own thing" rather than reusing ZenoString, but there was a lot of duplication of code (and bugs) so this was a simplifiication to reduce code and improve reliability, but it's not the best approach for performance.

Probably a rethink of UnicodeBuilder is needed.

Actions #2

Updated by Michael Kay about 2 years ago

I implemented a QuickUnicodeBuilder backed by an int[] array, and use this in place of UnicodeBuilder within the HTMLURLEscaper.

Measured time of my benchmark is down from 227.9ms to 14.6ms.

Clearly a big win for this use case; but I'm not sure the same will be true when we're concatenating existing strings rather than individual characters.

Actions #3

Updated by Michael Kay about 2 years ago

Having established that a different Unicode builder goes much faster, it then occurred to me that there's not really a strong case for using UnicodeString on this path - what happens if we revert to the 10.x code, using String rather than UnicodeString?

Rather to my surprise, the time shoots up again - to 234.6ms.

I can get it down again to 22.5ms simply by replacing the StringBuilder with the new QuickUnicodeBuilder. I have no idea why this should be, but the effect is consistent.

Actions #4

Updated by Michael Kay about 2 years ago

There are about 99 places we create a UnicodeBuilder, and from a quick glance, the vast majority of them fall into the pattern where we're building a fairly short string from individual characters.

I think the next thing to do is test a case like string-join() where we are assembling a string from substrings, to see how the new code compares with the old there.

Doing a string-join of around 48000 20-character strings (with some 2-byte characters but no 3-byte) using the existing UnicodeBuilder takes around 340ms. With the new QuickUnicodeBuilder, it's 12.6ms.

So I'm going to say, let's bite the bullet, and use the new code everywhere.

Actions #5

Updated by Michael Kay about 2 years ago

  • Status changed from New to Resolved
  • Applies to branch 11, trunk added
  • Fix Committed on Branch 11, trunk added

Done. I've also gone through all the places where we create a UnicodeBuilder and added an initial size allocation where feasible.

Noted that there are cases where we know all the characters will be ASCII (e.g when doing duration.getPrimitiveStringValue()) and we could optimize that further with a custom AsciiStringBuilder, but I haven't done that.

Actions #6

Updated by Debbie Lockett about 2 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 11.2 added

Bug fix applied in the Saxon 11.2 maintenance release.

Please register to edit this issue

Also available in: Atom PDF