Bug #6018

closed

UnicodeString - indexWhere does not start from the expected position

Added by Steven Dürrenmatt 12 months ago. Updated 10 months ago.

Status: Closed
Priority: Low
Assignee:
Category: Performance
Sprint/Milestone: -
Start date: 2023-05-05
Due date:
% Done: 100%
Estimated time:
Legacy ID:
Applies to branch: 12, trunk
Fix Committed on Branch: 12, trunk
Fixed in Maintenance Release:
Platforms: .NET, Java

Description

There is a performance regression when escaping special characters with the UnicodeString class.

In the method writeEscape of the XMLEmitter class,

while (segstart < clength) {
    // find a maximal sequence of "ordinary" characters
    long found = chars.indexWhere(special, segstart);

that calls the indexWhere method of the UnicodeString class,

public long indexWhere(IntPredicate predicate, long from) {
    IntIterator iter = codePoints();
    long i = 0;
    while (iter.hasNext()) {
        int ch = iter.next();
        if (i >= from && predicate.test(ch)) {
            return i;
        }
        i++;
    }
    return -1;
}

the character sequence is searched from the beginning for each segment, whereas the search should start at the position given by the from argument. This is O(n²) and painful for large text nodes with many escapable characters (e.g. escaped XML).
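To make the cost concrete, here is a minimal sketch that counts how many code points each strategy visits when a writeEscape-style loop walks a text made entirely of special characters. Everything in it (the IndexWhereCost class, its method names, the int[] representation) is hypothetical and only imitates the pattern described above; it is not Saxon code.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical demo class, not part of Saxon.
class IndexWhereCost {
    // Number of code points inspected since the last reset.
    static long visited = 0;

    @FunctionalInterface
    interface Search {
        long find(int[] cps, IntPredicate predicate, long from);
    }

    // Imitates the reported behaviour: every call iterates from position 0.
    static long indexWhereFromStart(int[] cps, IntPredicate predicate, long from) {
        for (int i = 0; i < cps.length; i++) {
            visited++;
            if (i >= from && predicate.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Imitates the suggested behaviour: iteration begins at `from`.
    static long indexWhereFromOffset(int[] cps, IntPredicate predicate, long from) {
        for (int i = (int) from; i < cps.length; i++) {
            visited++;
            if (predicate.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Repeatedly finds the next special character and resumes just after it,
    // returning the total number of code points inspected along the way.
    static long totalVisits(int[] cps, IntPredicate special, Search search) {
        visited = 0;
        long segstart = 0;
        while (segstart < cps.length) {
            long found = search.find(cps, special, segstart);
            if (found < 0) {
                break;
            }
            segstart = found + 1;
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] cps = new int[1000];
        Arrays.fill(cps, '<'); // worst case: every character needs escaping
        IntPredicate special = c -> c == '<' || c == '&' || c == '>';
        // n(n+1)/2 visits for n = 1000
        System.out.println(totalVisits(cps, special, IndexWhereCost::indexWhereFromStart));  // prints 500500
        // one visit per character
        System.out.println(totalVisits(cps, special, IndexWhereCost::indexWhereFromOffset)); // prints 1000
    }
}
```

With n special characters the scan-from-zero variant inspects n(n+1)/2 code points in total, while the scan-from-offset variant inspects n.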

This is a quick suggestion:

public long indexWhere(IntPredicate predicate, long from) {
    for (long i = from; i < length(); ++i) {
        int ch = codePointAt(i);
        if (predicate.test(ch)) {
            return i;
        }
    }
    return -1;
}
Actions #1

Updated by Michael Kay 12 months ago

Good detective work. Thank you.

We have to be careful because for some implementations of UnicodeString, codePointAt() may be expensive. It may be appropriate to have different implementations for different subclasses.

Actions #2

Updated by Michael Kay 12 months ago

In fact it looks to me as if all implementations of UnicodeString other than WhitespaceString and StringView have a custom implementation of the indexWhere method. I find it hard to imagine that WhitespaceString is relevant here, but we need to look at StringView.

Actions #3

Updated by Steven Dürrenmatt 12 months ago

Looks like StringView either relies on BMPString, which has the simplest implementation of codePointAt (ordinary String charAt), or Twine24, which is also optimized.

Actions #4

Updated by Michael Kay 12 months ago

I'm making the method in UnicodeString abstract, and thereby ensuring that all subclasses have an appropriate implementation.
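As a rough illustration of that design, the sketch below declares indexWhere as abstract so that each subclass must supply a search suited to its own storage. The class names CodePointText, BmpText and WideText are hypothetical stand-ins chosen for this example; they are not Saxon's classes and do not reflect the committed fix.

```java
import java.util.function.IntPredicate;

// Hypothetical base class, standing in for a code-point-indexed string type.
abstract class CodePointText {
    abstract long length();

    // No default implementation: a subclass cannot silently inherit
    // a scan that ignores its internal representation.
    abstract long indexWhere(IntPredicate predicate, long from);
}

// Assumes every character is in the Basic Multilingual Plane, so one char
// is one code point and charAt gives direct access by position.
final class BmpText extends CodePointText {
    private final String chars;

    BmpText(String chars) {
        this.chars = chars;
    }

    @Override
    long length() {
        return chars.length();
    }

    @Override
    long indexWhere(IntPredicate predicate, long from) {
        for (int i = (int) Math.max(from, 0); i < chars.length(); i++) {
            if (predicate.test(chars.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}

// Stores one int per code point, so non-BMP characters are also
// directly addressable by position.
final class WideText extends CodePointText {
    private final int[] codePoints;

    WideText(String s) {
        this.codePoints = s.codePoints().toArray();
    }

    @Override
    long length() {
        return codePoints.length;
    }

    @Override
    long indexWhere(IntPredicate predicate, long from) {
        for (int i = (int) Math.max(from, 0); i < codePoints.length; i++) {
            if (predicate.test(codePoints[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

For example, new BmpText("a<b&c").indexWhere(c -> c == '<' || c == '&', 2) returns 3, because the search begins at position 2 and skips the earlier match at position 1.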

Actions #5

Updated by Michael Kay 12 months ago

  • Status changed from New to Resolved
  • Assignee set to Michael Kay
  • Applies to branch trunk added
  • Applies to branch deleted (11)
  • Fix Committed on Branch 12, trunk added
  • Platforms .NET, Java added

Fixed on the 12.x and main branches; decided not to change 11.x

Actions #6

Updated by O'Neil Delpratt 10 months ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 12.3 added

Bug fix applied in the Saxon 12.3 maintenance release.
