Bug #6018: UnicodeString - indexWhere does not start from the expected position - Saxon - Saxonica Developer Community

Added by Steven Dürrenmatt over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Low
Assignee: Michael Kay
Category: Performance
Start date: 2023-05-05
% Done: 100%
Applies to branch: 12, trunk
Fix Committed on Branch: 12, trunk
Fixed in Maintenance Release: 12.3
Platforms: .NET, Java

Description

There is a performance regression when escaping special characters with the UnicodeString class.

In the method writeEscape of the XMLEmitter class,

while (segstart < clength) {
    // find a maximal sequence of "ordinary" characters
    long found = chars.indexWhere(special, segstart);

that calls the indexWhere method of the UnicodeString class,

public long indexWhere(IntPredicate predicate, long from) {
    IntIterator iter = codePoints();
    long i = 0;
    while (iter.hasNext()) {
        int ch = iter.next();
        if (i >= from && predicate.test(ch)) {
            return i;
        }
        i++;
    }
    return -1;
}

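the character sequence is searched from the beginning for each segment, whereas it should be searched starting at the position given by the from argument. Because every call rescans the prefix that has already been processed, the overall cost is O(n²), which is painful for large text nodes with many escapable characters (e.g. escaped XML).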

This is a quick suggestion:

public long indexWhere(IntPredicate predicate, long from) {
    for (long i = from; i < length(); ++i) {
        int ch = codePointAt(i);
        if (predicate.test(ch)) {
            return i;
        }
    }
    return -1;
}
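
To make the difference concrete, here is a minimal, self-contained sketch that counts how many code points each variant visits when driven by a loop shaped like the one in writeEscape. It works on a plain int[] of code points; the class IndexWhereDemo and all of its method names are hypothetical and are not part of Saxon.

```java
import java.util.function.IntPredicate;

// Hypothetical demo class; it only imitates the two indexWhere variants above.
final class IndexWhereDemo {
    static long steps = 0; // number of code points visited, for comparison only

    // Imitates the reported behaviour: always iterates from index 0.
    static long indexWhereFromStart(int[] cps, IntPredicate p, long from) {
        for (int i = 0; i < cps.length; i++) {
            steps++;
            if (i >= from && p.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Imitates the suggestion: begins the scan at 'from'.
    static long indexWhereFromPosition(int[] cps, IntPredicate p, long from) {
        for (int i = (int) Math.max(from, 0); i < cps.length; i++) {
            steps++;
            if (p.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Drives either variant the way a segment-by-segment escaping loop would,
    // and returns the total number of code points visited.
    static long countSteps(int[] cps, IntPredicate special, boolean fromStart) {
        steps = 0;
        long segstart = 0;
        while (segstart < cps.length) {
            long found = fromStart
                    ? indexWhereFromStart(cps, special, segstart)
                    : indexWhereFromPosition(cps, special, segstart);
            if (found < 0) {
                break; // no further special characters
            }
            segstart = found + 1; // continue after the special character
        }
        return steps;
    }
}
```

For an input of 1000 code points that are all special, the from-start variant visits 500500 code points in total, while the from-position variant visits 1000.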

Updated by Michael Kay over 1 year ago

Good detective work. Thank you.

We have to be careful because for some implementations of UnicodeString, codePointAt() may be expensive. It may be appropriate to have different implementations for different subclasses.
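
As a rough illustration of what "different implementations for different subclasses" could look like, the sketch below uses a hypothetical interface CodePointSeq with two hypothetical implementations. None of these names come from Saxon, and the real UnicodeString API differs. The idea is simply that a sequential default suits representations where random access is costly, while a representation with O(1) access can override it and start directly at the requested position.

```java
import java.util.PrimitiveIterator;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

// Hypothetical interface for illustration only.
interface CodePointSeq {
    long length();

    PrimitiveIterator.OfInt codePoints();

    // Generic fallback: a single sequential pass that needs no random access.
    default long indexWhere(IntPredicate predicate, long from) {
        PrimitiveIterator.OfInt iter = codePoints();
        long i = 0;
        while (iter.hasNext()) {
            int ch = iter.nextInt();
            if (i >= from && predicate.test(ch)) {
                return i;
            }
            i++;
        }
        return -1;
    }
}

// Backed by an int[] of code points: random access is O(1),
// so the override can begin at 'from'.
final class ArraySeq implements CodePointSeq {
    private final int[] cps;

    ArraySeq(String s) {
        this.cps = s.codePoints().toArray();
    }

    public long length() {
        return cps.length;
    }

    public PrimitiveIterator.OfInt codePoints() {
        return IntStream.of(cps).iterator();
    }

    @Override
    public long indexWhere(IntPredicate predicate, long from) {
        for (int i = (int) Math.max(from, 0); i < cps.length; i++) {
            if (predicate.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }
}

// Backed by a UTF-16 String that may contain surrogate pairs: locating the
// n-th code point requires a scan, so this class keeps the sequential default.
final class Utf16Seq implements CodePointSeq {
    private final String s;

    Utf16Seq(String s) {
        this.s = s;
    }

    public long length() {
        return s.codePointCount(0, s.length());
    }

    public PrimitiveIterator.OfInt codePoints() {
        return s.codePoints().iterator();
    }
}
```

Both implementations return the same indexes; only the cost per call differs, which is why the choice may reasonably be made per subclass.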

Updated by Michael Kay over 1 year ago

In fact it looks to me as if all implementations of UnicodeString other than WhitespaceString and StringView have a custom implementation of the indexWhere method. I find it hard to imagine that WhitespaceString is relevant here, but we need to look at StringView.

Updated by Steven Dürrenmatt over 1 year ago

It looks like StringView relies either on BMPString, which has the simplest implementation of codePointAt (an ordinary String charAt), or on Twine24, which is also optimized.
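
To illustrate why the charAt case is cheap: when a Java String contains no surrogates, every code point occupies exactly one char, so the code point index equals the char index. The sketch below shows only that general property of UTF-16 strings. The class BmpOnlyDemo and its methods are hypothetical and say nothing about how BMPString is actually written.

```java
// Hypothetical demo class, not Saxon code.
final class BmpOnlyDemo {
    // True when the string contains no UTF-16 surrogate code units,
    // i.e. every character lies in the Basic Multilingual Plane.
    static boolean isBmpOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Constant-time lookup; valid only under the BMP-only assumption,
    // which the caller is expected to have checked.
    static int codePointAt(String s, int index) {
        return s.charAt(index);
    }
}
```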
