Bug #6018: UnicodeString - indexWhere does not start from the expected position - Saxon - Saxonica Developer Community

Bug #6018

closed

UnicodeString - indexWhere does not start from the expected position

Added by Steven Dürrenmatt over 1 year ago. Updated over 1 year ago.

Status:

Closed

Priority:

Low

Assignee:

Michael Kay

Category:

Performance

Sprint/Milestone:

Start date:

2023-05-05

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

12, trunk

Fix Committed on Branch:

12, trunk

Fixed in Maintenance Release:

12.3

Platforms:

.NET, Java

Description

There is a performance regression when escaping special characters with the UnicodeString class.

In the method writeEscape of the XMLEmitter class,

while (segstart < clength) {
    // find a maximal sequence of "ordinary" characters
    long found = chars.indexWhere(special, segstart);

that calls the indexWhere method of the UnicodeString class,

public long indexWhere(IntPredicate predicate, long from) {
    IntIterator iter = codePoints();
    long i = 0;
    while (iter.hasNext()) {
        int ch = iter.next();
        if (i >= from && predicate.test(ch)) {
            return i;
        }
        i++;
    }
    return -1;
}

the character sequence is scanned from the beginning for every segment, whereas the scan should start at position from. Each call therefore costs O(n) even when from is close to the end, so escaping a whole text node is O(n²). This is painful for large text nodes with many escapable characters (e.g. escaped XML).
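To make the cost concrete, here is a minimal sketch that counts how many code points each strategy visits in a writeEscape-style loop. It uses a plain int[] as a stand-in for the string; the class IndexWhereCostDemo and all of its methods are hypothetical names for illustration and are not part of Saxon.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a code point sequence; this is not Saxon's UnicodeString. */
final class IndexWhereCostDemo {

    /** Running count of code points visited by the searches below. */
    static long visited;

    /** Mirrors the reported behaviour: always iterates from index 0. */
    static long indexWhereFromStart(int[] cps, IntPredicate predicate, long from) {
        for (int i = 0; i < cps.length; i++) {
            visited++;
            if (i >= from && predicate.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    /** Starts iterating at 'from', so already-searched code points are not revisited. */
    static long indexWhereFromOffset(int[] cps, IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < cps.length; i++) {
            visited++;
            if (predicate.test(cps[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    /** Runs a segment-by-segment search loop and returns the total number of visits. */
    static long totalVisited(int[] cps, boolean fromStart) {
        visited = 0;
        IntPredicate special = c -> c == '<' || c == '>' || c == '&';
        long segstart = 0;
        while (segstart < cps.length) {
            long found = fromStart
                    ? indexWhereFromStart(cps, special, segstart)
                    : indexWhereFromOffset(cps, special, segstart);
            if (found < 0) {
                break; // no further special characters
            }
            segstart = found + 1;
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] cps = new int[1000];
        Arrays.fill(cps, '&'); // worst case: every code point needs escaping
        System.out.println("from start:  " + totalVisited(cps, true));  // 500500
        System.out.println("from offset: " + totalVisited(cps, false)); // 1000
    }
}
```

With 1000 escapable characters, the from-start variant visits 1000 × 1001 / 2 = 500500 code points, against 1000 for the from-offset variant.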

This is a quick suggestion:

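public long indexWhere(IntPredicate predicate, long from) {
    // Start at the requested position instead of iterating from index 0.
    // A negative 'from' is clamped to 0, matching the behaviour of the original loop.
    long len = length();
    for (long i = Math.max(from, 0); i < len; i++) {
        int ch = codePointAt(i);
        if (predicate.test(ch)) {
            return i;
        }
    }
    return -1; // no code point at or after 'from' satisfies the predicate
}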
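For context, the following sketch shows how a from-aware indexWhere fits into a segment-wise escaping loop of the kind quoted from writeEscape above. It is an illustration only: the class EscapeSegmentsDemo, its methods, and the choice of three escaped characters are assumptions, and the code works on an int[] of code points rather than on Saxon's UnicodeString.

```java
import java.util.function.IntPredicate;

/** Hypothetical illustration of segment-wise escaping; not Saxon's XMLEmitter. */
final class EscapeSegmentsDemo {

    /** Returns the first index >= from whose code point matches, or -1 if none does. */
    static long indexWhere(int[] cps, IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < cps.length; i++) {
            if (predicate.test(cps[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    /** Copies runs of ordinary characters unchanged and escapes '<', '>' and '&'. */
    static String escape(String text) {
        int[] cps = text.codePoints().toArray();
        IntPredicate special = c -> c == '<' || c == '>' || c == '&';
        StringBuilder out = new StringBuilder();
        long segstart = 0;
        while (segstart < cps.length) {
            // Find a maximal sequence of "ordinary" characters starting at segstart.
            long found = indexWhere(cps, special, segstart);
            long segend = found < 0 ? cps.length : found;
            for (long i = segstart; i < segend; i++) {
                out.appendCodePoint(cps[(int) i]);
            }
            if (found < 0) {
                break; // the rest of the text was ordinary
            }
            switch (cps[(int) found]) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append("&amp;"); break;
            }
            segstart = found + 1; // resume the search just after the escaped character
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && c > d")); // a &lt; b &amp;&amp; c &gt; d
    }
}
```

Because each search resumes at segstart, every code point is examined once across the whole loop. The same holds for the suggested fix only if codePointAt(i) is a constant-time operation for the string implementation in use.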


Updated by Michael Kay over 1 year ago

Updated by Michael Kay over 1 year ago

Updated by Steven Dürrenmatt over 1 year ago

Updated by Michael Kay over 1 year ago

Updated by Michael Kay over 1 year ago

Updated by O'Neil Delpratt over 1 year ago