Bug #6018 (closed)
UnicodeString - indexWhere does not start from the expected position
% Done: 100%
Description
There is a performance regression when escaping special characters with the UnicodeString class. In the method writeEscape of the XMLEmitter class,
while (segstart < clength) {
    // find a maximal sequence of "ordinary" characters
    long found = chars.indexWhere(special, segstart);
that calls the indexWhere method of the UnicodeString class,
public long indexWhere(IntPredicate predicate, long from) {
    IntIterator iter = codePoints();
    long i = 0;
    while (iter.hasNext()) {
        int ch = iter.next();
        if (i >= from && predicate.test(ch)) {
            return i;
        }
        i++;
    }
    return -1;
}
the character sequence is searched from the beginning for each segment, whereas the search should start at position from. This is O(n²) and painful for large text nodes with many escapable characters (e.g. escaped XML).
This is a quick suggestion:
public long indexWhere(IntPredicate predicate, long from) {
    for (long i = from; i < length(); ++i) {
        int ch = codePointAt(i);
        if (predicate.test(ch)) {
            return i;
        }
    }
    return -1;
}
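To make the cost difference concrete, here is a minimal sketch that counts how many code points each variant visits when driven by a writeEscape-style loop. It is not Saxon code: the class IndexWhereCostDemo, its methods, and the plain int[] standing in for UnicodeString are all hypothetical names used only for illustration.

```java
import java.util.function.IntPredicate;

// Hypothetical demo class; an int[] of code points stands in for UnicodeString.
final class IndexWhereCostDemo {
    // Number of code points visited since the last reset.
    static long steps = 0;

    // Mirrors the reported behaviour: every call scans from position 0.
    static long indexWhereFromStart(int[] cps, IntPredicate p, long from) {
        for (int i = 0; i < cps.length; i++) {
            steps++;
            if (i >= from && p.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Mirrors the suggestion: the scan begins at 'from'.
    static long indexWhereFromOffset(int[] cps, IntPredicate p, long from) {
        if (from >= cps.length) {
            return -1;
        }
        for (int i = (int) Math.max(from, 0L); i < cps.length; i++) {
            steps++;
            if (p.test(cps[i])) {
                return i;
            }
        }
        return -1;
    }

    // Walks the input segment by segment, as the writeEscape loop does,
    // and returns the total number of code points visited.
    static long countSteps(int[] cps, IntPredicate special, boolean fromStart) {
        steps = 0;
        long segstart = 0;
        while (segstart < cps.length) {
            long found = fromStart
                    ? indexWhereFromStart(cps, special, segstart)
                    : indexWhereFromOffset(cps, special, segstart);
            if (found < 0) {
                break;
            }
            segstart = found + 1;
        }
        return steps;
    }
}
```

For an input of 1000 consecutive '<' characters, the restart-from-zero variant visits 1 + 2 + ... + 1000 = 500500 code points, while the offset variant visits 1000.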
Updated by Michael Kay over 1 year ago
Good detective work. Thank you.
We have to be careful because for some implementations of UnicodeString, codePointAt() may be expensive. It may be appropriate to have different implementations for different subclasses.
Updated by Michael Kay over 1 year ago
In fact it looks to me as if all implementations of UnicodeString other than WhitespaceString and StringView have a custom implementation of the indexWhere method. I find it hard to imagine that WhitespaceString is relevant here, but we need to look at StringView.
Updated by Steven Dürrenmatt over 1 year ago
Looks like StringView relies on either BMPString, which has the simplest implementation of codePointAt (an ordinary String charAt), or Twine24, which is also optimized.
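As a rough illustration of why a charAt-backed string makes an offset-based search cheap, here is a sketch of a wrapper that assumes its content has no surrogate pairs, so each char is one code point. The class BmpOnlyText is hypothetical and is not Saxon's BMPString; it only shows the shape such an indexWhere might take.

```java
import java.util.function.IntPredicate;

// Hypothetical wrapper, assuming the string contains only BMP characters
// (no surrogate pairs), so char index and code point index coincide.
final class BmpOnlyText {
    private final String s;

    BmpOnlyText(String s) {
        this.s = s;
    }

    long length() {
        return s.length();
    }

    // Constant-time access: one char is one code point under the BMP assumption.
    int codePointAt(long index) {
        return s.charAt((int) index);
    }

    // Scans forward from 'from'; never revisits earlier positions.
    long indexWhere(IntPredicate predicate, long from) {
        if (from >= s.length()) {
            return -1;
        }
        for (int i = (int) Math.max(from, 0L); i < s.length(); i++) {
            if (predicate.test(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

With the text "a<b&c" and a predicate matching '<' or '&', a search from 0 returns 1, a search from 2 returns 3, and a search from 4 returns -1.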
Updated by Michael Kay over 1 year ago
I'm making the method in UnicodeString abstract, and thereby ensuring that all subclasses have an appropriate implementation.
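The effect of an abstract method can be sketched as follows: with no default body in the base class, a subclass that omits indexWhere does not compile, so each one has to provide a search suited to its own storage. The names SketchUnicodeText and ArrayBackedText are hypothetical and do not reflect the actual Saxon class hierarchy or the committed fix.

```java
import java.util.function.IntPredicate;

// Hypothetical base class; the real UnicodeString has a much larger API.
abstract class SketchUnicodeText {
    abstract long length();

    abstract int codePointAt(long index);

    // No default body: the compiler rejects any concrete subclass without one.
    abstract long indexWhere(IntPredicate predicate, long from);
}

// Hypothetical subclass storing one int per code point, so random access is cheap.
final class ArrayBackedText extends SketchUnicodeText {
    private final int[] codePoints;

    ArrayBackedText(String s) {
        this.codePoints = s.codePoints().toArray();
    }

    @Override
    long length() {
        return codePoints.length;
    }

    @Override
    int codePointAt(long index) {
        return codePoints[(int) index];
    }

    // Starts at 'from' and indexes the array directly.
    @Override
    long indexWhere(IntPredicate predicate, long from) {
        if (from >= codePoints.length) {
            return -1;
        }
        for (int i = (int) Math.max(from, 0L); i < codePoints.length; i++) {
            if (predicate.test(codePoints[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

A subclass whose codePointAt is expensive could instead implement indexWhere with whatever traversal is cheapest for its representation; the abstract declaration only guarantees that the choice is made explicitly.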
Updated by Michael Kay over 1 year ago
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch trunk added
- Applies to branch deleted (11)
- Fix Committed on Branch 12, trunk added
- Platforms .NET, Java added
Fixed on the 12.x and main branches; decided not to change 11.x
Updated by O'Neil Delpratt over 1 year ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.3 added
Bug fix applied in the Saxon 12.3 maintenance release.