Bug #6018
UnicodeString - indexWhere does not start from the expected position
Status: closed, 100% done
Description
There is a performance regression when escaping special characters with the UnicodeString class.
In the method writeEscape of the XMLEmitter class,
    while (segstart < clength) {
        // find a maximal sequence of "ordinary" characters
        long found = chars.indexWhere(special, segstart);
that calls the indexWhere method of the UnicodeString class,
    public long indexWhere(IntPredicate predicate, long from) {
        IntIterator iter = codePoints();
        long i = 0;
        while (iter.hasNext()) {
            int ch = iter.next();
            if (i >= from && predicate.test(ch)) {
                return i;
            }
            i++;
        }
        return -1;
    }
the character sequence is scanned from the beginning for each segment, whereas the search should start at position from. This makes escaping O(n^2) in the worst case, which is painful for large text nodes with many escapable characters (e.g. escaped XML).
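To make the cost concrete, here is a minimal sketch that counts how many code points each variant examines. It is not Saxon code: the class IndexWhereCost, its methods, and the use of a plain int[] in place of UnicodeString are assumptions made for illustration, and countVisits is only a simplified stand-in for the writeEscape loop.

```java
import java.util.function.IntPredicate;

// Hypothetical demo class (not part of Saxon): counts code points examined.
final class IndexWhereCost {
    // Number of code points examined across all calls on this instance.
    long visits = 0;

    // Mirrors the reported logic: always iterate from index 0.
    long indexWhereFromStart(int[] codePoints, IntPredicate predicate, long from) {
        long i = 0;
        for (int ch : codePoints) {
            visits++;
            if (i >= from && predicate.test(ch)) {
                return i;
            }
            i++;
        }
        return -1;
    }

    // Mirrors the suggested fix: begin the scan at 'from'.
    long indexWhereFromOffset(int[] codePoints, IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            visits++;
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    // Simplified stand-in for the segment loop: find each special character in turn.
    static long countVisits(int[] codePoints, boolean fromStart) {
        IndexWhereCost cost = new IndexWhereCost();
        IntPredicate special = ch -> ch == '<' || ch == '>' || ch == '&';
        long segstart = 0;
        while (segstart < codePoints.length) {
            long found = fromStart
                    ? cost.indexWhereFromStart(codePoints, special, segstart)
                    : cost.indexWhereFromOffset(codePoints, special, segstart);
            if (found < 0) {
                break;
            }
            segstart = found + 1;
        }
        return cost.visits;
    }
}
```

For a text of 1000 consecutive '&' characters, the from-start variant examines 500500 code points in total, while the from-offset variant examines 1000.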
This is a quick suggestion:
    public long indexWhere(IntPredicate predicate, long from) {
        for (long i = from; i < length(); ++i) {
            int ch = codePointAt(i);
            if (predicate.test(ch)) {
                return i;
            }
        }
        return -1;
    }
Updated by Michael Kay 12 months ago
Good detective work. Thank you.
We have to be careful because for some implementations of UnicodeString, codePointAt() may be expensive. It may be appropriate to have different implementations for different subclasses.
Updated by Michael Kay 12 months ago
In fact it looks to me as if all implementations of UnicodeString other than WhitespaceString and StringView have a custom implementation of the indexWhere method. I find it hard to imagine that WhitespaceString is relevant here, but we need to look at StringView.
Updated by Steven Dürrenmatt 12 months ago
Looks like StringView relies either on BMPString, which has the simplest implementation of codePointAt (ordinary String charAt), or on Twine24, which is also optimized.
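As a rough illustration of why codePointAt can be cheap for these two kinds of storage, here is a sketch of constant-time lookup. The classes BmpSketch and ThreeByteSketch are hypothetical stand-ins, not the actual BMPString or Twine24 code, and the three-bytes-per-code-point layout is an assumption.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-in for a BMP-only string: every char is one code point.
final class BmpSketch {
    private final String value; // assumed to contain no surrogate pairs

    BmpSketch(String value) {
        this.value = value;
    }

    long length() {
        return value.length();
    }

    int codePointAt(long index) {
        return value.charAt((int) index); // O(1): plain charAt
    }

    long indexWhere(IntPredicate predicate, long from) {
        int start = (int) Math.min(Math.max(from, 0), value.length());
        for (int i = start; i < value.length(); i++) {
            if (predicate.test(value.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}

// Hypothetical stand-in for 24-bit storage: three bytes per code point (assumed layout).
final class ThreeByteSketch {
    private final byte[] bytes;

    ThreeByteSketch(String s) {
        int[] cps = s.codePoints().toArray();
        bytes = new byte[cps.length * 3];
        for (int i = 0; i < cps.length; i++) {
            bytes[3 * i] = (byte) (cps[i] >> 16);
            bytes[3 * i + 1] = (byte) (cps[i] >> 8);
            bytes[3 * i + 2] = (byte) cps[i];
        }
    }

    long length() {
        return bytes.length / 3;
    }

    int codePointAt(long index) {
        int p = (int) index * 3; // O(1): fixed-width addressing
        return ((bytes[p] & 0xFF) << 16) | ((bytes[p + 1] & 0xFF) << 8) | (bytes[p + 2] & 0xFF);
    }

    long indexWhere(IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < length(); i++) {
            if (predicate.test(codePointAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

With fixed-width storage like this, starting the scan at from costs nothing extra, so a per-index loop is not a concern for either layout.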
Updated by Michael Kay 12 months ago
I'm making the method in UnicodeString abstract, and thereby ensuring that all subclasses have an appropriate implementation.
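A minimal sketch of what that design could look like, assuming a simplified class hierarchy. SketchString, ArraySketchString and ViewSketchString are hypothetical names, not Saxon classes; the view's delegation strategy is one possible choice and not necessarily what was committed.

```java
import java.util.function.IntPredicate;

// Hypothetical simplified base class: no default indexWhere, so every
// subclass must supply one suited to its own storage.
abstract class SketchString {
    abstract long length();

    abstract int codePointAt(long index);

    abstract long indexWhere(IntPredicate predicate, long from);
}

// Array-backed implementation: scans directly from 'from'.
final class ArraySketchString extends SketchString {
    private final int[] codePoints;

    ArraySketchString(String s) {
        this.codePoints = s.codePoints().toArray();
    }

    @Override
    long length() {
        return codePoints.length;
    }

    @Override
    int codePointAt(long index) {
        return codePoints[(int) index];
    }

    @Override
    long indexWhere(IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1;
    }
}

// View over the range [start, end) of a base string: delegates the search
// to the base and translates positions (an assumed strategy).
final class ViewSketchString extends SketchString {
    private final SketchString base;
    private final long start;
    private final long end;

    ViewSketchString(SketchString base, long start, long end) {
        this.base = base;
        this.start = start;
        this.end = end;
    }

    @Override
    long length() {
        return end - start;
    }

    @Override
    int codePointAt(long index) {
        return base.codePointAt(start + index);
    }

    @Override
    long indexWhere(IntPredicate predicate, long from) {
        long found = base.indexWhere(predicate, start + Math.max(from, 0));
        // A match at or beyond 'end' lies outside the view.
        return (found >= 0 && found < end) ? found - start : -1;
    }
}
```

One trade-off of delegating in the view is that the base search may run past the end of the view before a miss is detected; a subclass that cares about this could scan its own range instead.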
Updated by Michael Kay 12 months ago
- Status changed from New to Resolved
- Assignee set to Michael Kay
- Applies to branch trunk added
- Applies to branch deleted (11)
- Fix Committed on Branch 12, trunk added
- Platforms .NET, Java added
Fixed on the 12.x and main branches; decided not to change 11.x
Updated by O'Neil Delpratt 10 months ago
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in Maintenance Release 12.3 added
Bug fix applied in the Saxon 12.3 maintenance release.