Bug #4625

closed

Representation of xs:base64Binary and xs:hexBinary

Added by Michael Kay almost 4 years ago. Updated about 3 years ago.

Status: Closed
Priority: Low
Assignee:
Category: Performance
Sprint/Milestone: -
Start date: 2020-06-30
Due date:
% Done: 100%
Estimated time:
Applies to JS Branch: 2
Fix Committed on JS Branch: 2
Fixed in JS Release:
SEF Generated with:
Platforms:
Company: -
Contact person: -
Additional contact persons: -

Description

Binary values are represented internally as character strings using codepoints 0-255 to represent octet values. This is inefficient (a character occupies two octets); these days the obvious representation is Uint8Array.
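To make the two representations concrete, here is a minimal sketch of converting between them. The function names are hypothetical and are not taken from the Saxon-JS source; they only illustrate the octet-per-character form described above and its Uint8Array equivalent.

```javascript
// Hypothetical helper: convert the legacy form (one character per octet,
// codepoints 0-255) into a Uint8Array.
function octetStringToBytes(octetString) {
  const bytes = new Uint8Array(octetString.length);
  for (let i = 0; i < octetString.length; i++) {
    const code = octetString.charCodeAt(i);
    if (code > 0xff) {
      // A character above 255 cannot stand for an octet.
      throw new RangeError(`Not an octet at index ${i}: ${code}`);
    }
    bytes[i] = code;
  }
  return bytes;
}

// Hypothetical helper: the reverse conversion, back to the legacy form.
function bytesToOctetString(bytes) {
  let s = "";
  for (const b of bytes) {
    s += String.fromCharCode(b);
  }
  return s;
}

const legacy = "\u00de\u00ad\u00be\u00ef"; // the octets DE AD BE EF
const bytes = octetStringToBytes(legacy);
const roundTripped = bytesToOctetString(bytes);
```

With the typed array, each octet takes exactly one byte of storage (`bytes.byteLength` equals the number of octets).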

Actions #1

Updated by Michael Kay almost 4 years ago

  • Status changed from New to Resolved
  • Fix Committed on JS Branch 2.0 added

Changed the representation to Uint8Array, which will also be more convenient for manipulation when it comes to implementing the EXPath binary module.
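As a rough illustration of why Uint8Array is convenient for this kind of manipulation, the sketch below shows slicing, joining and hex serialization. The helpers `part`, `join` and `toHex` are hypothetical names, loosely modelled on the kind of operations a binary module needs; they are not the actual Saxon-JS implementation.

```javascript
// Hypothetical helper: extract `size` octets starting at `offset` (returns a copy).
function part(bytes, offset, size) {
  return bytes.slice(offset, offset + size);
}

// Hypothetical helper: concatenate several binary values into one.
function join(chunks) {
  const total = chunks.reduce((n, chunk) => n + chunk.length, 0);
  const out = new Uint8Array(total);
  let pos = 0;
  for (const chunk of chunks) {
    out.set(chunk, pos);
    pos += chunk.length;
  }
  return out;
}

// Hypothetical helper: serialize as upper-case hex digits, two per octet.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0"))
    .join("")
    .toUpperCase();
}

const data = Uint8Array.of(0xde, 0xad, 0xbe, 0xef);
const middle = toHex(part(data, 1, 2));
const rejoined = toHex(join([part(data, 0, 2), part(data, 2, 2)]));
```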

Actions #2

Updated by Michael Kay almost 4 years ago

Implementing this revealed a problem with collation keys; I strongly suspect the existing implementation isn't fully conformant, at least in ensuring that the ordering of collation keys corresponds to the ordering of the corresponding strings. (The spec is pretty confused about whether collation keys allow ordering, but I think the intended behaviour is that they do).

Internally, our collation objects have a collationKey() function that returns a string, but fn:collation-key() returns an xs:base64Binary value. I think that we're just using the 16-bit codepoints of the string as "octets" in the internal representation of the base64Binary value, which means that converting the collation key to a string is likely to crash the base64 encoding logic.
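The mismatch can be sketched as follows. This is an illustration only: `codeUnits` is a hypothetical helper, and the key string is simply assumed to be the input string itself. Any character above U+00FF yields a 16-bit value that is not a valid octet, and forcing it into a Uint8Array silently keeps only the low 8 bits.

```javascript
// Hypothetical helper: the 16-bit code units of a string, as numbers.
function codeUnits(key) {
  const units = [];
  for (let i = 0; i < key.length; i++) {
    units.push(key.charCodeAt(i));
  }
  return units;
}

// Assumed collation key string containing a non-Latin-1 character (U+20AC).
const units = codeUnits("a\u20ac");

// Values that cannot be treated as octets.
const outOfRange = units.filter((u) => u > 0xff);

// Storing the code units in a Uint8Array truncates them modulo 256,
// so U+20AC becomes 0xAC and the key no longer identifies the string.
const truncated = Uint8Array.from(units);
```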

So what should we do instead? The obvious answer is UTF-8 encoding. It seems that "true" UTF-8 (where each codepoint, including those above U+FFFF, is encoded directly rather than as a pair of surrogates) preserves codepoint collation ordering. But how do we do the encoding? The JavaScript TextEncoder appears to be the recommended way. So collation keys for strings (assuming codepoint collation) are now the UTF-8 encoding of the string.
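A minimal sketch of this approach, assuming codepoint collation. The names `codepointCollationKey` and `compareBytes` are hypothetical; only `TextEncoder` is the standard API. The example pair is chosen so that comparison by 16-bit code units (what JavaScript's `<` does on strings) disagrees with codepoint order, while comparison of the UTF-8 keys agrees with it.

```javascript
// Hypothetical helper: collation key under codepoint collation,
// taken to be the UTF-8 encoding of the string.
function codepointCollationKey(s) {
  return new TextEncoder().encode(s);
}

// Hypothetical helper: compare two keys octet by octet.
// Returns -1, 0 or 1.
function compareBytes(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) {
      return a[i] < b[i] ? -1 : 1;
    }
  }
  return Math.sign(a.length - b.length);
}

const low = "\uff5e";      // U+FF5E, a single code unit
const high = "\u{1f600}";  // U+1F600, stored as the surrogate pair D83D DE00

// Code-unit comparison gets the order wrong: 0xFF5E > 0xD83D.
const codeUnitOrderSaysLowFirst = low < high;

// UTF-8 keys: EF BD 9E versus F0 9F 98 80, so `low` sorts first.
const keyOrder = compareBytes(
  codepointCollationKey(low),
  codepointCollationKey(high)
);
```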

Actions #3

Updated by Community Admin about 3 years ago

  • Applies to JS Branch 2 added
  • Applies to JS Branch deleted (2.0)

Actions #4

Updated by Community Admin about 3 years ago

  • Fix Committed on JS Branch 2 added
  • Fix Committed on JS Branch deleted (2.0)

Actions #5

Updated by Debbie Lockett about 3 years ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in JS Release set to Saxon-JS 2.1

Bug fix applied in the Saxon-JS 2.1 maintenance release.
