Bug #4625
closed
Representation of xs:base64Binary and xs:hexBinary
Fix Committed on JS Branch: 2
Description
Binary values are represented internally as character strings using codepoints 0-255 to represent octet values. This is inefficient (a character occupies two octets); these days the obvious representation is Uint8Array.
- Status changed from New to Resolved
- Fix Committed on JS Branch 2.0 added
Changed the representation to Uint8Array (which is also more convenient for manipulation -- when it comes to implementing the EXPath binary module).
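As a rough illustration of why a Uint8Array is convenient here, the following sketch converts between the xs:hexBinary lexical form and an octet array. The helper names `hexToBytes` and `bytesToHex` are hypothetical and are not taken from the Saxon-JS code base.

```javascript
// Hypothetical helpers for illustration; not the Saxon-JS internals.

// Parse an xs:hexBinary lexical form into a Uint8Array of octets.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new RangeError("Invalid xs:hexBinary lexical form: " + hex);
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// Serialize octets back to upper-case hex digits.
function bytesToHex(bytes) {
  let out = "";
  for (const b of bytes) {
    out += b.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

const value = hexToBytes("00ff7F");
console.log(value.length);      // 3
console.log(bytesToHex(value)); // "00FF7F"
```

Each octet occupies exactly one element, so slicing, concatenation and comparison can work directly on the array without any character-to-octet mapping.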
Implementing this revealed a problem with collation keys; I strongly suspect the existing implementation isn't fully conformant, at least in ensuring that the ordering of collation keys corresponds to the ordering of the corresponding strings. (The spec is pretty confused about whether collation keys allow ordering, but I think the intended behaviour is that they do.)
Internally, our collation objects have a collationKey() function that returns a string, but fn:collation-key() returns an xs:base64Binary value. I think that we're just using the 16-bit codepoints of the string as "octets" in the internal representation of the base64Binary value, which means that converting the collation key to a string is likely to crash the base64 encoding logic.
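The suspected problem can be sketched as follows, assuming the old scheme really did reuse UTF-16 code units as octet values (the comment above only suspects this). The helper names are hypothetical. Any character above U+00FF yields a value that is not a valid octet.

```javascript
// Illustrative only: these helper names are hypothetical, not Saxon-JS internals.

// Collect the UTF-16 code units of a string
// (the values the old scheme is suspected of treating as "octets").
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// A sequence is usable as octets only if every value fits in 0-255.
function allOctets(units) {
  return units.every((u) => u >= 0 && u <= 255);
}

const asciiKey = codeUnits("abc");    // [97, 98, 99]
const euroKey = codeUnits("a\u20AC"); // [97, 8364]

console.log(allOctets(asciiKey)); // true
console.log(allOctets(euroKey));  // false: 0x20AC does not fit in one octet

// Forcing the values into a Uint8Array silently keeps only the low 8 bits.
console.log(Array.from(Uint8Array.from(euroKey))); // [97, 172]
```

A base64 encoder that assumes each input value is below 256 has no correct output for the second key, which is consistent with the failure described above.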
So what should we do instead? The obvious answer is UTF-8 encoding. It seems that "true" UTF-8 preserves codepoint ordering, so comparing the encoded octets gives the same result as comparing the strings under the codepoint collation. But how do we do the encoding? The JavaScript TextEncoder API appears to be the recommended way. So collation keys for strings (assuming the codepoint collation) are now the UTF-8 encoding of the string.
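A minimal sketch of this approach, assuming the codepoint collation, is shown below. The function names `collationKey` and `compareBytes` are hypothetical and do not claim to match the actual Saxon-JS implementation. The example pair is chosen because JavaScript's built-in string comparison works on UTF-16 code units, which disagrees with codepoint order for characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical sketch; names are illustrative, not the Saxon-JS API.

// Collation key under the codepoint collation: the UTF-8 octets of the string.
// Note: TextEncoder replaces unpaired surrogates with U+FFFD.
function collationKey(s) {
  return new TextEncoder().encode(s);
}

// Octet-by-octet comparison of two keys; returns -1, 0 or 1.
function compareBytes(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) {
      return a[i] < b[i] ? -1 : 1;
    }
  }
  return Math.sign(a.length - b.length);
}

const bmp = "\uFF5E";       // U+FF5E, one UTF-16 code unit
const astral = "\u{1F600}"; // U+1F600, a surrogate pair in UTF-16

// UTF-16 code unit order puts the surrogate pair first (0xD83D < 0xFF5E).
console.log(bmp < astral); // false

// UTF-8 octet order agrees with codepoint order (0xEF < 0xF0).
console.log(compareBytes(collationKey(bmp), collationKey(astral))); // -1
```

Because the key is a Uint8Array, it can be wrapped directly as an xs:base64Binary value with no further conversion.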
- Applies to JS Branch 2 added
- Applies to JS Branch deleted (2.0)
- Fix Committed on JS Branch 2 added
- Fix Committed on JS Branch deleted (2.0)
- Status changed from Resolved to Closed
- % Done changed from 0 to 100
- Fixed in JS Release set to Saxon-JS 2.1
Bug fix applied in the Saxon-JS 2.1 maintenance release.