SaxonJS 1.0.0 base64_decode doesn't use UTF-8

Added by Johan Walters almost 8 years ago

We store our SEF files in base64url encoding, so that they can be stored inside a JSON file.

When performing base64 decoding using UTF-8, the checksum PI comes out properly (@< ? * 1913af46 ? > @) [where *=sigma]. However, SaxonJS finds <@?Σ 1913af46?>@, which would be valid according to base64 decoding based on ISO-8859-1, ISO-8859-15 or Windows-1252.

I think SaxonJS should change its decoding algorithm to use UTF-8 instead.

(We will try performing our base64 encoding with ISO-8859-1 as a workaround).

Kind regards,

Johan Walters


Replies (4)


RE: SaxonJS 1.0.0 base64_decode doesn't use UTF-8 - Added by Michael Kay almost 8 years ago

I've tried to edit the post so the relevant snippets are visible - sorry if I've destroyed something in the process.

Saxon exports the SEF file in UTF-8 encoding, so the "sigma" character is represented by the two octets CEA3. If you encode that octet stream as base64 and then decode it again as an octet stream, you should get the same octets back, CEA3. If you get anything different, then you've done it wrong.

It looks to me as if you have misunderstood what the Saxon-JS function base64_decode does (it was written for internal use so perhaps it's not well documented). The function takes a base64 string and turns it into a sequence of octets, where the sequence of octets is represented as a JavaScript string using characters in the range 0-255 to represent individual octets. This is the representation of binary values used by the XdmBinary class, and it's not intended to be exposed externally. If you encode the original UTF-8 sigma character as base64, and then use base64_decode to decode it, it will come back as the two octets CEA3. If you then (mis-)interpret that string as an ordinary JavaScript string, it represents something entirely different: the two characters 00CE, 00A3, that is capital-I-with-circumflex followed by pound-currency-sign.

So it seems to me you have misunderstood what base64_decode does. It's manipulating binary octets, and has no knowledge of character encodings.
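To make the octets-versus-characters distinction concrete, here is a minimal sketch. It does not call Saxon-JS's internal base64_decode; it assumes the standard `atob` function as a stand-in, since `atob` also returns one character in the range 0-255 per octet. The variable names are illustrative only.

```javascript
// Sigma (U+03A3) is the two octets CE A3 in UTF-8; base64 of those octets is "zqM=".
const base64Sigma = "zqM=";

// atob yields a "binary string": each character stands for one octet (0-255).
// Read as ordinary text, these two characters are I-circumflex and pound sign.
const binaryString = atob(base64Sigma);

// Recover the raw octets from the binary string.
const octets = Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));

// Only an explicit UTF-8 decoding step turns the octets back into sigma.
const text = new TextDecoder("utf-8").decode(octets);

console.log(binaryString.length, text.length); // two "characters" versus one
```

The point of the sketch is that the base64 step itself is lossless; the character encoding only enters when the octets are turned into text.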

(Incidentally part of the motivation of using the sigma character in the SEF was so that encoding errors of this kind would show up in a recognizable way, rather than leading to unpredictable data corruptions.)

RE: SaxonJS 1.0.0 base64_decode doesn't use UTF-8 - Added by Johan Walters almost 8 years ago

We actually don't use base64_decode directly, rather we supply base64 payloads into URIs. I did mistakenly assume that this function would be the cause, but it must be in the interpretation of the result: somehow Saxon-JS treats the encoding of URIs differently in 1.0.0 (I think).

In the beta version of Saxon-JS the only way to load a SEF successfully was via 'stylesheetLocation' (we didn't manage to use 'stylesheetText'). Therefore we call transform with stylesheetLocation using a URI with the SEF embedded in base64: "data:text/xml;base64,PHBhY2thZ2UgeG1sbnM9Imh...." etc. This worked fine.

Now if we do the same with version 1.0.0, an error is thrown, caused by a parse error: "error on line 11394 at column 4: ParsePI: PI Î space expected". If we debug the obtained string, we see (I-circumflex pound).

It seems that after base64 decoding, the string is interpreted as a Javascript string instead of UTF-8, as it was before.

Kind regards,

Johan Walters

RE: SaxonJS 1.0.0 base64_decode doesn't use UTF-8 - Added by Michael Kay almost 8 years ago

Have you tried setting the charset in the data URI?

@"data:text/xml;charset=utf-8;base64,PHBhY2thZ2UgeG1sbnM9Imh...."@

(I don't think Saxon is decoding the data in this URI - I think we're just passing it to XmlHttpRequest)
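For anyone building such a URI in script, the following is a rough sketch of one way to do it. The helper name `toUtf8DataUri` and the sample XML are hypothetical and not part of Saxon-JS; the sketch only assumes the standard `TextEncoder` and `btoa` functions.

```javascript
// Hypothetical helper (not a Saxon-JS API): encode text as UTF-8 octets,
// base64 them, and declare the charset explicitly in the data URI.
function toUtf8DataUri(xmlText) {
  const octets = new TextEncoder().encode(xmlText); // always UTF-8
  let binaryString = "";
  for (const octet of octets) {
    // btoa expects one character in the range 0-255 per octet.
    binaryString += String.fromCharCode(octet);
  }
  return "data:text/xml;charset=utf-8;base64," + btoa(binaryString);
}

// Made-up sample content containing a non-ASCII character.
const sampleXml = "<doc>\u03A3</doc>";
const uri = toUtf8DataUri(sampleXml);
console.log(uri);
```

The resulting string is what would be supplied as the stylesheetLocation value; without the charset parameter, whatever resolves the URI is free to pick a different default encoding for the decoded octets.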

RE: SaxonJS 1.0.0 base64_decode doesn't use UTF-8 - Added by Johan Walters almost 8 years ago

Thank you, that worked perfectly :-)
