SaxonCS output encoding support in comparison with Saxon.NET

Added by Martin Honnen over 1 year ago

Probably because of the default restrictions on supported encodings in .NET 5/6 (Core), the SaxonCS.exe transform command supports far fewer encodings than Saxon.NET's Transform.exe does. For example, with Transform.exe of Saxon 10.8 on my machine I can use !encoding=Windows-1252 without problems, while SaxonCS gives an error such as Unknown encoding requested: Windows-1252.

This should be fixable (I hope) by SaxonCS.exe doing Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);. Documentation explaining that is in https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getencodings?view=net-6.0.

What do you think? Is the currently limited support of a variety of encodings due to .NET core default limitations a deliberate decision or just something you weren't aware of?


Replies (13)


RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 1 year ago

Can't users with specialist requirements for obscure encodings register them themselves?

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 1 year ago

When calling the SaxonCS API with their own .NET code, yes, I think so, but I don't see or at least know how to do that for the SaxonCS.exe transform command line tool.

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Norm Tovey-Walsh over 1 year ago

Saxonica Developer Community writes:

> What do you think? Is the currently limited support of a variety of
> encodings due to .NET core default limitations a deliberate decision
> or just something you weren't aware of?

It’s just an oversight, I’m sure. I notice that I stumbled over the
RegisterProvider approach when I was writing unit tests for the XML
Resolver:

[Test]
public void testDataUricharset() {
    string href = "greek.txt";
    string baseuri = "http://example.com/";

    // We don't do this in the resolver, but we do it here to demonstrate that
    // it will work in applications that use the resolver.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    string line = null;
    ResolvedResource result = resolver.ResolveUri(href, baseuri);
    using (StreamReader reader = new StreamReader(result.GetInputStream())) {
        line = reader.ReadLine();
    }
    Assert.AreEqual("ΎχΎ", line);
}

I can’t think of any reason why SaxonCS shouldn’t do this. Or have an
option to do it, at least.

Be seeing you,
norm


RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 1 year ago

It seems to be more complicated than just calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); the code

using Saxon.Api;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

try
{
    Console.WriteLine(Encoding.GetEncoding("Windows-1252"));
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}


Processor processor = new();

var compiler = processor.NewXQueryCompiler();

var query = compiler.Compile(@"declare namespace output = 'http://www.w3.org/2010/xslt-xquery-serialization'; declare option output:method 'text'; declare option output:item-separator '&#10;'; declare option output:encoding 'Windows-1252'; (1 to 2) ! 'Test.&#160;Test.,Test'");

using (var resultStream = File.OpenWrite("test1.csv"))
{
    query.Load().Run(processor.NewSerializer(resultStream));
}

shows the encoding is found but SaxonCS nevertheless gives an error:

Saxon.Api.SaxonApiException
  HResult=0x80131500
  Message=Unknown encoding requested: Windows-1252
  Source=SaxonCS
  StackTrace:
   at Saxon.Hej.s9api.Serializer.getReceiver(PipelineConfiguration pipe, SerializationProperties params)
   at Saxon.Hej.s9api.XQueryEvaluator.getDestinationReceiver(Destination destination)
   at Saxon.Hej.s9api.XQueryEvaluator.run(Destination destination)
   at Saxon.Hej.s9api.XQueryEvaluator.run()
   at Saxon.Api.XQueryEvaluator.Run(IDestination destination)
   at Program.<Main>$(String[] args) in C:\Users\marti\source\repos\SaxonCSEncodingTest1\Program.cs:line 24

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 1 year ago

Meanwhile, XmlPrime with .NET 6, when using Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), does run such a query fine:

using XmlPrime;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

XQuery query = XQuery.Compile(@"xquery version '3.1'; declare namespace output = 'http://www.w3.org/2010/xslt-xquery-serialization'; declare option output:method 'text'; declare option output:item-separator '&#10;'; declare option output:encoding 'Windows-1252'; (1 to 2) ! 'Test.&#160;Test.,Test'", new XQuerySettings() { XQueryVersion = XQueryVersion.XQuery31 });

using (var resultStream = File.OpenWrite("result-test1.txt"))
{
    query.Serialize(resultStream);
}

Without the Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) it says that Windows-1252 is an unsupported encoding. But using Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) enables XmlPrime to use such an encoding (at least on Windows) with .NET 6/.NET Core.

What does SaxonCS do differently that even with Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) it doesn't know/find the encoding?

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 1 year ago

Saxon's CharacterSetFactory includes a small set of character sets for which Saxon has built-in knowledge. In the Java version of the class, a request for an encoding outside that set falls back to using information from the JDK. That fallback code is absent in SaxonCS.

The reason we care is that we have to produce fallback serialization as XML character references for characters that aren't in the target character set. In the past that involved throwing and catching exceptions; today it involves calling CharsetEncoder.canEncode() potentially for every character, which is a heavy overhead even though we cache the result.
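To make the character-reference fallback concrete, here is a minimal sketch in C#. It is not Saxon's serializer code: the names CharRefWriter and Escape are hypothetical, and the encodability test is passed in as a predicate so that the sketch does not depend on any particular encoding API. The demo assumes a US-ASCII target, so every code point from 0x80 upwards counts as unencodable.

```csharp
using System;
using System.Text;

// Demo: pretend the target encoding is US-ASCII (an assumption for illustration).
Console.WriteLine(CharRefWriter.Escape("Grüße €5", cp => cp < 0x80));
// prints: Gr&#xFC;&#xDF;e &#x20AC;5

// Hypothetical helper, not part of the Saxon API.
static class CharRefWriter
{
    // Copies text, replacing each code point the target cannot encode
    // with a hexadecimal XML character reference.
    public static string Escape(string text, Func<int, bool> canEncode)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int start = i;
            int cp = text[i];
            // Combine a well-formed surrogate pair into a single code point.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                cp = char.ConvertToUtf32(text[i], text[i + 1]);
                i++;
            }
            if (canEncode(cp))
                sb.Append(text, start, i - start + 1);   // copy the original char(s)
            else
                sb.Append("&#x").Append(cp.ToString("X")).Append(';');
        }
        return sb.ToString();
    }
}
```

A real serializer would also escape markup characters such as < and & in the same pass, which this sketch leaves out.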

We could implement this kind of thing using an EncoderFallback implementation on .NET, but it's a very fiddly API (I've just re-read the specs and it's very hard to see exactly how to replace x1234 with &#x1234;), and it wasn't a strong enough requirement to make it into the first cut.

Does XmlPrime handle the fallback processing of unencodable characters correctly, as a matter of interest?

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 1 year ago

Unless someone produces a better answer, I think we can use the logic of the JavaCharacterSet (which builds a local table of which characters are encodable and which aren't), with the difference that we have to implement Encoding.canEncode() ourselves: we do this by passing a single character to the Encoder, with a fallback action that returns a substitute string, and checking whether we get the substitute string back.

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 1 year ago

There's a helpful response to my StackOverflow question which should provide a solution, though it may need adjustment for handling surrogate pairs.

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 1 year ago

Actually, I think I asked the wrong question.

We don't actually want to write an EncoderFallback handler that injects XML character references (because we also need to inject things like &lt; on the same pass through the data). What we really need to do is to implement the method

static bool CanEncode(Encoding, char) {...}

which returns true if the character is encodable, false if not.

To do that, we have to pass a single-character string to the Encoder, and can follow the example at

https://docs.microsoft.com/en-us/dotnet/api/system.text.encoderreplacementfallback?view=net-6.0

to generate a magic response (e.g. "unencodable") as the replacement string for unencodable characters. We can then plug this into the existing logic for the Java class JavaCharacterSet.
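A minimal sketch of that approach, assuming the probe is built by cloning the encoding and installing an EncoderReplacementFallback. The class name EncodabilityProbe and the marker value are hypothetical, and this is an illustration rather than the code that went into SaxonCS. It also assumes the marker string is itself encodable in the target encoding, and it treats a lone surrogate as unencodable, so surrogate pairs would need separate handling.

```csharp
using System;
using System.Linq;
using System.Text;

// Encoding.ASCII is used for the demo so that no code-page provider is needed.
Console.WriteLine(EncodabilityProbe.CanEncode(Encoding.ASCII, 'A'));  // True
Console.WriteLine(EncodabilityProbe.CanEncode(Encoding.ASCII, 'é'));  // False

// Hypothetical helper, not part of the Saxon API.
static class EncodabilityProbe
{
    // "Magic" replacement string; assumed to be encodable in the target encoding.
    private const string Marker = "unencodable";

    public static bool CanEncode(Encoding encoding, char ch)
    {
        // Clone() returns a writable copy, so the fallback can be replaced.
        // A real implementation would build the probe once and cache results.
        var probe = (Encoding)encoding.Clone();
        probe.EncoderFallback = new EncoderReplacementFallback(Marker);

        byte[] actual = probe.GetBytes(ch.ToString());
        byte[] marker = probe.GetBytes(Marker);

        // If the encoder substituted the marker, the character is not encodable.
        return !actual.SequenceEqual(marker);
    }
}
```

Cloning the encoding on every call is wasteful. Caching the probe per encoding and recording each character's result in a lookup table would match the table-based logic described for JavaCharacterSet.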

RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 1 year ago

Great. Just so I know where to look for it: will this be implemented on the 12 branch, i.e. the next major release, or on the 11 branch as well?
