SaxonCS output encoding support in comparison with Saxon.NET
Added by Martin Honnen over 2 years ago
I think that, because of the default restrictions in .NET 5/6 (Core) on the supported Encodings, SaxonCS.exe transform supports far fewer encodings than Saxon.NET's Transform.exe does. For example, with Transform.exe of Saxon 10.8 on my machine I can use !encoding=Windows-1252 without problems, while SaxonCS gives an error such as "Unknown encoding requested: Windows-1252".
This should be fixable (I hope) by SaxonCS.exe calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance). The documentation explaining that is at https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getencodings?view=net-6.0.
What do you think? Is the currently limited support of a variety of encodings due to .NET core default limitations a deliberate decision or just something you weren't aware of?
Replies (13)
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
Can't users with specialist requirements for obscure encodings register them themselves?
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 2 years ago
When calling the SaxonCS API from their own .NET code, yes, I think so, but I don't see, or at least don't know, how to do that for the SaxonCS.exe transform command line tool.
SaxonCS output encoding support in comparison with Saxon.NET - Added by Norm Tovey-Walsh over 2 years ago
Saxonica Developer Community notifications@plan.io writes:
What do you think? Is the currently limited support of a variety of
encodings due to .NET core default limitations a deliberate decision
or just something you weren't aware of?
It’s just an oversight, I’m sure. I notice that I stumbled over the
RegisterProvider approach when I was writing unit tests for the XML
Resolver:
[Test]
public void testDataUricharset() {
    string href = "greek.txt";
    string baseuri = "http://example.com/";
    // We don't do this in the resolver, but we do it here to demonstrate that
    // it will work in applications that use the resolver.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    string line = null;
    ResolvedResource result = resolver.ResolveUri(href, baseuri);
    using (StreamReader reader = new StreamReader(result.GetInputStream())) {
        line = reader.ReadLine();
    }
    Assert.AreEqual("ΎχΎ", line);
}
I can’t think of any reason why SaxonCS shouldn’t do this. Or have an
option to do it, at least.
Be seeing you,
norm
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 2 years ago
It seems to be more complicated than just calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); the code
using Saxon.Api;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
try
{
    Console.WriteLine(Encoding.GetEncoding("Windows-1252"));
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}
Processor processor = new();
var compiler = processor.NewXQueryCompiler();
var query = compiler.Compile(@"declare namespace output = 'http://www.w3.org/2010/xslt-xquery-serialization'; declare option output:method 'text'; declare option output:item-separator ' '; declare option output:encoding 'Windows-1252'; (1 to 2) ! 'Test. Test.,Test'");
using (var resultStream = File.OpenWrite("test1.csv"))
{
    query.Load().Run(processor.NewSerializer(resultStream));
}
shows the encoding is found but SaxonCS nevertheless gives an error:
Saxon.Api.SaxonApiException
HResult=0x80131500
Message = Unknown encoding requested: Windows-1252
Source = SaxonCS
Stack trace:
   at Saxon.Hej.s9api.Serializer.getReceiver(PipelineConfiguration pipe, SerializationProperties params)
   at Saxon.Hej.s9api.XQueryEvaluator.getDestinationReceiver(Destination destination)
   at Saxon.Hej.s9api.XQueryEvaluator.run(Destination destination)
   at Saxon.Hej.s9api.XQueryEvaluator.run()
   at Saxon.Api.XQueryEvaluator.Run(IDestination destination)
   at Program.<Main>$(String[] args) in C:\Users\marti\source\repos\SaxonCSEncodingTest1\Program.cs: line 24
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 2 years ago
Meanwhile, XmlPrime with .NET 6, when using Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), does run such a query fine:
using XmlPrime;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
XQuery query = XQuery.Compile(@"xquery version '3.1'; declare namespace output = 'http://www.w3.org/2010/xslt-xquery-serialization'; declare option output:method 'text'; declare option output:item-separator ' '; declare option output:encoding 'Windows-1252'; (1 to 2) ! 'Test. Test.,Test'", new XQuerySettings() { XQueryVersion = XQueryVersion.XQuery31 });
using (var resultStream = File.OpenWrite("result-test1.txt"))
{
    query.Serialize(resultStream);
}
Without the Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) call it says that Windows-1252 is an unsupported encoding. But using Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) enables XmlPrime to use such an encoding (at least on Windows) with .NET 6/.NET Core.
What does SaxonCS do differently, so that even with Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) it doesn't know/find the encoding?
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
Saxon's CharacterSetFactory includes a small set of character sets for which Saxon has built-in knowledge. In the Java version of the class, a request for an encoding outside that set falls back to using information from the JDK. That fallback code is absent in SaxonCS.
The reason we care is that we have to produce fallback serialization as XML character references for characters that aren't in the target character set. In the past that involved throwing and catching exceptions; today it involves calling CharacterSetEncoder.canEncode() potentially for every character, which is a heavy overhead even though we cache the result.
We could implement this kind of thing using an EncoderFallback implementation on .NET, but it's a very fiddly API (I've just re-read the specs and it's very hard to see exactly how to replace x1234 with &#x1234;), and it wasn't a strong enough requirement to make it into the first cut.
Does XmlPrime handle the fallback processing of unencodable characters correctly, as a matter of interest?
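To make the lookup order described above concrete, here is a minimal sketch of "built-in table first, platform second". It is not Saxon's actual code: the class name CharacterSetLookup, the method Describe, and the three table entries are hypothetical and chosen only for illustration. The only real API it relies on is Encoding.GetEncoding(string), which throws ArgumentException for a name the platform does not recognise.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharacterSetLookup
{
    // Hypothetical built-in table: encodings whose repertoire is known
    // without asking the platform, expressed as a predicate on code points.
    private static readonly Dictionary<string, Func<int, bool>> BuiltIn =
        new Dictionary<string, Func<int, bool>>(StringComparer.OrdinalIgnoreCase)
        {
            ["us-ascii"] = cp => cp < 0x80,
            ["iso-8859-1"] = cp => cp < 0x100,
            ["utf-8"] = cp => true,
        };

    // Reports where knowledge about the named encoding would come from.
    public static string Describe(string name)
    {
        if (BuiltIn.ContainsKey(name)) return "built-in";
        try
        {
            // Platform fallback: only succeeds for encodings .NET knows about,
            // which on .NET Core depends on the registered providers.
            Encoding.GetEncoding(name);
            return "platform";
        }
        catch (ArgumentException)
        {
            return "unknown";
        }
    }
}
```

With this sketch, a name such as Windows-1252 would only reach the "platform" branch after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called; a platform-supplied encoding still needs some way to answer "can this character be encoded?", which is the harder part discussed in this thread.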
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
I've raised a StackOverflow question at
https://stackoverflow.com/questions/72736571/handling-unencodable-characters-in-c-sharp
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
Unless someone produces a better answer, I think we can use the logic of the JavaCharacterSet (which builds a local table of which characters are encodable and which aren't), with the difference that we have to implement Encoding.canEncode() ourselves: we do this by passing a single character to the Encoder, with a fallback action that returns a substitute string, and checking whether we get the substitute string back.
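A rough sketch of that idea, under stated assumptions: the class name EncodabilityTable, the substitute string "unencodable", and the one-byte-per-char table layout are all hypothetical, and this is not the code that went into SaxonCS. It covers only single UTF-16 chars (the Basic Multilingual Plane); surrogate pairs would need separate handling. It uses Encoding.Clone() to obtain a writable copy and attaches an EncoderReplacementFallback to it.

```csharp
using System;
using System.Text;

public sealed class EncodabilityTable
{
    // Hypothetical substitute string; it must itself be encodable in the target encoding.
    private const string Substitute = "unencodable";
    private const byte Unknown = 0, Good = 1, Bad = 2;

    private readonly Encoding probe;
    private readonly byte[] substituteBytes;
    // One cached answer per UTF-16 code unit, filled in lazily.
    private readonly byte[] known = new byte[65536];

    public EncodabilityTable(Encoding encoding)
    {
        // A clone is writable, so we can swap in our own fallback.
        probe = (Encoding)encoding.Clone();
        probe.EncoderFallback = new EncoderReplacementFallback(Substitute);
        substituteBytes = probe.GetBytes(Substitute);
    }

    public bool CanEncode(char c)
    {
        if (known[c] == Unknown)
        {
            // Encode the single character; if the output is exactly the encoded
            // substitute string, the fallback fired and the character is unencodable.
            byte[] bytes = probe.GetBytes(c.ToString());
            known[c] = bytes.AsSpan().SequenceEqual(substituteBytes) ? Bad : Good;
        }
        return known[c] == Good;
    }
}
```

The table means the encoder is consulted at most once per distinct character, so the per-character cost during serialization is an array lookup.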
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
There's a helpful response to my StackOverflow question which should provide a solution, though it may need adjustment for handling surrogate pairs.
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
Actually, I think I asked the wrong question.
We don't actually want to write an EncoderFallback handler that injects XML entity references (because we also need to inject things like &lt; on the same pass through the data). What we really need to do is to implement the method
static bool CanEncode(Encoding, char) {...}
which returns true if the character is encodable, false if not.
To do that, we have to pass a single-character string to the Encoder, and can follow the example at
https://docs.microsoft.com/en-us/dotnet/api/system.text.encoderreplacementfallback?view=net-6.0
to generate a magic response (e.g. "unencodable") as the replacement string for unencodable characters. We can then plug this into the existing logic for the Java class JavaCharacterSet.
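The reason for wanting a CanEncode test rather than a fallback handler can be sketched as a single escaping pass. The following is an illustration only, not Saxon's serializer: the class TextEscaper and its Escape method are hypothetical, and the encodability test is passed in as a predicate on code points so that any implementation of CanEncode can be plugged in.

```csharp
using System;
using System.Text;

public static class TextEscaper
{
    // Escapes markup characters and unencodable characters in one pass.
    // canEncode is a caller-supplied predicate on Unicode code points.
    public static string Escape(string text, Func<int, bool> canEncode)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            int cp = c;
            // Combine a well-formed surrogate pair into one code point.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                cp = char.ConvertToUtf32(c, text[i + 1]);
                i++;
            }
            switch (cp)
            {
                case '<': sb.Append("&lt;"); break;
                case '&': sb.Append("&amp;"); break;
                default:
                    if (!canEncode(cp))
                        sb.Append("&#x").Append(cp.ToString("x")).Append(';'); // character reference
                    else if (cp <= 0xFFFF)
                        sb.Append((char)cp);
                    else
                        sb.Append(char.ConvertFromUtf32(cp));
                    break;
            }
        }
        return sb.ToString();
    }
}
```

For example, with a stand-in predicate that accepts only ASCII, TextEscaper.Escape("a<b & \u1234", cp => cp < 0x80) returns "a&lt;b &amp; &#x1234;". Both kinds of substitution happen in the same loop, which an encoder-level fallback could not do for the markup characters.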
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
This is now implemented.
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Martin Honnen over 2 years ago
Great. Just to be sure when I should go look for it, implemented on the 12 branch, i.e. the next major release, or on the 11 branch as well?
RE: SaxonCS output encoding support in comparison with Saxon.NET - Added by Michael Kay over 2 years ago
This will be in 12.x.