How are Unicode characters handled in Saxon-C?

Added by Martin Honnen almost 5 years ago

Now that I know how to build and run C++ programs using Saxon-C, I am trying to understand how Unicode characters are handled, given that the APIs use char*.

From some tests, it seems that the result Saxon returns as a char* (from methods like runQueryToString) is UTF-8 encoded: I get readable and correct output for Basic Multilingual Plane characters in my Windows 10 console or PowerShell window when I use chcp 65001 to set the console code page to UTF-8.

Those consoles don't seem to be able to display characters outside the BMP anyway, for lack of suitable fonts. However, when I simply run a file with such characters (like <item> </item>) through type, the output is displayed as <item>[?][?]</item> (where [?] represents a box with a question mark), whereas outputting such a string returned from Saxon with cout displays nothing at all.

The sample I wrote to test this is:

#include "../../Saxon.C.API/SaxonProcessor.h"
#include "../../Saxon.C.API/XdmValue.h"
#include "../../Saxon.C.API/XdmItem.h"
#include "../../Saxon.C.API/XdmNode.h"
#include "../../Saxon.C.API/XdmAtomicValue.h"

#include <string>

#include <iostream>

using namespace std;

void fileTest(SaxonProcessor * processor, XQueryProcessor * queryProc, char *filename) {
	queryProc->clearProperties();
	queryProc->clearParameters(true);


	queryProc->setContextItemFromFile(filename);

	queryProc->setQueryContent(".");

	cout << "File " << filename << " evaluates to :|" << queryProc->runQueryToString() << "|" << endl;

}


int main(int argc, char *argv[]) {

	SaxonProcessor * processor = new SaxonProcessor(false);
	cout << "Test: XQueryProcessor with Saxon version=" << processor->version() << endl << endl;
	XQueryProcessor * query = processor->newXQueryProcessor();

	for (int i = 1; i < argc; i++) {
		fileTest(processor, query, argv[i]);
	}
	

	delete query;
	processor->release();

	char c;
	cin >> c;
	return 0;
}

When I pass two input files as command-line arguments, input1.xml as

<?xml version="1.0" encoding="utf-8"?>
<root>
  <items>
    <item>¿Qué pasa?</item>
    <item>Umlaut test:äöü߀</item>
    <item>μ</item>
    <test>
      <descripción>La versión española.</descripción>
    </test>
  </items>
</root>

and input2.xml as

<?xml version="1.0" encoding="utf-8"?>
<root>
  <items>
    <item>abc</item>
    <item> </item>
    <item> </item>
  </items>
</root>

the output shows no content at all for the second file:

Test: XQueryProcessor with Saxon version=Saxon/C 1.1.3 running with Saxon-HE 9.8.0.15J from Saxonica

File input1.xml evaluates to :|<root>
  <items>
      <item>¿Qué pasa?</item>
      <item>Umlaut test:äöü߀</item>
      <item>μ</item>
      <test>
         <descripción>La versión española.</descripción>
      </test>
  </items>
</root>
|
File input2.xml evaluates to :||

Is there a known issue with Unicode characters outside the BMP? If not, why is the string not displayed?


Replies (2)

RE: How are Unicode characters handled in Saxon-C? - Added by Martin Honnen almost 5 years ago

I think I have found an explanation at https://accu.org/index.php/journals/2404: there seems to be a restriction on Windows console output for non-BMP characters.

Based on that it is not a Saxon issue at all.
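
For anyone who wants to confirm this independently of the console, here is a minimal sketch that inspects the bytes of a returned char* instead of printing the text. It is not part of the Saxon-C API: hexDump and countNonBmp are hypothetical helper names, and the code assumes the string is well-formed UTF-8, as the tests above suggest. A code point outside the BMP is encoded in UTF-8 as a four-byte sequence whose lead byte has the bit pattern 11110xxx, so counting those lead bytes shows whether the non-BMP characters are actually present in the result.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Saxon-C): render each byte of a
// NUL-terminated string as two uppercase hex digits, separated by spaces.
std::string hexDump(const char* s) {
    std::string out;
    if (s == nullptr) {
        return out;  // treat a null result as an empty string
    }
    char buf[4];
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
         *p != 0; ++p) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(*p));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}

// Hypothetical helper (not part of Saxon-C): count the characters above
// U+FFFF by counting UTF-8 lead bytes of the form 11110xxx.
// Assumption: the input is well-formed UTF-8; no validation is done.
std::size_t countNonBmp(const char* s) {
    std::size_t count = 0;
    if (s == nullptr) {
        return count;
    }
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
         *p != 0; ++p) {
        if ((*p & 0xF8) == 0xF0) {
            ++count;
        }
    }
    return count;
}
```

In the sample above, the string returned by runQueryToString() could be stored in a const char* variable and passed to both helpers before it is written to cout. If countNonBmp reports the expected number of characters for input2.xml, the data coming back from Saxon is intact and only the console output is affected.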

RE: How are Unicode characters handled in Saxon-C? - Added by O'Neil Delpratt almost 5 years ago

Glad to hear you found an explanation to the problem. Please do not hesitate to contact us in case of any other questions.
