How are Unicode characters handled in Saxon-C?
Added by Martin Honnen over 5 years ago
Now that I know how to build and run C++ programs using Saxon-C, I am trying to understand how Unicode characters are handled, given that the APIs use char*.
Some tests suggest that the result Saxon returns as a char* (from methods like runQueryToString) is UTF-8 encoded: I get readable and correct output for Basic Multilingual Plane (BMP) characters in my Windows 10 console window or PowerShell window when I use chcp 65001 to set the console code page to UTF-8.
Those consoles don't seem to be able to display characters outside the BMP anyway, for lack of suitable fonts. Still, when I simply run a file with such characters (like <item> </item>) through type, the output is displayed as <item>[?][?]</item> (the [?] is supposed to represent a box with a question mark), whereas trying to output such a string returned from Saxon with cout fails to display anything.
A sample I have written to test this is:
#include "../../Saxon.C.API/SaxonProcessor.h"
#include "../../Saxon.C.API/XdmValue.h"
#include "../../Saxon.C.API/XdmItem.h"
#include "../../Saxon.C.API/XdmNode.h"
#include "../../Saxon.C.API/XdmAtomicValue.h"
#include <string>
#include <iostream>

using namespace std;

// Load the file as the context item, evaluate "." and print the serialized result.
void fileTest(SaxonProcessor * processor, XQueryProcessor * queryProc, char *filename) {
    queryProc->clearProperties();
    queryProc->clearParameters(true);
    queryProc->setContextItemFromFile(filename);
    queryProc->setQueryContent(".");
    cout << "File " << filename << " evaluates to :|" << queryProc->runQueryToString() << "|" << endl;
}

int main(int argc, char *argv[]) {
    SaxonProcessor * processor = new SaxonProcessor(false);
    cout << "Test: XQueryProcessor with Saxon version=" << processor->version() << endl << endl;
    XQueryProcessor * query = processor->newXQueryProcessor();
    // Each command line argument is treated as an input file name.
    for (int i = 1; i < argc; i++) {
        fileTest(processor, query, argv[i]);
    }
    delete query;
    processor->release();
    // Wait for input so the console window stays open.
    char c;
    cin >> c;
    return 0;
}
When I pass two input files as command line arguments, input1.xml as
<?xml version="1.0" encoding="utf-8"?>
<root>
<items>
<item>¿Qué pasa?</item>
<item>Umlaut test:äöü߀</item>
<item>μ</item>
<test>
<descripción>La versión española.</descripción>
</test>
</items>
</root>
and input2.xml as
<?xml version="1.0" encoding="utf-8"?>
<root>
<items>
<item>abc</item>
<item> </item>
<item> </item>
</items>
</root>
the output shows no content at all for the second file:
Test: XQueryProcessor with Saxon version=Saxon/C 1.1.3 running with Saxon-HE 9.8.0.15J from Saxonica
File input1.xml evaluates to :|<root>
<items>
<item>¿Qué pasa?</item>
<item>Umlaut test:äöü߀</item>
<item>μ</item>
<test>
<descripción>La versión española.</descripción>
</test>
</items>
</root>
|
File input2.xml evaluates to :||
Is there a known issue with Unicode characters outside the BMP? Or why is the string not displayed?
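One way to check what Saxon actually returned, independent of what the console can render, is to print the bytes of the string in hex. The following is only a minimal sketch: hexDump is a hypothetical helper written for illustration, not part of the Saxon/C API, and it assumes the pointer is a non-null, NUL-terminated string.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the Saxon/C API): renders every byte of a
// NUL-terminated string as two lowercase hex digits, separated by spaces.
std::string hexDump(const char* s) {
    assert(s != nullptr);  // assumption: the caller has already checked for a null result
    std::string out;
    char buf[4];
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s); *p != 0; ++p) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(*p));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling something like cout << hexDump(queryProc->runQueryToString()) in the sample above should then produce pure ASCII output. In UTF-8, a character outside the BMP appears as a four-byte sequence whose first byte is in the range f0 to f4, so if such sequences show up in the dump, the string itself is intact and only its display is failing.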
Replies (2)
RE: How are Unicode characters handled in Saxon-C? - Added by Martin Honnen over 5 years ago
I think I have found an explanation in https://accu.org/index.php/journals/2404: there seems to be some restriction on console output on Windows for non-BMP characters.
Based on that, it is not a Saxon issue at all.
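Since the console cannot be trusted to show such characters, a small check in code can confirm whether a returned string contains them at all. This is a rough sketch under the assumption that the string is well-formed UTF-8; countNonBmp is a hypothetical helper name, not something Saxon/C provides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the Saxon/C API): counts the code points
// above U+FFFF in a NUL-terminated string, assuming well-formed UTF-8.
std::size_t countNonBmp(const char* s) {
    assert(s != nullptr);  // assumption: the caller has already checked for a null result
    std::size_t count = 0;
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s); *p != 0; ++p) {
        // In well-formed UTF-8, lead bytes 0xF0..0xF4 start a four-byte
        // sequence, and only code points U+10000..U+10FFFF need four bytes.
        if (*p >= 0xF0 && *p <= 0xF4) {
            ++count;
        }
    }
    return count;
}
```

If countNonBmp reports a non-zero count for the result of the second input file, the data came back from Saxon correctly and the empty display is down to the console.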
RE: How are Unicode characters handled in Saxon-C? - Added by O'Neil Delpratt over 5 years ago
Glad to hear you found an explanation to the problem. Please do not hesitate to contact us in case of any other questions.