How can I embed unicode string constants in a source file?
Solution 1
A tedious but portable way is to build your strings using numeric escape codes. For example:
const wchar_t *string = L"דונדארןמע";
becomes:
const wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
You can use online tools for conversion, such as this one. It outputs the JavaScript escape format \uXXXX, so just search and replace \u with \x to get the C format.
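As a minimal sketch of the escaped form, the helpers below spell the same string with \x escapes and check the resulting code points. The names hebrew_hex, hebrew_ucn and check_hebrew_strings are made up for illustration. The second helper uses \uXXXX universal character names, which C99 and C++ also accept in wide literals; the two forms compare equal only on the assumption that the compiler's wide execution character set is UTF-16 or UTF-32, which is the usual default.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: the Hebrew string spelled with \x escapes only,
// so the source file itself stays plain ASCII.
inline const wchar_t* hebrew_hex() {
    return L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
}

// Hypothetical helper: the same string using universal character names.
// Assumption: the wide execution charset is UTF-16 or UTF-32.
inline const wchar_t* hebrew_ucn() {
    return L"\u05d3\u05d5\u05e0\u05d3\u05d0\u05e8\u05df\u05de\u05e2";
}

inline void check_hebrew_strings() {
    assert(std::wcslen(hebrew_hex()) == 9);               // nine characters
    assert(hebrew_hex()[0] == 0x05d3);                    // first one is U+05D3
    assert(std::wcscmp(hebrew_hex(), hebrew_ucn()) == 0); // both spellings agree
}
```

Calling check_hebrew_strings() once at the start of a test run is enough to confirm that the compiler produced the expected values.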
Solution 2
You have to tell GCC which encoding your source file uses for those characters. Use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it which encoding to use for those string literals at runtime. That determines the values of the wchar_t items in the strings. You set that encoding using -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the code unit size of the encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of the wchar_t that GCC uses.
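One way to catch a size mismatch early is a compile-time check. The sketch below is only an illustration; wchar_bits and fits_in_wchar are hypothetical names, and it needs C++11 for constexpr and static_assert.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits for the current compiler and flags.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Hypothetical helper: true if a code unit of `encoding_bits` bits fits in wchar_t.
constexpr bool fits_in_wchar(std::size_t encoding_bits) {
    return encoding_bits <= wchar_bits;
}

// UTF-16 code units need 16 bits. For a build that selects UTF-32 as the
// wide execution charset, the check would use 32 instead and would fail
// to compile whenever wchar_t is only 16 bits wide.
static_assert(fits_in_wchar(16), "wchar_t is too narrow for UTF-16 code units");
```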
You can adjust that width with the option -fshort-wchar, which is mainly useful for compiling programs for Wine, where compatibility with Windows is needed. With it, wchar_t will most likely be 16 bits instead of 32 bits, which is its usual width for GCC on Linux.
Those options are described in more detail in man gcc, the GCC manpage.
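Put together, a test file might look like the sketch below. The file name test.cpp and the helper names are assumptions for illustration, and the charset names on the command line must be ones the system's iconv accepts. The check assumes the wide execution charset ends up as UTF-16 or UTF-32.

```cpp
// Save this file as UTF-8. A possible build line, using the flags above:
//   g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32 test.cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: the literal written directly in Hebrew.
inline const wchar_t* hebrew_literal() {
    return L"דונדארןמע";
}

// Compares the literal with the code points it is expected to contain.
inline bool literal_matches_code_points() {
    const wchar_t expected[] = {0x05d3, 0x05d5, 0x05e0, 0x05d3, 0x05d0,
                                0x05e8, 0x05df, 0x05de, 0x05e2, 0};
    return std::wcscmp(hebrew_literal(), expected) == 0;
}

inline void check_hebrew_literal() {
    assert(literal_matches_code_points());
}
```

If the check fails, the compiler most likely read the file in a different encoding than the one it was saved in.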
Comments
-
jkp almost 2 years
I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew etc.
The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...
///
/// Protected: TestGetHebrewConfigString
///
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
    prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
    CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
    prIniListReader.SetCurrentSection( strHebrewSubSection );
    CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}
This quite simply doesn't work. Previously I worked around this using a macro which calls a routine to transform a narrow string to a wide string (we use towstring all over the place in our applications, so it's existing code):
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )

wstring towstring( LPCSTR lpszValue )
{
    wostringstream os;
    os << lpszValue;
    return os.str();
}
The assertion in the test above then became:
CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
This worked OK on OS X, but now I'm porting to Linux and I'm finding that the tests are all failing. It all feels rather hackish as well. Can anyone tell me if they have a nicer solution to this problem?
-
deft_code almost 13 years
In Windows wchar_t is 16 bits and everywhere else it is 32 bits. Does this affect which hex literals need to be listed? Or does \x05d3 work equally well for 16 and 32 bit?
-
Pavel Hájek almost 13 years
There is no limit on the number of hex digits after \x, so this should work the same whatever sizeof(wchar_t). See this topic for more info: stackoverflow.com/questions/2735101/unicode-escaping-in-c-c
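The missing digit limit cuts both ways: a \x escape keeps consuming hex digits, so an ordinary character such as 'a' placed right after it would be absorbed into the escape. A minimal sketch of the usual workaround, splitting the literal, follows; dalet_then_a and check_dalet_then_a are hypothetical names.

```cpp
#include <cassert>
#include <cwchar>

// U+05D3 followed by the letter 'a'. Written as one literal, the escape
// would swallow the 'a' and name the single value 0x5d3a. Splitting the
// literal ends the escape, and the compiler joins the two pieces.
inline const wchar_t* dalet_then_a() {
    return L"\x05d3" L"a";
}

inline void check_dalet_then_a() {
    assert(std::wcslen(dalet_then_a()) == 2);
    assert(dalet_then_a()[0] == 0x05d3);  // fits in 16 bits, so same value on either width
    assert(dalet_then_a()[1] == L'a');
}
```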