How can I embed unicode string constants in a source file?


Solution 1

A tedious but portable way is to build your strings using numeric escape codes. For example:

wchar_t *string = L"דונדארןמע";

becomes:

wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";

You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.

You can use online tools for conversion, such as this one. It outputs the JavaScript escape format \uXXXX, so just search & replace \u with \x to get the C format.

Solution 2

You have to tell GCC which encoding your source file uses for those characters. Use the option -finput-charset=charset, for example -finput-charset=UTF-8.

You also need to tell it which encoding wide string literals should use at runtime; that determines the values stored in the wchar_t elements of the strings. You set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the code-unit size of that encoding (32 bits for UTF-32, 16 bits for UTF-16) must not exceed the size of wchar_t that GCC uses.

You can adjust that width with -fshort-wchar, which makes wchar_t 16 bits instead of its usual 32 bits for GCC on Linux. That option is mainly useful when compiling programs for Wine, which are designed to be compatible with Windows.

These options are described in more detail in the GCC manpage (man gcc).


Author: jkp (of Spotify {Stockholm|London} fame)

Updated on June 04, 2022

Comments

  • jkp
    jkp almost 2 years

    I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew, etc.

    The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...

    ///
    /// Protected: TestGetHebrewConfigString
    ///  
    void CPrIniFileReaderTest::TestGetHebrewConfigString()
    {
        prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
        CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
        prIniListReader.SetCurrentSection( strHebrewSubSection );   
    
        CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
    }
    

    This quite simply doesn't work. Previously I worked around this with a macro that calls a routine to transform a narrow string into a wide string (we use towstring all over the place in our applications, so it's existing code):

    #define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
    
    wstring towstring( LPCSTR lpszValue )
    {
        wostringstream os;
        os << lpszValue;
        return os.str();
    }
    

    The assertion in the test above then became:

    CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
    

    This worked OK on OS X, but now I'm porting to Linux and I'm finding that the tests are all failing; it all feels rather hackish as well. Can anyone tell me if they have a nicer solution to this problem?

  • deft_code
    deft_code almost 13 years
    On Windows wchar_t is 16 bits; everywhere else it is 32 bits. Does this affect what hex literals need to be listed? Or does \x05d3 work equally well for 16 and 32 bits?
  • Pavel Hájek
    Pavel Hájek almost 13 years
    There is no limit on the number of hex digits after \x, so this should work the same whatever sizeof(wchar_t) is. See this question for more info: stackoverflow.com/questions/2735101/unicode-escaping-in-c-c
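    To side-step the digit-count question entirely, standard C and C++ also accept universal character names (\uXXXX, always exactly four hex digits), which denote a Unicode code point regardless of sizeof(wchar_t). A small sketch:

    ```cpp
    #include <cassert>
    #include <cwchar>

    int main() {
        // \uXXXX always takes exactly four hex digits, so there is no
        // ambiguity about where the escape ends, unlike \x.
        const wchar_t *ucn = L"\u05d3\u05d5\u05e0";
        const wchar_t *hex = L"\x05d3\x05d5\x05e0";
        assert(wcscmp(ucn, hex) == 0); // same three Hebrew code points
        return 0;
    }
    ```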