std::wstring VS std::string

Solution 1

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to hold a wide character, and then, things get tricky: On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to unicode.

On Linux?

Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:

#include <cstring>
#include <cwchar>   // for wcslen
#include <iostream>

int main()
{
    const char text[] = "olé";


    std::cout << "sizeof(char)    : " << sizeof(char) << "\n";
    std::cout << "text            : " << text << "\n";
    std::cout << "sizeof(text)    : " << sizeof(text) << "\n";
    std::cout << "strlen(text)    : " << strlen(text) << "\n";

    std::cout << "text(ordinals)  :";

    for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }

    std::cout << "\n\n";

    // - - -

    const wchar_t wtext[] = L"olé";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext           : " << wtext << "\n"; <- error
    std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext           : " << wtext << "\n";

    std::cout << "sizeof(wtext)   : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext)   : " << wcslen(wtext) << "\n";

    std::cout << "wtext(ordinals) :";

    for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
    {
        unsigned short wc = static_cast<unsigned short>(wtext[i]);
        std::cout << " " << static_cast<unsigned int>(wc);
    }

    std::cout << "\n\n";
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol�
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with char on Linux, you usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating or otherwise manipulating Unicode chars, because some combinations of bytes are invalid in UTF-8.
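
As a side illustration (my addition, not part of the original answer), counting code points in a UTF-8 encoded std::string is a matter of skipping the continuation bytes (those of the form 10xxxxxx). Even this only counts code points, not user-perceived characters (grapheme clusters), and it assumes the input is valid UTF-8:

#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 input.
std::size_t utf8_codepoint_count(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // not a continuation byte: a new code point starts here
            ++count;
    return count;
}

// utf8_codepoint_count("ol\xC3\xA9") == 3, while std::string("ol\xC3\xA9").size() == 4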

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char, on the different charsets/codepages produced all over the world, before the advent of Unicode.

So their solution was an interesting one: If an application works with char, then its char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and encoded in UTF-16, i.e. Unicode encoded in 2-byte units (or, at the very least, UCS-2, which just lacks surrogate pairs and thus characters outside the BMP, i.e. code points >= 0x10000).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using an API like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
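
To make that boundary concrete, here is a minimal Windows-only sketch (my addition, not part of the original answer) of converting a UTF-8 std::string to a UTF-16 std::wstring just before calling a wide API such as SetWindowTextW; error handling is omitted:

#include <windows.h>
#include <string>

// Convert a UTF-8 encoded std::string to a UTF-16 std::wstring with the
// Win32 conversion API, to be called right before a wide API call.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call: ask for the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    // Second call: perform the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// Usage near the API call, e.g.: SetWindowTextW(hwnd, utf8_to_wide(title).c_str());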

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, other than that UTF-8 and UTF-16 text will always use less than or the same amount of memory as UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same text in UTF-16.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2, and occasionally 4, bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.

See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

  1. When should I use std::wstring over std::string?

    On Linux? Almost never (§). On Windows? Almost always (§). On cross-platform code? Depends on your toolkit...

    (§) : unless you use a toolkit/framework saying otherwise

  2. Can std::string hold all the ASCII character set including special characters?

    Notice: A std::string is suitable for holding a 'binary' buffer, whereas a std::wstring is not!

    On Linux? Yes. On Windows? Only special characters available for the current locale of the Windows user.

    Edit (After a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. a char from 0 to 127 will be held correctly
    3. a char from 128 to 255 will have a meaning that depends on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
  3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC-based compilers that are ported to Windows. It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

  4. What is exactly a wide character?

    In C/C++, it's the character type wchar_t, which is larger than the simple char type. It is meant to hold characters whose code points (like Unicode glyphs) are larger than 255 (or 127, depending on the signedness of char...).

Solution 2

I recommend avoiding std::wstring on Windows or elsewhere, except when required by an interface, or anywhere near Windows API calls and the respective encoding conversions, as syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly a UI application, the suggestion is to store Unicode strings in std::string, encoded as UTF-8, and perform the conversion near the API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
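
As a hedged illustration of "performing conversion near API calls" (my addition, not from the article), the opposite direction on Windows, e.g. for text coming back from a wide API, could look like this:

#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring (as returned by wide Win32 APIs) back to a
// UTF-8 encoded std::string. Error handling is omitted for brevity.
std::string wide_to_utf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    // First call: ask for the required length in bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    // Second call: perform the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}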

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character could be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such 16-bit units (aka wide characters). A character in UTF-16 takes either one or two such units (two in the case of a surrogate pair).
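
A tiny sketch of that last point (my addition, using C++11 char16_t instead of wchar_t to keep it portable): a code point outside the BMP occupies two UTF-16 units.

#include <cassert>
#include <string>

int main()
{
    std::u16string s = u"\U0001F600";  // U+1F600, outside the BMP
    assert(s.size() == 2);             // stored as a surrogate pair: two 16-bit units
}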

Solution 3

So, every reader here should now have a clear understanding of the facts and the situation. If not, then you must read paercebal's outstandingly comprehensive answer above [btw: thanks!].

My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, it will not help anyway.

My solution, after in-depth investigation, much frustration and the resulting experience, is the following:

  1. accept that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)

  2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)

  3. accept that such a UTF8String object is just a dumb but cheap container. Never, ever access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text-manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those.)

  4. use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).

  5. use UCS2String instances whenever character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte representation. It is simple, fast, easy.

  6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

    UCS2String ConvertToUCS2( const UTF8String &str );
    UTF8String ConvertToUTF8( const UCS2String &str );
    

The conversions are straightforward, google should help here ...
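
For completeness, here is one possible minimal sketch of those two helpers (my addition, not the answer author's code). It repeats the typedefs from steps 2 and 4 so it is self-contained, handles only BMP code points (true UCS-2), and performs no validation of malformed input:

#include <cstddef>
#include <string>

typedef std::string  UTF8String;
typedef std::wstring UCS2String;

// UTF-8 -> UCS-2: decodes 1-, 2- and 3-byte sequences (BMP only).
UCS2String ConvertToUCS2( const UTF8String &str )
{
    UCS2String out;
    for (std::size_t i = 0; i < str.size(); )
    {
        const unsigned char c0 = static_cast<unsigned char>(str[i]);
        if (c0 < 0x80) {                       // 0xxxxxxx
            out += static_cast<wchar_t>(c0);
            i += 1;
        } else if (c0 < 0xE0) {                // 110xxxxx 10xxxxxx
            const unsigned char c1 = static_cast<unsigned char>(str[i + 1]);
            out += static_cast<wchar_t>(((c0 & 0x1F) << 6) | (c1 & 0x3F));
            i += 2;
        } else {                               // 1110xxxx 10xxxxxx 10xxxxxx
            const unsigned char c1 = static_cast<unsigned char>(str[i + 1]);
            const unsigned char c2 = static_cast<unsigned char>(str[i + 2]);
            out += static_cast<wchar_t>(((c0 & 0x0F) << 12) | ((c1 & 0x3F) << 6) | (c2 & 0x3F));
            i += 3;
        }
    }
    return out;
}

// UCS-2 -> UTF-8: encodes each code point as 1, 2 or 3 bytes.
UTF8String ConvertToUTF8( const UCS2String &str )
{
    UTF8String out;
    for (const wchar_t wc : str)
    {
        const unsigned int cp = static_cast<unsigned int>(wc) & 0xFFFF;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}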

That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.

Alternatives & Improvements

  • conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso8859_1[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS-2.

  • if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)

ICU or other unicode libraries?

For advanced stuff.

Solution 4

  1. When you want to have wide characters stored in your string. How wide depends on the implementation: Visual C++ defaults to 16 bits if I remember correctly, while GCC's default depends on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with Unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store Unicode strings fine in std::string using the UTF-8 encoding too. But it won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle UTF-8.

    If your wchar_t is 32 bits long, then you can use UTF-32 as a Unicode encoding, and you can store and handle Unicode strings using a fixed-length (UTF-32) encoding. This means your wstring's s.size() function will then return the right number of wchar_t elements and logical characters (see the short sketch after this list).

  2. Yes, char is always at least 8 bits long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
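
A tiny sketch illustrating point 1 (my addition; it assumes a platform where wchar_t is 32 bits wide, such as a typical Linux/GCC setup):

#include <cassert>
#include <string>

int main()
{
    const std::string  u8  = "ol\xC3\xA9";  // "olé" encoded as UTF-8
    const std::wstring u32 = L"ol\u00E9";   // same text, one wchar_t per code point

    assert(u8.size()  == 4);   // number of char elements (bytes), not logical characters
    assert(u32.size() == 3);   // number of wchar_t elements == code points here
}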

Solution 5

I frequently use std::string to hold UTF-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use UTF-8 as the native string type as well.

For example, I use UTF-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the string.

Comments

  • Rapptz
    Rapptz almost 2 years

    I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

    1. When should I use std::wstring over std::string?
    2. Can std::string hold the entire ASCII character set, including the special characters?
    3. Is std::wstring supported by all popular C++ compilers?
    4. What is exactly a "wide character"?
    • MSalters
      MSalters over 15 years
      The ASCII character set doesn't have a lot of "special" characters, the most exotic is probably ` (backquote). std::string can hold about 0.025% of all Unicode characters (usually, 8 bit char)
    • Zonko
      Zonko almost 13 years
      If by "special" you mean the characters from 128 to 255, that depend on the norm used, then yes they are supported.
    • Yariv
      Yariv about 12 years
      Good information about wide characters and which type to use can be found here: programmers.stackexchange.com/questions/102205/…
    • Pavel Radzivilovsky
      Pavel Radzivilovsky almost 12 years
      Well, and since we are in 2012, utf8everywhere.org was written. It pretty much answers all questions about rights and wrongs with C++/Windows.
    • Yakov Galka
      Yakov Galka almost 12 years
      @MSalters: std::string can hold 100% of all Unicode characters, even if CHAR_BIT is 8. It depends on the encoding of std::string, which may be UTF-8 on the system level (like almost everywhere except for windows) or on your application level. Native narrow encoding doesn't support Unicode? No problem, just don't use it, use UTF-8 instead.
    • nickolay
      nickolay almost 11 years
      Concerning WinAPI-based applications it's very inconvenient to use std::string because you'll lose on conversions (UNICODE <-> ANSI) which happen very often. Of course, you can use ANSI aliases of WinAPI functions but they are only macros which implicitly convert your ANSI-encoded arguments to UNICODE ones and call the "real" API code that is ALL UNICODE based (refer to J.Richter "Programming Windows" 5th ed.)
    • Timothy Shields
      Timothy Shields over 10 years
      Great reading on this topic: utf8everywhere.org
  • Admin
    Admin over 15 years
    2. A std::string can hold a NULL character just fine. It can also hold UTF-8 and wide characters.
  • Admin
    Admin over 15 years
    @Juan : That put me into confusion again. If std::string can keep unicode characters, what is special with std::wstring?
  • Admin
    Admin over 15 years
    Juan : Do you mean that std::string can hold all unicode characters but the length will report incorrectly? Is there a reason that it is reporting incorrect length?
  • Admin
    Admin over 15 years
    When using the utf-8 encoding, a single unicode character may be made up of multiple bytes. This is why utf-8 encoding is smaller when using mostly characters from the standard ascii set. You need to use special functions (or roll your own) to measure the number of unicode characters.
  • Mr Fooz
    Mr Fooz over 15 years
    std::string can hold 0 just fine (just be careful if you call the c_str() method)
  • Greg D
    Greg D over 15 years
    @Appu: std::string can hold UTF-8 unicode characters. There are a number of unicode standards targeted at different character widths. UTF-8 is 8 bits wide. There's also UTF-16 and UTF-32 at 16 and 32 bits wide respectively
  • Admin
    Admin over 15 years
    With a std::wstring. Each unicode character can be one wchar_t when using the fixed length encodings. For example, if you choose to use the joel on software approach as Greg links to. Then the length of the wstring is exactly number of unicode characters in the string. But it takes up more space
  • Greg Domjan
    Greg Domjan over 15 years
    (Windows specific) Most functions will expect that a string using bytes is ASCII and 2 bytes is Unicode, older versions MBCS. Which means if you are storing 8 bit unicode that you will have to convert to 16 bit unicode to call a standard windows function (unless you are only using ASCII portion).
  • Admin
    Admin over 15 years
    I didn't mean to offend. But I didn't agree with your answers to both 1 and 2. I can see from Joel's argument why you may want to use wchar_t when working on a windows system. However, a regular char works just as well for i18n.
  • Admin
    Admin over 15 years
    As Greg and Joel (on software) mention, it is really important to understand how the encoding works with the API you are dealing with. Constantly changing back and forth between 8 and 16 bit encoding on a windows system may not be optimal.
  • josesuero
    josesuero over 15 years
    And strictly speaking, a char isn't guaranteed to be 8 bits. :) Your link in #4 is a must-read, but I don't think it answers the question. A wide character is strictly nothing to do with unicode. It is simply a wider character. (How much wider depends on OS, but typically 16 or 32 bit)
  • Johannes Schaub - litb
    Johannes Schaub - litb over 15 years
    yes, jalf. c89 specifies minimal ranges for basic types in its documentation of limits.h (for unsigned char, that's 0..255 min), and a pure binary system for integer types. it follows char, unsigned char and signed char have minimum bit lengths of 8. c++ inherits those rules.
  • gnud
    gnud over 15 years
    Hum. I didn't know that windows did not follow the POSIX spec in this regard. POSIX says that a wchar_t must be able to represent "distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment".
  • paercebal
    paercebal over 15 years
    @gnud: Perhaps wchar_t was supposed to be enough to handle all UCS-2 chars (most UTF-16 chars) before the advent of UTF-16... Or perhaps Microsoft did have other priorities than POSIX, like giving easy access to Unicode without modifying the codepaged use of char on Win32.
  • Pavel Radzivilovsky
    Pavel Radzivilovsky over 14 years
    @dave: I don't know what headache UTF-8 creates which is greater than that of widechars (UTF-16). In UTF-16, you also have multi-unit characters.
  • sorin
    sorin over 14 years
    Your response does explain very well the differences between the two alternatives. Remark: UTF-8 can take 1-6 bytes and not 1-4 like you wrote. Also I would like to see people's opinions on the two alternatives.
  • paercebal
    paercebal over 14 years
    @Sorin Sbarnea: UTF-8 could take 1-6 bytes, but apparently the standard limits it to 1-4. See en.wikipedia.org/wiki/UTF8#Description for more information.
  • Logan Capaldo
    Logan Capaldo almost 14 years
    "This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters." This is not entirely accurate, even for Unicode. It would be more accurate to say codepoint than "logical character", even in UTF-32 a given character may be composed of multiple codepoints.
  • WolfgangP
    WolfgangP almost 14 years
    Compiling and executing your code on Mac OS X gives the same output as on your linux machine.
  • paercebal
    paercebal almost 14 years
    @Wolfgang Plaschg : Thanks for the info. This is not unexpected, as MacOS X is a Unix, so this seems natural they went the way "char is a UTF-8" for Unicode support... AFAIK, the only reasons Windows did not follow the same road was to continue support for pre-Unicode charset-based old apps.
  • Mihai Nita
    Mihai Nita almost 13 years
    @paercebal UTF-8 cannot take 6 bytes. Exactly because the standard limits it to 4 bytes. The standard defines things, so 6 bytes means it is not UTF-8 anymore, by definition.
  • paercebal
    paercebal almost 13 years
    @Mihai Nita : UTF-8 cannot take 6 bytes. Exactly because the the standard limits it to 4 bytes. . I agree. I agree so much with you I did already write that in a previous comment : @Sorin Sbarnea: UTF-8 could take 1-6 bytes, but apparently the standard limits it to 1-4. ... ^_^ ... I guess the point of my remark was to remind that the limitation to 4 was artificial, that the encoding used by UTF-8 could support up to 6 bytes for a 1-byte char, even if the standard decided to limit it to 4.
  • Jim Michaels
    Jim Michaels over 12 years
    I want to do #include <stdlib.h> std::wstring ws; ws += wchar(2591); /*25% shade character */ std::wcout<<ws; but this gets me empty output. HOW do I put in a specific large unicode char number into a wstring and output it?
  • paercebal
    paercebal over 12 years
    @Jim Michaels : You're trying to output the character x0A1F (Gurmukhi). A wchar_t is able to contain that character, so your string is correct. If the wcout output is not correct, it may be because the font used by the output console is not ready for the Gurmukhi symbols ( unicode.org/charts/PDF/U0A00.pdf )
  • John Leidegren
    John Leidegren over 11 years
    While this example produces different results on Linux and Windows, the C++ program contains implementation-defined behavior as to whether olé is encoded as UTF-8 or not. Furthermore, the reason you cannot natively stream wchar_t * to std::cout is because the types are incompatible, resulting in an ill-formed program, and it has nothing to do with the use of encodings. It's worth pointing out that whether you use std::string or std::wstring depends on your own encoding preference rather than the platform, especially if you want your code to be portable.
  • paercebal
    paercebal over 11 years
    @JohnLeidegren : While this example produces different results on Linux and Windows the C++ program contains implementation-defined behavior as to whether olé is encoded as UTF-8 or not. : Yes. Indeed, the point was to show that. Furthermore, the reason you cannot natively stream wchar_t * to std::cout is because the types are incompatible resulting in an ill-formed program and it has nothing to do with the use of encodings. : Indeed. I was giving the multiple combinations, and if not possible, explaining why in the code, for completeness' sake, not making the point you suggest...
  • paercebal
    paercebal over 11 years
    @John Leidegren : It's worth pointing out that whether you use std::string or std::wstring depends on your own encoding preference rather than the platform : Indeed. But then, if the constraints are "use unicode, while not using 4 bytes for each character", the platform pretty much limits your options, that is, std::wstring on Windows, and std::string on Linux... (You could try to use an UTF-8 std::string on Windows, but then, your UTF-8 strings would not be understood by the WinAPI using char * characters.)
  • John Leidegren
    John Leidegren over 11 years
    @paercebal Whatever the platform supports is entirely arbitrary and besides the point. If you store all strings internally as UTF-8 on Windows you'll have to convert them to either ANSI or UTF-16 and call the corresponding Win32 function but if you know your UTF-8 strings are just plain ASCII strings you don't have to do anything. The platform doesn't dictate how you use strings as much as the circumstances.
  • paercebal
    paercebal over 11 years
    @John Leidegren : Of course the platform dictates how you use the strings. On Windows, you have no choice: char strings have a specific codepage/encoding, so how you use the std::string, either by writing convertors, or by using codepage specific functions, must be decided. As for std::wstring, unless you use a conversion interface, you know the encoding must be the Windows version of UTF-16 (last time I checked, it was UCS-2), thus how you interpret the characters in that context. As I see this, this is "how", not "circumstances". But let's not lose time on vocabulary...
  • John Leidegren
    John Leidegren over 11 years
    Windows actually uses UTF-16 and has been for quite some time; older versions of Windows did use UCS-2 but this is not the case any longer. My only issue here is the conclusion that std::wstring should be used on Windows because it's a better fit for the Unicode Windows API, which I think is fallacious. If your only concern was calling into the Unicode Windows API and not marshalling strings then sure, but I don't buy this as the general case.
  • paercebal
    paercebal over 11 years
    @ John Leidegren : If your only concern was calling into the Unicode Windows API and not marshalling strings then sure : Then, we agree. I'm coding in C++, not JavaScript. Avoiding useless marshalling or any other potentially costly processing at runtime when it can be done at compile time is at the heart of that language. Coding against WinAPI and using std::string is just an unjustified waste of runtime resources. You find it fallacious, and it's OK, as it is your viewpoint. My own is that I won't write code with pessimization on Windows just because it looks better from the Linux side.
  • Yakov Galka
    Yakov Galka over 11 years
    @gnud: see this great answer for why the POSIX requirement (in fact it is C++ requirement) does not violate the use of variable length encoding.
  • Mihai Danila
    Mihai Danila over 10 years
    Are you guys in essence saying that C++ doesn't have native support for the Unicode character set?
  • Mihai Danila
    Mihai Danila over 10 years
    Dang, it's not good to know that native Unicode support isn't there.
  • Mihai Danila
    Mihai Danila over 10 years
    Not only will a std::string report the length incorrectly, but it will also output the wrong string. If some Unicode character is represented in UTF-8 as multiple bytes, which std::string thinks of as its own characters, then your typical std::string manipulation routines will probably output the several strange characters that result from the misinterpretation of the one correct character.
  • Kusavil
    Kusavil over 9 years
    If I want to make program (working on windows) that will be freely using many different Unicode symbols, like Japanese / Chinese characters, Polish letters, Cyrillic, etc., what should I use? Will UTF-8 be enough?
  • mamakurka
    mamakurka over 9 years
    As a slight correction, UTF-16 encoding can take either 2 OR 4 bytes per character. (see unicode.org/faq/utf_bom.html#gen6)
  • paercebal
    paercebal over 9 years
    @lfalin : Indeed. The first time I speak about wide characters on Windows, I describe how Windows was not quite clear (at least, to me) about how it handled "unicode" (what is UCS-2 or UTF-16?). The second time, I write about the size of a character: "All in all, UTF-16 will mostly use 2 bytes per characters (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.", which is more or less what you're saying (the keyword being "mostly"). I guess what should be clarified in my answer is Windows' stance on the subject.
  • Caroline Beltran
    Caroline Beltran over 9 years
    @Frunsi, I'm curious to know if you've tried Glib::ustring and if so, what are your thoughts?
  • Frunsi
    Frunsi over 9 years
    @CarolineBeltran: I know Glib, but I never used it, and I probably will never even use it, because it is rather limited to a rather unspecific target platform (unixoid systems...). Its windows port is based on external win2unix-layer, and there IMHO is no OSX-compatibility-layer at all. All this stuff is directing clearly into a wrong direction, at least for my code (on this arch level...) ;-) So, Glib is not an option
  • StarShine
    StarShine over 9 years
    What @Mihai Danila said. I strongly recommend against using std::string for utf-8, especially when doing frequent string operations like concatenation and sub-string. Widestrings can take a lot of space, but if you are serious about software products and data in a multilingual and multicultural world, the use of std::string is becoming archaic, and trying to use it just litters the code in all kinds of odd places with functions that 'look correct' for most of the time. I've been in game development for nearly 10 years, on many different platforms, so I know what I'm saying.
  • Mihai Danila
    Mihai Danila over 9 years
    I suggest changing the answer to indicate that strings should be thought of as only containers of bytes, and, if the bytes are some Unicode encoding (UTF-8, UTF-16, ...), then you should use specific libraries that understand that. The standard string-based APIs (length, substr, etc.) will all fail miserably with multibyte characters. If this update is made, I will remove my downvote.
  • StarShine
    StarShine over 9 years
    I think points 2 and 3 are yelling to NOT use std::string for utf8. IF you still want to save on the memory, then subclass std::string so that you get at least asserts and warnings when you use substr, concat and length, and basically any content-perturbing string operation functionality. Personally I advise to use wstrings for unicode strings, regardless if you settle on utf8, 16 or 32, or ucs-2. You'll have a much easier time doing IO with those. Even UI components nowadays deal properly with unicode strings, so the downconversion should only be necessary when dealing with older components.
  • Frunsi
    Frunsi over 9 years
    @StarShine & @CarolineBeltran: Maybe... But subclassing std::string results in just another view on the problem, which is just another wrong kind of "std::string", as std::string itself already is. A comprehensive solution would contain of an std::string that differs between memory layout issues and character sequence issues. So, for a start, for example, an std::string should have a method size() and a method nchars().
  • Frunsi
    Frunsi over 9 years
    BTW: Neither C++11 nor C++14, nor any future standard, nor anyone else has yet cared about that issue. So, I18N in C++ is still a thing where solutions are still expected...
  • Frunsi
    Frunsi over 9 years
    Oh and @StarShine: read the full answer please. It is not as easy as you may think.
  • StarShine
    StarShine over 9 years
    @Frunsi: Ah, maybe I missed it. How does your "UTF8String" typedef bring about a comprehensive solution that differs between memory layout issues and character sequence issues? It's a refactoring tool at best, but not a solution. Firstly, good luck forcing nchars() into the standard. Secondly, how sure can you really be that 3rd party libs are not chopping up your utf8 sequences? Finally, utf8 is harder to parse and debug. If you use wstring and ucs2 or proper utf16 from the start, your debugger will display the correct Chinese string, without you having to puzzle it together from byte codes.
  • Frunsi
    Frunsi over 9 years
    @StarShine: A UTF8String typedef is not a comprehensive solution. It is just a pragmatic solution that works (in most cases, most of the time). IMHO it is time for the C++ standards people to provide a better solution. The basics (Unicode and its different encoding schemes, such as UTF-8 and UCS-2) are here and here to stay, so it is the right time now ;-)
  • Frunsi
    Frunsi over 9 years
    @StarShine: Please also note, that my solution will have the same issues as UCS-2, e.g. when working with chinese strings! So, this is really just a pragmatical thing, no comprehensive solution.
  • Daniel
    Daniel over 9 years
    Search, replace, and so on works just fine on UTF-8 strings (a part of the byte sequence representing a character can never be misinterpreted as another character). In fact, UTF-16 and UTF-32 don't make this any easier at all: all three encodings are multibyte encodings in practice, because a user-perceived character (grapheme cluster) can be any number of unicode codepoints long! The pragmatic solution is to use UTF-8 for everything, and convert to UTF-16 only when dealing with the Windows API.
  • Frunsi
    Frunsi over 9 years
    @Daniel: Why do you think a pragmatical solution would use UTF-8 for everything? Single-Byte Search & Replace code may not do much harm on UTF-8 byte sequences, but it will not solve actual problems either :P Using UTF-8 for "everything" is the wrong path for anyone... Using UTF-8 for storage & transfer is fine, but using it for processing strings will result in exponential growth of required code to handle all cases & combinations. Maybe. But maybe all character-based operations can be rewritten to work on graphemes? Probably not, right? So...
  • Frunsi
    Frunsi over 9 years
    @Daniel: "Search, replace and so on" will NOT just work fine on UTF-8 strings, unfortunately it is much more complicated, see e.g. utf8everywhere.org/#myth.strlen - and of course UTF-16 and UTF-32 don't make this easier. So?
  • Daniel
    Daniel over 9 years
    @Frunsi: Search and replace works just as fine with UTF-8 as with UTF-32. It's precisely because proper Unicode-aware text processing needs to deal with multi-codepoint 'characters' anyways, that using a variable length encoding like UTF-8 doesn't make string processing any more complicated. So just use UTF-8 everywhere. Normal C string functions will work fine on UTF-8 (and correspond to ordinal comparisons on the Unicode string), and if you need anything more language-aware, you'll have to call into a Unicode library anyways, UTF-16/32 can't save you from that.
  • Deduplicator
    Deduplicator over 9 years
    "But it won't understand the meaning of unicode code points." On windows, neither does std::wstring.
  • Climax
    Climax almost 9 years
    interesting to note that if you do a cout before the wcout the unicode characters don't print with wcout. If, however, you start with wcout, the cout's don't even print at all, and all unicode prints print correctly. Almost as if some internal state is kept in the libs?
  • Deduplicator
    Deduplicator over 8 years
    @paercebal: Just a note: One of those exotic languages is Chinese btw. Thus the PRC decided to make support for some codepoints outside the BMP mandatory quite some time ago.
  • Deduplicator
    Deduplicator over 8 years
    So, paraphrasing the first paragraph: Application needing more than 256 characters need to use a multibyte-encoding or a maybe_multibyte-encoding.
  • Seppo Enarvi
    Seppo Enarvi over 8 years
    Generally 16 and 32 bit encodings such as UCS-2 and UCS-4 are not called multibyte encodings, though. The C++ standard distinguishes between multibyte encodings and wide characters. A wide character representation uses a fixed number of bits (generally more than 8) per character. Encodings that use a single byte to encode the most common characters, and multiple bytes to encode the rest of the character set, are called multibyte encodings.
  • Deduplicator
    Deduplicator over 8 years
    Sorry, sloppy comment. Should have said variable-length encoding. UTF-16 is a variable-length-encoding, just like UTF-8. Pretending it isn't is a bad idea.
  • Seppo Enarvi
    Seppo Enarvi over 8 years
    That's a good point. There's no reason why wstrings couldn't be used to store UTF-16 (instead of UCS-2), but then the convenience of a fixed-length encoding is lost.
  • Swift - Friday Pie
    Swift - Friday Pie over 7 years
    The problem is that if you're anywhere but an English-speaking country you OUGHT to use wchar_t. Not to mention that some alphabets have way more characters than you can fit into a byte. We were there, on DOS. Codepage schizophrenia, no, thanks, no more..
  • Piotr Findeisen
    Piotr Findeisen over 7 years
    "when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready." -- this should go with a BIG warning "never truncate, limit, take char-at" your strings. This can be understood from the whole answer, but should be made super-clear.
  • Michele
    Michele about 7 years
    What makes this a wchar_t[]?
  • Michele
    Michele about 7 years
    {0x42, 0x65, 0x6E, 0x6A, 0x61, 0x6D, 0xED, 0x6E, 0x20, 0x70, 0x69, 0x64, 0x69, 0xF3, 0x20, 0x75, 0x6E, 0x61, 0x20, 0x62, 0x65, 0x62, 0x69, 0x64, 0x61, 0x20, 0x64, 0x65, 0x20, 0x6B, 0x69, 0x77, 0x69, 0x20, 0x79, 0x20, 0x66, 0x72, 0x65, 0x73, 0x61, 0x3B, 0x20, 0x4E, 0x6F, 0xE9, 0x2C, 0x20, 0x73, 0x69, 0x6E, 0x20, 0x76, 0x65, 0x72, 0x67, 0xFC, 0x65, 0x6E, 0x7A, 0x61, 0x2C, 0x20, 0x6C, 0x61, 0x20, 0x6D, 0xE1, 0x73, 0x20, 0x65, 0x78, 0x71, 0x75, 0x69, 0x73, 0x69, 0x74, 0x61, 0x20, 0x63, 0x68, 0x61, 0x6D, 0x70, 0x61, 0xF1, 0x61, 0x20, 0x64, 0x65, 0x6C, 0x20, 0x6D, 0x65, 0x6E, 0xFA, 0x2E, 0x00};
  • underscore_d
    underscore_d almost 7 years
    Until this stunning oversight in the language is rectified, check out Glib::ustring, an actually intelligent wrapper around std::string from the glibmm project, which wraps the normal string methods with proper awareness of the number of displayable characters (not encoding bytes/chars) in the string.
  • underscore_d
    underscore_d almost 7 years
    @MihaiDanila That depends on how you define "native support". Can it store Unicode character sequences? Absolutely. Does it provide any standard class that can operate on such sequences in terms of the number of displayed characters therein, rather than just naively indexing/finding/etc by numbers of bytes, thereby possibly breaking up sequences of codepoints and getting things horribly wrong? No. And that's awful. This is 2017. I can only hope that, since we're finally getting standard filesystem and network support, maybe actual Unicode strings are faintly visible somewhere over the horizon.
  • underscore_d
    underscore_d almost 7 years
    @Swift The problem with wchar_t is that its size and meaning are OS-specific. It just swaps the old problems with new ones. Whereas a char is a char regardless of OS (on similar platforms, at least). So we might as well just use UTF-8, pack everything into sequences of chars, and lament how C++ leaves us completely on our own without any standard methods for measuring, indexing, finding etc within such sequences.
  • Swift - Friday Pie
    Swift - Friday Pie almost 7 years
    @underscore_d What you describe is the smallest of problems if you code in C++. The wide character wchar_t is a fundamental type in C++, but not in C, but its binary representation is not platform-defined how you describe; it's runtime. So a character can be 1 byte or 2 bytes long (at least) depending on what actual string is stored. Unicode UTF-16 are fixed size characters. Thing is, wchar_t is the type supported for certain platforms on the level of file system names (including Windows), while other platforms use multibyte characters
  • underscore_d
    underscore_d almost 7 years
    @Swift You seem to have it completely backwards. wchar_t is a fixed-width data type, so an array of 10 wchar_t will always occupy sizeof(wchar_t) * 10 platform bytes. And UTF-16 is a variable-width encoding in which characters may be made up of 1 or 2 16-bit codepoints (and s/16/8/g for UTF-8).
  • Mihai Danila
    Mihai Danila almost 7 years
    @underscore_d Support for storing encoded Unicode codepoints into bytes is hardly notable as "support". And, yes, I agree that the absence of standard Unicode support in this language in the 21st century is laughable.
  • QuesterZen
    QuesterZen almost 7 years
    There do not seem to be any good options in standard C++ for cross-platform, international use. I recently wrote a text-driven GUI interface for a program that with custom line-breaks, semantic tagging, international characters... After researching multiple approaches, I chose std::strings using UTF-8 to store the text data, but writing a library of functions to map between characters and bytes, to perform common string functions such as text insertion, extraction and search, and to perform conversions to other formats for i/o. I came here to see if there was now a better way, it seems not.
  • Mooing Duck
    Mooing Duck almost 7 years
    @Michele: Nothing, that's just a sequence of bytes. It can't be interpreted as UTF8, but does appear interpretable as UTF16. Or any of a thousand code pages.
  • Stuntddude
    Stuntddude over 6 years
    @paercebal I realize this comment thread is as old as time itself, but insisting on matching WinAPI string format for performance reasons is just silly. The cost of the API calls themselves will dwarf conversion costs; the performance cost of the extra storage required for UTF-16 strings will probably negate any potential conversion-related gains; and if you communicate with other APIs, you'll likely need to do conversions anyway. See utf8everywhere.org/#faq.cvt.perf for an example.
  • Steve Hollasch
    Steve Hollasch over 6 years
    @Swift Sorry, that's wrong, at least for wchar_t on Windows. On Windows, a wchar_t is a UTF-16 encoding. Simple test: wchar_t *test = L"𠀀"; // Code point U+20000 In the debugger, you'll see a string of two values: 0xD840, and 0xDC00, which is the UTF-16 encoding of the character.
  • Swift - Friday Pie
    Swift - Friday Pie over 6 years
    @SteveHollasch you saved UTF-16 to it, so you get it. It's a compiler-dependent primitive type that doesn't cast or limit what you try to assign to it. How the API and compiler would treat it is undefined; in general it is not the same representation as ANY Unicode. wchar_t as defined by the Windows API is 16 bits per character. So what you have is a surrogate - two characters with codes 0X00DC and 0x40D8. But code that would treat that as a unicode array would act properly, you just would have a hard time determining if it is 2 characters or one. On Linux wchar_t is 32 bits, your code will not cause a problem
  • Swift - Friday Pie
    Swift - Friday Pie over 6 years
    @SteveHollasch wchar_t representation of a string on Windows would encode characters greater than FFFF as a special surrogate pair, others would take only one wchar_t element. So that representation will not be compatible with the representation created by the GNU compiler (where all characters less than FFFF will have a zero word in front of them). What is stored in wchar_t is determined by the programmer and compiler, not by some agreement
  • Ruslan
    Ruslan over 5 years
    @MihaiDanila at least we do have std::codecvt<charNN_t, char> etc. since C++11 for conversion between UTF-NN and UTF-8. Though, std::wstring_convert is deprecated since C++17...
  • Roi Danton
    Roi Danton over 5 years
    Regarding "However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?." -> I don't think that this is true when the compiler uses UTF-8 encoding (use /utf-8).
  • Roi Danton
    Roi Danton over 5 years
    For a Windows program which gets its input as UTF-8 encoded strings, there is no point in converting everything to wchar_t. Only convert on direct interaction with WinAPI. As long as the compiler works with UTF-8 encoding, I see no point in preferring wchar_t over char. As usual, it depends on the requirements.
  • Phil Rosenberg
    Phil Rosenberg over 5 years
    I was not aware of this as an option. From this link docs.microsoft.com/en-us/cpp/build/reference/… it seems there is no tick box to select in in project properties, you must add it as an additional command line option. Good spot!
  • Aaron Franke
    Aaron Franke about 5 years
    How does std::string work with UTF-8? I thought that std::string uses char, which is only 1 byte?
  • Sean McMillan
    Sean McMillan over 4 years
    re: point 5, using 16-bit wide chars for string manipulation is simple, fast easy... and WRONG. Because, despite this answer saying that they're UCS-2, many environments are actually UTF-16, which means you have to deal with surrogates. And even without surrogates, you have to deal with combining characters. wchar doesn't protect you from any of that. Sadly, the real answer is "text is hard and complicated; learn how it really works."
  • Mr. Boy
    Mr. Boy about 4 years
    Thanks for reminding us the underlying truth: strings are horrible in C/C++
  • DUzun
    DUzun over 3 years
    Here is my explanation of string encodings in the context of JavaScript: github.com/duzun/string-encode.js/blob/master/…
  • Paul
    Paul over 3 years
    "A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!" - can you provide a source for this claim?
  • jrh
    jrh over 3 years
    I don't understand something, if I have a cross platform program, does that mean I need to make an abstraction layer over std::string for text that is localized, that would e.g., turn into std::string on Linux and std::wstring on Windows?
  • jrh
    jrh over 3 years
    I think your idea of using wstring only on API calls is interesting, but I am a bit confused about getting data in to the program; right now I am using a stringstream to pipe the data from a fstream into, is it safe to assume that the C++ standard library is capable of detecting that a text file is UTF-8 and will construct a string in the right encoding automatically? Or will it interpret the text file as 8 bit chars and return garbled text? Do the standards say anything about this?
  • Mooing Duck
    Mooing Duck about 2 years
    @jrh": The C++ standard library does not check file types or handle encodings. If you stream a UTF8 file into a std::string, you'll end up with a std::string that contains UTF8, with the pros and cons that entails. if you stream a UTF8 file into a std::wstring, then you end up with garbage. (Similarly, streaming a UTF16 file into a std::string produces garbage, but std::wstring would be valid, at least on Windows)
  • jrh
    jrh about 2 years
    @MooingDuck yes, I later found that to be the case. On a related note one of the very unfortunate parts of the standard library is that exception messages are always char* not wchar*, which is unfortunate in Windows if your exception message has to e.g., include a unicode file name / key / etc., or "Failed to parse '견고한 논리' as integer". That does add to the reasoning of "use UTF-8 as much as possible" because if you used wchars for most of the program instead you'd have to convert to UTF-8 to store an exception message, and that conversion itself can sadly throw an exception.
  • benrg
    benrg almost 2 years
    An important reason not to do this conversion is that WCHAR strings can contain unpaired surrogates. Filenames with unpaired surrogates exist in the wild (Cygwin uses them, for instance), but are rare enough that they may be missed in testing. A malicious party could create one to crash your program, or even do worse if, e.g., a failed conversion doesn't write a terminating NUL. You can work around this by using a UTF-8 compatible encoding that can roundtrip surrogates, but many Unicode libraries don't provide that, and of course it isn't UTF-8 so it violates your UTF-8 everywhere advice.
  • buzz3791
    buzz3791 almost 2 years
    @paercebal Really nice answer, but much is changing with C++20 char8_t and in Windows... "New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization." docs.microsoft.com/en-us/windows/win32/intl/code-pages "Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. [...]" docs.microsoft.com/en-us/windows/apps/design/globalizing/…