Strings and character encoding in C++

c++ string unicode utf-8 character-encoding

32,742

Solution 1

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple mapping from code points to bytes and back), while the main difficulty is actually manipulating the data.
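
To illustrate how mechanical the code point to bytes direction is, here is a minimal sketch of a UTF-8 encoder. The function name `append_utf8` is hypothetical and not part of any library mentioned in this thread; it only follows the standard UTF-8 bit layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (an illustration, not from the answer): append one
// Unicode code point to a UTF-8 byte string. Returns false for values that
// are not valid scalar values (surrogates, or anything above U+10FFFF).
bool append_utf8(std::string& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Calling `append_utf8(s, 0x20AC)` appends the three bytes E2 82 AC (the euro sign). Nothing in it knows anything about what the text means, which is the point being made above.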

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
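
A small sketch of both failure modes, assuming the std::string holds valid UTF-8 and the std::u32string holds UTF-32. Both function names are hypothetical, and the second check is deliberately simplified: it only recognizes the Combining Diacritical Marks block (U+0300 to U+036F), not the full Unicode rules.

```cpp
#include <cstddef>
#include <string>

// True if the byte at index pos is a UTF-8 continuation byte (10xxxxxx),
// meaning that cutting the string at pos would split a multi-byte character.
bool splits_utf8_character(const std::string& s, std::size_t pos) {
    return pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80;
}

// True if the code point at index pos is a combining diacritic, meaning that
// cutting at pos would separate it from the base character before it.
// Simplification: only U+0300..U+036F is treated as combining here.
bool splits_combining_sequence(const std::u32string& s, std::size_t pos) {
    return pos < s.size() && s[pos] >= 0x0300 && s[pos] <= 0x036F;
}
```

With the UTF-8 bytes of "café" (precomposed é, two bytes), cutting at byte index 4 lands inside the é. With the UTF-32 string U"cafe\u0301" (e followed by a combining acute accent), cutting at index 4 leaves a bare "cafe" and an orphaned accent, even though every element has a fixed width.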

The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

Solution 2

It's not specified what character encoding must be used for string, wstring, etc. The common way is to use Unicode in wide strings. Which types and encodings to use depends on your requirements.

If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...), choose std::wstring with UCS2/UTF-16 (BMP only) as the encoding on Windows and UCS4/UTF-32 on Linux. The benefit is the fixed size: each character takes 2 bytes (or 4 for UCS4), whereas length() on a std::string holding UTF-8 returns the number of bytes rather than the number of characters.
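
The length() mismatch can be made concrete with a short sketch. The helper `count_code_points` is hypothetical and assumes its input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a string that is assumed to
// hold valid UTF-8. Every code point contributes exactly one byte that is
// not a continuation byte (10xxxxxx), so counting those bytes is enough.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "naïve", `length()` reports 6 because ï occupies two bytes, while `count_code_points` reports 5. Note that this counts code points, which is still not the same as user-perceived characters when combining marks are involved.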

For conversion, you can check whether sizeof(std::wstring::value_type) is 2 or 4 to choose between UCS2 and UCS4. I'm using the ICU library, but there may be simpler wrapper libraries.
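
One way the sizeof check might look, as a sketch. The enum `WideEncoding` and the function `wide_encoding` are hypothetical names introduced for illustration; the size of wchar_t says nothing definitive about the encoding, so this only captures the platform convention described above.

```cpp
#include <string>

// Hypothetical names, for illustration only.
enum class WideEncoding { Utf16OrUcs2, Utf32OrUcs4 };

// Pick the wide encoding from the size of wchar_t at compile time:
// 2 bytes (e.g. Windows) suggests UCS2/UTF-16, 4 bytes (e.g. Linux)
// suggests UCS4/UTF-32. Any other size is rejected at compile time.
constexpr WideEncoding wide_encoding() {
    static_assert(sizeof(std::wstring::value_type) == 2 ||
                  sizeof(std::wstring::value_type) == 4,
                  "unexpected wchar_t size");
    return sizeof(std::wstring::value_type) == 2 ? WideEncoding::Utf16OrUcs2
                                                 : WideEncoding::Utf32OrUcs4;
}
```

Conversion code can then branch on `wide_encoding()` instead of repeating the sizeof test at every call site.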

Deriving from std::string is not recommended because basic_string is not designed for it (it lacks virtual members, etc.). If you really, really, really need your own type like std::basic_string< my_char_type >, write a custom specialization (for example of std::char_traits) for it.

The new C++0x standard defines wstring_convert<> and wbuffer_convert<> to convert with a std::codecvt from a narrow charset to a wide charset (for example UTF-8 to UCS2). Visual Studio 2010 has already implemented this, afaik.
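
A minimal sketch of such a conversion, using UTF-8 to UTF-32 via std::codecvt_utf8 from the header <codecvt>. The function names are hypothetical. These facilities were added in C++11 and deprecated in C++17, so a recent compiler may emit deprecation warnings, and support varies between standard library implementations.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrappers around std::wstring_convert. Both conversions throw
// std::range_error if the input cannot be converted.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

A round trip such as `utf32_to_utf8(utf8_to_utf32(s))` should return the original bytes for valid UTF-8 input.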

Solution 3

The traits approach described here might be helpful. It's an old but useful technique.


Author by

nassar

Updated on March 19, 2020

Comments

nassar about 4 years
I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
```
#include <stdint.h>  // uint32_t
#include <string>
typedef std::string string8;
typedef std::basic_string<uint32_t> string32;
```
The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.

The string32 class would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
nassar over 13 years

I've intentionally avoided UCS-2, because it seems to me that if one is going to the trouble of handling character encoding, one might as well do it right and support full Unicode. (At the same time, I'm looking for something less cumbersome than ICU for general purpose use.) As for UTF-16, it seems to have the disadvantages of both variable length encoding and using lots of memory. That is why I propose using UTF-8 and UTF-32 in combination.
nassar over 13 years

Point taken about deriving from std::string. Thanks!
nassar over 13 years

I think defining a new type is not at all essential, but a lot of people seeing std::string in code will tend to forget about multi-byte characters and incorrectly use character positions. The fact that it is UTF-8 can be conveyed in comments, but having a reminder in the type name seems helpful because methods such as std::string::insert() do suggest 8-bit characters in my opinion.
nassar over 13 years

I just read that C++0x will define u32string as basic_string<char32_t>. So this should be good for UTF-32.
cytrinox over 13 years

Yes, C++0x introduces char16_t and char32_t to have a "wide char" with a specific size on all platforms. Defining a new type string8 is not generally bad, but it may confuse if you write libraries or reusable code. If I need to build a project which includes 3-4 libraries and each lib introduces its own types, I have to deal with lib1::string8, lib2::ustring, lib3::utf8 and my own type (or std::string), and just a look at these types doesn't tell me whether each is just another name for std::string or a completely new and incompatible class which has to be handled in a special way.
cytrinox over 13 years

For completeness, if you only need to convert between the different UTFs and you already use C++0x features, there are a few new codecvts for that, for example codecvt<char16_t, char, mbstate_t> and codecvt<char32_t, char, mbstate_t>, which convert char (UTF-8) to UTF-16/32. Together with std::wstring_convert and std::wbuffer_convert you can easily convert between UTFs without any additional library. If you need to convert other charsets, you can write your own codecvts using iconv() on Linux and MultiByteToWideChar() & Co. on Windows.
nassar over 13 years

Thanks again for all of your suggestions.
nassar over 13 years

I found your comment about diacritics a bit scary. It is in a sense most relevant to what I am trying to do, which is to handle strings "correctly" in a relatively simple way.
Matthieu M. over 13 years

@nassar: unfortunately it's scary because we lack proper support :'(
Steven R. Loomis over 13 years

ICU has (among other interfaces in C++) a C++ string class which interoperates with std::string
Matthieu M. over 13 years

@Steven: icu-project.org/apiref/icu4c/classUnicodeString.html which I consider C-ish in its interface (lots of interaction with unmanaged memory, uses of int32_t where unsigned would be better suited, ...) though as you mention, thanks to StringPiece it can be created quite smoothly from a std::string.
Steven R. Loomis over 13 years

@Matthieu many cases where int32_t is used, '-1' means 'use u_strlen for the length'. Also, UText takes 64 bit text offsets. There isn't "lots of interaction with unmanaged [??] memory" if you let UnicodeString manage the memory.
Matthieu M. over 13 years

@Steven: I should not have used "unmanaged" there, sorry. I was talking about all the char* pointers that are passed, which seems really strange to me. For me char* means that the method MAY modify the buffer passed, and there is no say who is responsible for the memory pointed to (I guess the caller is).
Steven R. Loomis over 13 years

@Matthieu - memory ownership should be clearly documented on these functions, you can file a bug if you find any holes.
nassar over 13 years

@Steven: Do you have insight into why code points are used instead of graphemes as the "default" unit for strings in ICU? Wouldn't using graphemes address the diacritics problem that Matthieu M. described?
Steven R. Loomis over 13 years

@nassar Not code points, code units are the default, because that is the simplest string handling and has a fixed width. But there are other functions that will give user-perceived chars, extended grapheme clusters, etc. I would use the two in conjunction: when you want to break a string, scan forward or back to the next or previous grapheme break and break there.
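
The "scan back to the previous break" idea can be sketched without ICU, under a loud simplification: the helper below treats only the Combining Diacritical Marks block (U+0300 to U+036F) as extending a cluster, which is far from the full extended grapheme cluster rules. Both function names are hypothetical; real code would use a proper break iterator such as ICU's BreakIterator.

```cpp
#include <cstddef>
#include <string>

// Simplified stand-in for a real grapheme-break test (an assumption for this
// sketch): only U+0300..U+036F counts as a combining mark.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Truncate a UTF-32 string to at most max_len code points, scanning back
// from the requested cut so that a base character is never separated from
// the combining marks that follow it.
std::u32string truncate_at_boundary(const std::u32string& s,
                                    std::size_t max_len) {
    if (max_len >= s.size()) return s;
    std::size_t cut = max_len;
    // If the first dropped code point is a combining mark, the cut would
    // fall inside a cluster, so move back to before that cluster's base.
    while (cut > 0 && is_combining_mark(s[cut])) --cut;
    return s.substr(0, cut);
}
```

For U"cafe\u0301!" a requested length of 4 would fall between the e and its accent, so the function backs up and returns U"caf"; a requested length of 5 keeps the accented cluster whole.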
joel over 2 years

links go stale. Answer would be better with content included