C++ strings: UTF-8 or 16-bit encoding?

Solution 1

I would recommend UTF-16 for any kind of data manipulation and UI. The Mac OS X and Win32 APIs use UTF-16, as do wxWidgets, Qt, ICU, Xerces, and others. UTF-8 might be better for data interchange and storage. See http://unicode.org/notes/tn12/.

But whatever you choose, I would definitely recommend against std::string with UTF-8 "only when necessary".

Go all the way with UTF-16 or UTF-8, but do not mix and match; that is asking for trouble.

Solution 2

UTF-16 is still a variable-length encoding (there are more than 2^16 Unicode code points), so you can't do O(1) string indexing operations. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text includes a lot of code points in the U+0800 to U+FFFF range, where UTF-8 needs three bytes but UTF-16 only two, UTF-16 can be a substantial improvement in size. UCS-2 is a fixed-length variation on UTF-16, at the cost of prohibiting all code points at or above 2^16.
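
To make the variable-length point concrete, here is a minimal sketch of stepping over one code point in a UTF-16 string (next_code_point is my own name, not a standard API):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal sketch: decode the code point that starts at index i in a
// UTF-16 string and advance i past it. A high surrogate (0xD800-0xDBFF)
// combines with a following low surrogate (0xDC00-0xDFFF), so one
// "character" may occupy one or two 16-bit units - which is exactly why
// indexing by code point cannot be O(1).
uint32_t next_code_point(const std::u16string& s, std::size_t& i)
{
    char16_t unit = s[i++];
    if (unit >= 0xD800 && unit <= 0xDBFF && i < s.size()) {
        char16_t low = s[i];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            ++i;
            return 0x10000 + ((uint32_t(unit) - 0xD800) << 10)
                           + (uint32_t(low) - 0xDC00);
        }
    }
    return unit; // BMP code point (or an unpaired surrogate, passed through)
}
```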

Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.

Solution 3

Honestly, I have never found any reason to use anything other than UTF-8.

Solution 4

If you decide to go with UTF-8 encoding, check out this library: http://utfcpp.sourceforge.net/

It may make your life much easier.
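
For a rough idea of what it buys you, here is a sketch using the library's utf8::distance, utf8::next, and utf8::utf8to16 helpers (check the exact signatures against the release you download):

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include "utf8.h"  // from the utfcpp library

int main()
{
    std::string text = "caf\xC3\xA9";  // "café" as raw UTF-8 bytes

    // Count code points, not bytes: 4 here, although text.size() is 5.
    std::ptrdiff_t chars = utf8::distance(text.begin(), text.end());

    // Walk the string one code point at a time.
    for (std::string::iterator it = text.begin(); it != text.end(); ) {
        uint32_t cp = utf8::next(it, text.end());  // decode and advance
        (void)cp;  // ... do something with the code point
    }

    // Bridge to a UTF-16 API when you have to.
    std::u16string utf16;
    utf8::utf8to16(text.begin(), text.end(), std::back_inserter(utf16));

    return chars == 4 ? 0 : 1;
}
```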

Solution 5

I've actually written a widely used application (5 million+ users), so every kilobyte used adds up, literally. Despite that, I just stuck to wxString. I've configured it to be derived from std::wstring, so I can pass strings to functions expecting a wstring const&.

Please note that std::wstring is native Unicode on the Mac (no surrogate pairs needed for characters above U+FFFF), and therefore it uses 4 bytes per wchar_t. The big advantage of this is that i++ gets you the next character, always. On Win32 that is true in only 99.9% of cases. As a fellow programmer, you'll understand how little 99.9% is.

But if you're not convinced, write the function to uppercase a std::string holding UTF-8 and a std::wstring (a sketch of the wstring half follows). Those two functions will tell you which way insanity lies.
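
Here is a minimal sketch of the wstring half, assuming a 4-byte wchar_t (the to_upper name is mine):

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// The std::wstring half is a near one-liner wherever wchar_t holds a whole
// code point (e.g. 4-byte wchar_t on the Mac). Even then it is only an
// approximation: real case mapping is locale- and context-dependent
// (German ss/ß, Turkish dotless i, ...).
std::wstring to_upper(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](wchar_t c) { return wchar_t(std::towupper(c)); });
    return s;
}

// The std::string/UTF-8 half cannot touch bytes directly: every multi-byte
// sequence has to be decoded, case-mapped, and re-encoded, and the result
// may be a different number of bytes. That half is left as the exercise.
```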

Your on-disk format is another matter. For portability, that should be UTF-8: there's no endianness concern in UTF-8, and no debate over width (2 versus 4 bytes). This may be why many programs appear to use UTF-8.
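
A sketch of why the byte-order question disappears (append_utf8 is a hypothetical helper, not part of any library):

```cpp
#include <cstdint>
#include <string>

// UTF-8 output is defined as a byte sequence, so the exact same bytes are
// written on every platform - no BOM or endianness decision, unlike 16- or
// 32-bit on-disk formats.
void append_utf8(uint32_t cp, std::string& out)
{
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}
```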

On a slightly unrelated note, please read up on Unicode string comparisons and normalization. Otherwise you'll end up with the same bug as .NET, where you can have two variables föö and föö that differ only in (invisible) normalization.
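
A minimal illustration of the trap, with the raw UTF-8 bytes spelled out so the difference is visible:

```cpp
#include <iostream>
#include <string>

int main()
{
    // The same visible text "föö" in two normalization forms: NFC uses the
    // precomposed U+00F6, NFD uses 'o' followed by combining U+0308.
    std::string nfc = "f\xC3\xB6\xC3\xB6";
    std::string nfd = "fo\xCC\x88o\xCC\x88";

    // Byte-wise they differ, so a naive comparison says "not equal"; a
    // correct comparison normalizes both sides first (e.g. with ICU).
    std::cout << (nfc == nfd) << '\n';  // prints 0
}
```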

Author by

Carl Seleborg

Software developer in Berlin at Ableton (a really cool place to work!), proud to hack away on our product Live, a music sequencer and performance tool for electronic musicians. I've been programming professionally for about 5 years, and am deeply in love with C++, which, just like the German language, provides enough obscure corners and crazy surprises to make every day a new challenge! I also write a blog (in French) called 5h du matin.

Updated on June 03, 2022

Comments

  • Carl Seleborg
    Carl Seleborg almost 2 years

    I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).

    There are a few wishes/constraints:

    • It would be cool if it could run on modest hardware, such as computers with little memory.
    • I want the code to run on Windows, Mac and (if resources allow) Linux.
    • I'll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined in a corner of the codebase (I will have non-GUI executables).
    • I would like to avoid working with two different kinds of strings when working with user-visible text and with the application's data.

    Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.

    If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?

  • Ben Straub
    Ben Straub over 15 years
    Actually, UTF-16 will fit most living language characters in two bytes; take a look at the code point charts (unicode.org/charts/PDF/) for code points above U+10000; they're all ancient Greek or Roman symbols.
  • MSalters
    MSalters over 15 years
    My team's Mac programmer says wchar_t is 32 bits. And there is certainly a lot of code in our codebase which would break otherwise.
  • paercebal
    paercebal over 15 years
    Note that using UTF-32 on the Mac uses a lot of memory. The 0.1% case you mention means that any wstring on the Mac will be twice as large as the same string in UTF-16 on Windows (I won't even mention Linux's char). This is one of the reasons Linux uses UTF-8 char, and why Windows uses UTF-16 wchar_t.
  • Carl Seleborg
    Carl Seleborg over 15 years
    Just to clarify: with "UTF-8 only when necessary", I actually meant that I would be using some UTF-8 manipulation functions only when I actually needed to deal with characters; all strings would always be UTF-8.
  • Carl Seleborg
    Carl Seleborg over 15 years
    Accepted: I want a clear separation between GUI and data domains. The latter would be all about interchange and storage, so I don't mind the GUI layer converting to UTF-16 wxStrings from UTF-8-encoded std::string objects.
  • davidtbernal
    davidtbernal over 13 years
    You might want to read this question about UTF-16: stackoverflow.com/questions/1049947/…