Storing unicode UTF-8 string in std::string


Solution 1

If you were using C++11 then this would be easy:

std::string msg = u8"महसुस";

But since you are not, you can use escape sequences instead of relying on the source file's charset to manage the encoding for you. This makes your code more portable (for example, if you accidentally save the file in a non-UTF-8 format):

std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
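
To check that the escaped literal really holds the expected bytes, a small sketch like the following may help. The helper names countCodePoints and sampleMessage are made up for illustration, and the counter assumes the input is already well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumption: `s` holds well-formed UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string &s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) // lead byte or ASCII byte starts a new code point
            ++n;
    }
    return n;
}

// Hypothetical helper: returns the same escaped byte sequence as the literal above.
std::string sampleMessage()
{
    return "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8";
}
```

With these, sampleMessage().size() is 15 (bytes) while countCodePoints(sampleMessage()) is 5, because std::string counts bytes, not characters.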

Otherwise, you might consider doing a conversion at runtime instead (this version uses the Win32 function WideCharToMultiByte, so it is Windows-only):

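#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring to a UTF-8 std::string (Windows only).
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // First call: ask for the required buffer size in bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), static_cast<int>(str.length()), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: write the UTF-8 bytes into the buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), static_cast<int>(str.length()), &ret[0], len, NULL, NULL);
    }
    return ret;
}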

std::string msg = toUtf8(L"महसुस");
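
If the Win32 API is not available, one possible alternative is to encode Unicode code points by hand. The sketch below is not equivalent to WideCharToMultiByte (it takes code points rather than UTF-16 and does no validation), and the names appendUtf8 and encodeUtf8 are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Sketch only: surrogates (U+D800..U+DFFF) and values above U+10FFFF
// are not rejected here.
void appendUtf8(std::string &out, unsigned long cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte: 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes: 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes: 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes: 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode an array of code points as one UTF-8 string.
std::string encodeUtf8(const unsigned long *cps, std::size_t count)
{
    std::string out;
    for (std::size_t i = 0; i < count; ++i)
        appendUtf8(out, cps[i]);
    return out;
}
```

For the sample text, the code points would be U+092E, U+0939, U+0938, U+0941 and U+0938, which encode to the same 15 bytes as the escaped literal shown earlier.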

Solution 2

You can enter msg.c_str(),s8 in the Watch window to see the UTF-8 string displayed correctly.

Solution 3

If you have C++11, you can write u8"महसुस". Otherwise, you'll have to write the actual byte sequence, using a \xNN hex escape for each byte in the UTF-8 sequence.

Typically, you're better off reading such text from a configuration file.
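
A minimal sketch of the configuration-file approach, assuming the whole file holds one UTF-8 message; the function name readUtf8File is hypothetical. It reads the raw bytes and drops a leading UTF-8 signature (BOM) if one is present, so the compiler's interpretation of the source file never comes into play:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read an entire file as raw bytes into a std::string.
// Returns an empty string if the file cannot be opened.
std::string readUtf8File(const char *path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Drop a leading UTF-8 BOM (EF BB BF), which some editors write.
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);

    return bytes;
}
```

The bytes are stored unchanged, so the result can be passed to any API that expects UTF-8 in a plain char buffer.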

Solution 4

There is a way to display the right values thanks to the 's8' format specifier. If we append ',s8' to the variable names in the Watch window, Visual Studio reparses the text as UTF-8 and renders it correctly.

If you are using Microsoft Visual Studio 2008 Service Pack 1, you need to apply this hotfix:

http://support.microsoft.com/kb/980263


Author: Pritesh Acharya

I am involved in writing distributed application using ZMQ

Updated on July 09, 2022

Comments

  • Pritesh Acharya
    Pritesh Acharya 6 months

    In response to discussion in

    Cross-platform strings (and Unicode) in C++

    How to deal with Unicode strings in C/C++ in a cross-platform friendly way?

    I'm trying to assign a UTF-8 string to a std::string variable in a Visual Studio 2010 environment:

    std::string msg = "महसुस";

    However, when I view the string in the debugger, I only see "?????". I have the file saved as Unicode (UTF-8 with Signature) and I'm using the character set option "Use Unicode Character Set".

    "महसुस" is Nepali text; it contains 5 characters and will occupy 15 bytes in UTF-8, but the Visual Studio debugger shows the size of msg as 5.

    My question is:

    How do I use std::string to just store the utf-8 without needing to manipulate it?

  • Pritesh Acharya
    Pritesh Acharya over 8 years
    I don't have C++11. What difference does it make to read such text from a configuration file?
  • Pritesh Acharya
    Pritesh Acharya over 8 years
    I'm using Visual Studio 2010, and since I don't have C++11, using the 's8' format specifier gives me a compiler error.
  • DNamto
    DNamto over 8 years
    Try again by adding #pragma execution_character_set("utf-8")
  • Sergey K.
    Sergey K. over 8 years
    @PriteshAcharya: s8 is for UTF-8, su is for multibyte unicode character set.
  • Sergey K.
    Sergey K. over 8 years
    @PriteshAcharya: btw, if you have "use unicode character set" in your configuration, how do you know you are assigning a UTF-8 string?
  • James Kanze
    James Kanze over 8 years
    @PriteshAcharya You free yourself from how the compiler might interpret it. Also: it's necessary if you want to provide several different translations.
  • Pritesh Acharya
    Pritesh Acharya over 8 years
    That didn't help. I get the same result.
  • Pritesh Acharya
    Pritesh Acharya over 8 years
    Actually, I don't know the answer to your question. I got the UTF-8 text from another source and pasted it into the source code, and my file encoding is UTF-8. Isn't that enough to be sure that the assigned string is UTF-8?
  • Sergey K.
    Sergey K. over 8 years
    @PriteshAcharya: if you use multibyte character set in your project - yes, if you don't - no.
  • Pritesh Acharya
    Pritesh Acharya over 8 years
    this is the result of Command windows: >? msg.c_str(),s8 "?????" >? msg.c_str(),su "㼿㼿?坎劲䤪⸭䬩⧌啍噉촀췍﷽﷽ꮫꮫꮫꮫﻮﻮ"
  • Ayxan Haqverdili
    Ayxan Haqverdili over 1 year
    Since C++20, the type of a u8"..." string literal is char8_t const [size], which cannot be implicitly converted to char const*, so it cannot be used to initialize a std::string. Instead, you can either add an explicit cast, as in msg = (char const*)u8"...";, or consider using std::u8string, which is incompatible with std::string and with C APIs that expect a plain char pointer.
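
The cast mentioned in the last comment can be sketched as follows. This is one possible way to write it, assuming a C++11 or later compiler; the function name fromU8Literal is hypothetical. Universal character names (\uXXXX) are used so that the source file itself stays plain ASCII:

```cpp
#include <string>

// Build a std::string from a u8 literal in a way that compiles both
// before C++20 (literal type: const char[N]) and from C++20 onward
// (literal type: const char8_t[N]).
// reinterpret_cast to const char* is permitted because char may be
// used to access the bytes of any object.
std::string fromU8Literal()
{
    return std::string(
        reinterpret_cast<const char *>(u8"\u092E\u0939\u0938\u0941\u0938"));
}
```

The returned string holds the same 15 UTF-8 bytes as the hex-escaped literal, regardless of which language standard is selected.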
