How do I convert wchar_t* to std::string?

64,850

Solution 1

You could just use std::wstring and keep everything in Unicode.

Solution 2

std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );

Solution 3

You can convert a wide char string to an ASCII string using the following function:

#include <locale>
#include <sstream>
#include <string>

std::string ToNarrow( const wchar_t *s, char dfault = '?', 
                      const std::locale& loc = std::locale() )
{
  std::ostringstream stm;

  while( *s != L'\0' ) {
    stm << std::use_facet< std::ctype<wchar_t> >( loc ).narrow( *s++, dfault );
  }
  return stm.str();
}

Be aware that this simply replaces any wide character that has no equivalent ASCII character with the dfault parameter; it doesn't convert from UTF-16 to UTF-8. If you want to convert to UTF-8, use a library such as ICU.
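Where pulling in ICU is not an option, the UTF-8 encoding step itself is small enough to sketch by hand. The following is only an illustrative sketch, not a replacement for a tested library. The function names append_utf8 and wide_to_utf8 are hypothetical, and the code assumes that wchar_t holds either UTF-16 code units (as on Windows) or UTF-32 code points (as on most Unix-like systems).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to 'out'.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical converter: assumes UTF-16 or UTF-32 content in the wide string.
std::string wide_to_utf8(const wchar_t* s)
{
    std::string out;
    while (*s != L'\0') {
        std::uint32_t cp = static_cast<std::uint32_t>(*s++);
        const std::uint32_t next = static_cast<std::uint32_t>(*s);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
            ++s;
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}
```

Unlike ToNarrow above, this does not depend on the current locale: the result is always UTF-8, so it can be handed to any function that expects a const char* holding UTF-8 text.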

Solution 4

This is an old question, but if you are not really after a conversion and are instead using the TCHAR machinery from Microsoft to build for both ASCII and Unicode, recall that std::string is really

typedef std::basic_string<char> string;

So we could define our own typedef, say

#include <string>
namespace magic {
typedef std::basic_string<TCHAR> string;
}

Then you could use magic::string with TCHAR, LPCTSTR, and so forth.
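A minimal usage sketch of that typedef follows. It assumes a Windows build, where <tchar.h> supplies TCHAR and the _T() literal macro. The fallback definitions for other platforms and the helper name make_greeting are hypothetical, added only so the snippet compiles anywhere.

```cpp
#include <string>

#if defined(_WIN32)
#include <tchar.h>      // TCHAR and _T(): char or wchar_t, depending on _UNICODE
#else
typedef char TCHAR;     // assumption: stand-in so the sketch builds off Windows
#define _T(x) x
#endif

namespace magic {
typedef std::basic_string<TCHAR> string;
}

// Hypothetical helper: the same source builds as narrow or wide text.
magic::string make_greeting(const TCHAR* name)
{
    magic::string s = _T("Hello, ");
    s += name;
    return s;
}
```

Calling make_greeting(_T("world")) yields a std::string in a multi-byte build and a std::wstring in a Unicode build, without any change to the calling code.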

Solution 5

It's rather disappointing that none of the answers given to this old question addresses the problem of converting wide strings into UTF-8 strings, which is important in non-English environments.

Here is example code that works and may serve as a starting point for custom converters. It is based on example code from cppreference.com.

#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>
#include <array>
#include <stdexcept>  // std::invalid_argument, std::logic_error

std::string convert(const std::wstring& wstr)
{
    constexpr std::size_t BUFF_SIZE = 7;  // same type as MB_CUR_MAX, avoids a signed/unsigned comparison
    if (MB_CUR_MAX >= BUFF_SIZE) throw std::invalid_argument("BUFF_SIZE too small");
    std::string result;
    std::wctomb(nullptr, 0);  // reset the conversion state
    for (const wchar_t wc : wstr)
    {
        std::array<char, BUFF_SIZE> buffer;
        const int ret = std::wctomb(buffer.data(), wc);
        if (ret < 0) throw std::invalid_argument("inconvertible wide characters in the current locale");
        buffer[ret] = '\0';  // make 'buffer' contain a C-style string
        result += buffer.data();  // append in place instead of building a temporary
    }
    return result;
}

int main()
{
    auto loc = std::setlocale(LC_ALL, "en_US.utf8");  // UTF-8
    if (loc == nullptr) throw std::logic_error("failed to set locale");
    std::wstring wstr = L"aąß水𝄋-扫描-€𐍈\u00df\u6c34\U0001d10b";
    std::cout << convert(wstr) << "\n";
}

This prints, as expected:

[Image: program printout]

Explanation

  • 7 seems to be the minimal safe value of the buffer size, BUFF_SIZE: 4 for the maximum number of UTF-8 bytes encoding a single character, 2 for a possible "shift sequence", and 1 for the trailing '\0'.
  • MB_CUR_MAX is a run-time value, so static_assert cannot be used here.
  • Each wide character is translated into its char representation using std::wctomb.
  • This conversion makes sense only if the current locale allows multi-byte representations of a character.
  • For this to work, the application needs to set a proper locale. en_US.utf8 seems to be sufficiently universal (available on most machines). On Linux, the available locales can be listed in the console with the locale -a command.

Critique of the most upvoted answer

The most upvoted answer,

std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );

works well only when the wide characters represent ASCII characters, but that is not what wide characters were designed for. In this solution, the converted string contains one char per source wide char, so ws.size() == test.size(). It therefore loses information from the original wstring and produces strings that cannot be interpreted as proper UTF-8 sequences. For example, on my machine the string resulting from this simplistic conversion of "ĄŚĆII" prints as "ZII", even though its size is 5 (and should be 8).
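The size mismatch described above can be checked directly. In this sketch the helper names naive_narrow and count_non_ascii are hypothetical. The first reproduces the iterator-range conversion; the second counts the wide characters that cannot survive it.

```cpp
#include <cstddef>
#include <string>

// Reproduces the critiqued approach: each wchar_t is narrowed to one char.
std::string naive_narrow(const std::wstring& ws)
{
    return std::string(ws.begin(), ws.end());
}

// Counts wide characters outside the ASCII range; each of these is
// truncated, not converted, by naive_narrow.
std::size_t count_non_ascii(const std::wstring& ws)
{
    std::size_t n = 0;
    for (const wchar_t wc : ws) {
        if (static_cast<unsigned long>(wc) > 0x7F) ++n;
    }
    return n;
}
```

For the wide string L"\u0104\u015A\u0106II" (that is, "ĄŚĆII"), naive_narrow returns a string whose size equals the number of wide characters, while count_non_ascii reports how many of them fall outside ASCII. A nonzero count is a signal that a real encoding conversion is needed.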

Author: codefrog

Updated on January 07, 2022

Comments

  • codefrog
    codefrog over 2 years

    I changed my class to use std::string (based on the answer I got here), but a function I have returns wchar_t *. How do I convert it to std::string?

    I tried this:

    std::string test = args.OptionArg();
    

    but it says error C2440: 'initializing' : cannot convert from 'wchar_t *' to 'std::basic_string<_Elem,_Traits,_Ax>'

  • codefrog
    codefrog over 13 years
    and I'll still get a const char* if I use .c_str()? I have other functions that expect const char*
  • Steve Townsend
    Steve Townsend over 13 years
    I'm going to make a guess that you are building your project in Unicode but really don't want that. If this is correct, you can change your project's properties to not build for Unicode and then you can use string. Check this in Project Properties, Configuration Properties, General, Character Set. You need this to say Use Multibyte Character Set to get rid of Unicode everywhere.
  • codefrog
    codefrog over 13 years
    Originally I planned to use Unicode for some parts but then I decided I'll worry about that later. At this point I'm only bothered to get the program to work. I'm using SimpleINI and SimpleOpt to load options and it uses Unicode. I'm also using the SDK of another software which also uses Unicode. Disabling Unicode altogether might make even those parts of the code stop working.
  • Steve Townsend
    Steve Townsend over 13 years
    SimpleIni docs indicate it uses the same conventions as Windows and so will work whichever way you build. For Unicode it uses a W suffix, for multi-byte charset it uses an A suffix, on function and class names. You should use the undecorated names (no A or W) and it will build in the right code depending on your project settings.
  • Praetorian
    Praetorian over 13 years
    Since you're programming on Windows you probably should be using Unicode. The Windows API and NTFS natively support UTF-16, so building ASCII applications incurs an additional overhead where each function is doing string conversions for you.
  • Steve Townsend
    Steve Townsend over 13 years
    @Praetorian - regardless of the correctness of that advice in the general case, path of least resistance is to use MBCS, since code is using char* elsewhere
  • Praetorian
    Praetorian over 13 years
    @Steve: Yes, of course, I wasn't disputing that. If the OP doesn't have access to the source code that uses char * then he should convert the entire project to MBCS.
  • codefrog
    codefrog over 13 years
    I'm gonna try using wstring and see how it goes. Thanks for the answers.
  • Stephen
    Stephen about 10 years
    Many applications use UTF-8 internally. Windows is a right pain because wchar_t isn't big enough and it doesn't really support UTF-8 properly. This makes life difficult when you have (like me) a large codebase application which uses UTF-8 internally. Mostly this works fine, but it's the interaction with some of the OS-level functions that becomes annoying.
  • riv
    riv over 8 years
    How is it an accepted answer if it doesn't even answer the question?
  • Ian
    Ian over 7 years
    Provides the actual answer to the question!
  • Julian
    Julian about 7 years
    I like this solution for its simplicity. However, a little explanation couldn't hurt. It leaves open the question of how the characters are actually converted. Is there an information loss or are the wide characters converted to unicode?
  • zett42
    zett42 almost 7 years
    I don't know why this answer got so many upvotes. What it does is equivalent to char c = static_cast<char>( wideChar ) for each character, so it obviously loses information if the wide-string characters are not in the ASCII range.
  • truthadjustr
    truthadjustr about 4 years
    My hero! Thank you for directly providing the answer for the 99.9% of us.
  • j b
    j b about 3 years
    @zett42 isn't that going to be true of any method to convert wchar_t to std::string, since by definition it's a lossy conversion...
  • zett42
    zett42 about 3 years
    @jb Depends on the encoding of the std::string. E. g. when using UTF-8 there is no loss of information.