Convert UTF-8 to ANSI in C++

c++ utf-8 character-encoding ascii

13,285

Solution 1

Generally, one uses libiconv, which is portable and runs on most platforms. As KerrekSB mentioned, you will get into deep trouble if you think of a character set as "extended ASCII": there are probably at least a hundred character sets that could be called "extended ASCII", including UTF-8.
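As a rough sketch of what that looks like with the POSIX-style iconv API (iconv_open, iconv, iconv_close), something like the following should work. The helper name convertEncoding is made up for illustration, and the encoding names your iconv build accepts may differ, so treat this as a starting point rather than a finished implementation.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of libiconv): converts `input` from the
// encoding named `from` to the encoding named `to`.
std::string convertEncoding(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);   // note the order: target encoding first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open: conversion not supported");

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());   // iconv only reads the input bytes
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        size_t outLeft = sizeof buffer;
        const size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        const int err = errno;                     // save before anything can change it
        output.append(buffer, sizeof buffer - outLeft);
        // E2BIG only means the output buffer filled up, so loop and continue.
        if (rc == static_cast<size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or unconvertible input");
        }
    }
    // Stateful target encodings would also need a final flush call here;
    // single-byte targets such as ISO-8859-1 do not.
    iconv_close(cd);
    return output;
}
```

Calling convertEncoding(text, "UTF-8", "ISO-8859-1") would then give the Latin-1 bytes. How unrepresentable characters are handled (error versus substitution) varies between iconv implementations, so check the one you link against.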

Also, make sure you know which encoding you want: ISO-8859-1 or CP1252 (Windows-1252). The two are identical except in the range 0x80-0x9F, where the Windows code page replaces most of the C1 control codes with additional printing characters.
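To make the difference concrete, here is a small sketch that decodes a single byte under each encoding. The function names are invented for this example, and returning U+FFFD for the five bytes CP1252 leaves unassigned is my own choice (Windows itself maps them differently).

```cpp
#include <cstdint>

// ISO-8859-1: every byte maps to the code point with the same value.
char32_t decodeLatin1(std::uint8_t byte)
{
    return byte;
}

// CP1252: same as ISO-8859-1 except for 0x80-0x9F.
// Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) yield U+FFFD in this sketch.
char32_t decodeCp1252(std::uint8_t byte)
{
    static const char32_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,  // 0x80-0x87
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,  // 0x88-0x8F
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,  // 0x90-0x97
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178   // 0x98-0x9F
    };
    if (byte >= 0x80 && byte <= 0x9F)
        return high[byte - 0x80];
    return byte;
}
```

For example, byte 0x80 is the control code U+0080 in ISO-8859-1 but the euro sign U+20AC in CP1252, while 0xE9 is U+00E9 in both.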

Solution 2

Windows only:

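#include <windows.h>
#include <string>

// Convert a UTF-8 string to the system's active ANSI code page (CP_ACP).
// Characters the code page cannot represent are replaced by its default
// character (usually '?'). Returns an empty string if the API reports failure.
std::string UTF8ToANSI(const std::string& s)
{
    if (s.empty())
        return std::string();

    // Step 1: UTF-8 -> UTF-16. The first call only computes the required length.
    const int srcLen = static_cast<int>(s.size());
    const int wideLen = MultiByteToWideChar(CP_UTF8, 0, s.data(), srcLen, NULL, 0);
    if (wideLen <= 0)
        return std::string();
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), srcLen, &wide[0], wideLen);

    // Step 2: UTF-16 -> ANSI code page, again sizing the buffer first.
    const int ansiLen = WideCharToMultiByte(CP_ACP, 0, wide.data(), wideLen, NULL, 0, NULL, NULL);
    if (ansiLen <= 0)
        return std::string();
    std::string ansi(ansiLen, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.data(), wideLen, &ansi[0], ansiLen, NULL, NULL);
    return ansi;
}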

Solution 3

Assuming that by "ANSI" you really mean one of the ISO 8859 variants, we should start with a couple of points.

The first is that not every string can be converted from UTF-8 (or Unicode in general, regardless of the transformation used) into ISO 8859. Unicode has a unique code point for virtually every character in every language on earth.

ISO 8859 supports far fewer languages, and has a separate character set (a separate "part") for each group of languages it does support; the same codes represent different characters in different parts.

This means it's quite easy for a UTF-8 input string to contain characters that can't be represented in any ISO 8859 variant at all, and it's also easy for it to contain characters that require different ISO 8859 variants to represent.

The second is that even at best, the transformation may be quite non-trivial. If at all possible, you almost certainly want to use a library (e.g., libiconv) for this task. Just for example, Unicode has a...feature called "combining diacritical marks", which lets you encode something like an "A with acute accent" as either a single code point or two separate code points (one for the "A" and the other for the accent). To encode that in ISO 8859, you'll have to convert those all to one form (normally the pre-combined form).
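Just to illustrate the idea, here is a toy composition step that only knows about the combining acute accent (U+0301) and four base letters. The function name and the tiny table are made up for this example; real normalization (NFC/NFKC) covers far more cases and is best left to a library such as ICU.

```cpp
#include <string>

// Toy sketch: fold "letter + U+0301 (combining acute)" into the precomposed
// letter for a handful of cases. Anything else is copied through unchanged.
std::u32string composeAcute(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x0301 && !out.empty()) {
            char32_t composed = 0;
            switch (out.back()) {
                case U'A': composed = 0x00C1; break;  // A with acute
                case U'E': composed = 0x00C9; break;  // E with acute
                case U'a': composed = 0x00E1; break;  // a with acute
                case U'e': composed = 0x00E9; break;  // e with acute
                default: break;
            }
            if (composed != 0) {
                out.back() = composed;  // replace the base letter with the composed form
                continue;               // and drop the combining mark
            }
        }
        out.push_back(c);
    }
    return out;
}
```

So the two-code-point sequence U+0041 U+0301 becomes the single code point U+00C1, which ISO-8859-1 can represent, whereas the combining mark on its own cannot be encoded there.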

Before you do any significant work with Unicode text, you also normally want to convert the UTF-8 to UCS-4 (UTF-32), so that each code point occupies exactly one fixed-width element.
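If you don't have a library at hand, a minimal hand-rolled decoder might look like the sketch below. The name utf8ToUcs4 is hypothetical, and throwing on malformed input is just one possible policy (substituting U+FFFD is another common one).

```cpp
#include <stdexcept>
#include <string>

// Sketch: decode UTF-8 into one char32_t per code point.
// Throws std::invalid_argument on malformed input.
std::u32string utf8ToUcs4(const std::string& utf8)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t minValue[4] = { 0x0, 0x80, 0x800, 0x10000 };

    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (int k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and values beyond U+10FFFF.
        if (cp < minValue[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}
```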

So, the sequence would be something like:

1. Convert UTF-8 to UCS-4.
2. Convert combining diacritical marks to letters with diacritical marks (probably NFKC).
3. Check that all the characters can be encoded in the target character set.
4. Convert to the target set.

Depending on the way you prefer to do things, you might combine 3 and 4 into a single step, converting characters as you go and, for example, throwing an exception if you encounter a character that can't be represented in the target character set.
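For the specific case of ISO-8859-1 as the target, the combined check-and-convert step is short, because Latin-1 is exactly the first 256 Unicode code points. The sketch below assumes the input has already been decoded to UCS-4 and normalized; the function name is invented for this example, and other ISO 8859 parts would need a lookup table instead of a simple range check.

```cpp
#include <stdexcept>
#include <string>

// Sketch: convert UCS-4 text to ISO-8859-1 bytes, checking each character
// as it goes and throwing if one cannot be represented.
std::string ucs4ToLatin1(const std::u32string& text)
{
    std::string out;
    out.reserve(text.size());  // Latin-1 uses exactly one byte per character
    for (char32_t cp : text) {
        if (cp > 0xFF)
            throw std::range_error("character not representable in ISO-8859-1");
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

Whether an exception is the right reaction depends on your application; substituting '?' or skipping the character are equally plausible policies.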
