How to convert (char *) from ISO-8859-1 to UTF-8 in C++ in a cross-platform way?

c++ c utf-8 character-encoding iso-8859-1

11,043

Solution 1

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */

if(ch < 0x80) {
    append(ch);
} else {
    append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
    append(0x80 | (ch & 0x3f));
}

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: According to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
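To make the bit manipulation concrete, here is a minimal sketch of the same per-byte step as a standalone function. The name latin1_byte_to_utf8 and the two-byte output buffer are assumptions made for illustration, not part of the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: encodes one Latin-1 byte into out[0..1] and
// returns the number of UTF-8 bytes written (1 or 2).
std::size_t latin1_byte_to_utf8(std::uint8_t ch, std::uint8_t out[2])
{
    assert(out != nullptr);
    if (ch < 0x80) {
        // ASCII range: the byte is already valid UTF-8.
        out[0] = ch;
        return 1;
    }
    // Lead byte 110000xx carries the top two bits of the code point.
    out[0] = static_cast<std::uint8_t>(0xC0 | (ch >> 6));
    // Continuation byte 10xxxxxx carries the low six bits.
    out[1] = static_cast<std::uint8_t>(0x80 | (ch & 0x3F));
    return 2;
}
```

For example, Latin-1 Ç is 0xC7 (binary 11000111). The top two bits (11) give the lead byte 0xC0 | 0x03 = 0xC3, and the low six bits (000111) give the continuation byte 0x80 | 0x07 = 0x87, so the UTF-8 sequence is C3 87.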

Solution 2

For C++, I use this:

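#include <cstdint>
#include <string>

// Converts an ISO-8859-1 (Latin-1) string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII range: copied unchanged.
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80..0xFF: two-byte sequence 110000xx 10xxxxxx.
            strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}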
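The question also asks for the reverse direction. The following is only a sketch under stated assumptions: the function name utf8_to_iso_8859_1 and the choice to return false on anything that does not fit in Latin-1 are illustrative, not taken from the answers here. It relies on the fact that code points U+0080 to U+00FF are encoded in UTF-8 with a lead byte of 0xC2 or 0xC3.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical inverse: decodes UTF-8 into ISO-8859-1.
// Returns false if the input is malformed or contains a code point
// above U+00FF, which Latin-1 cannot represent.
bool utf8_to_iso_8859_1(const std::string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // ASCII: copied unchanged.
            out.push_back(static_cast<char>(c));
        } else if (c == 0xC2 || c == 0xC3) {
            // Two-byte sequence covering U+0080..U+00FF.
            if (i + 1 >= in.size()) return false;
            const unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            if ((c2 & 0xC0) != 0x80) return false; // not a continuation byte
            out.push_back(static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F)));
            ++i; // skip the continuation byte just consumed
        } else {
            // Stray continuation byte, overlong lead, or code point > U+00FF.
            return false;
        }
    }
    assert(out.size() <= in.size()); // decoding never grows the string
    return true;
}
```

Rejecting unrepresentable input is one possible policy; substituting a placeholder such as '?' would be another, depending on what the application needs.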

Solution 3

If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.

Compose a static translation table (char to UTF-8 sequence) and write your own translation routine. Depending on what you use for string storage (char buffers, std::string, or something else), it will look somewhat different, but the idea is the same: scan through the source string and replace each character whose code is over 127 with its UTF-8 counterpart string. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, and pass two performs the translation.
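The two-pass idea might look roughly like this with std::string storage. This is a sketch, not a definitive implementation: both function names are made up for illustration, and it computes the two-byte sequences arithmetically rather than reading them from a static table, which is equivalent for true ISO-8859-1 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pass one: count the UTF-8 bytes needed.
// Bytes below 0x80 need one byte, all others need two.
std::size_t utf8_size_for_latin1(const std::string& src)
{
    std::size_t n = 0;
    for (unsigned char ch : src) {
        n += (ch < 0x80) ? 1 : 2;
    }
    return n;
}

// Pass two: translate into a string whose capacity was set by pass one,
// so no reallocation happens while appending.
std::string latin1_to_utf8_two_pass(const std::string& src)
{
    const std::size_t needed = utf8_size_for_latin1(src);
    std::string out;
    out.reserve(needed);
    for (unsigned char ch : src) {
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else {
            out.push_back(static_cast<char>(0xC0 | (ch >> 6)));
            out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
        }
    }
    assert(out.size() == needed); // pass one and pass two must agree
    return out;
}
```

With raw char buffers the same structure applies: call the sizing pass first, allocate that many bytes (plus one for a terminator if needed), then run the translation pass into the new buffer.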


Author by

gabriel

Updated on June 05, 2022

Comments

gabriel almost 2 years

I'm changing software written in C++, which processes text in ISO Latin 1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8.

I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing in the database. I need it to work in Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these two charsets.

How would I do that?
ninjalj about 13 years

If it's real Latin1, the translation table is trivial, Latin1 maps directly to the first 256 Unicode codepoints.
Mark Ransom about 13 years

@ninjalj, this answer doesn't propose translating to codepoints but to UTF-8 sequences. Each sequence will be either one or two bytes.
Seva Alekseyev about 13 years

Two translations instead of one?
Nemanja Trifunovic about 13 years

One is not a translation: code units of ISO Latin 1 are exactly the same as the ones for UTF-16, just of a different size. That's why I say he can probably supply the Latin1 string directly to the utf16to8 function.
ninjalj about 13 years

@Mark Ransom: it's the same, it's trivial to generate the table without having to look at loads of character tables.
ninjalj about 13 years

@Mark: which, incidentally, you would have to do to translate from/to CP1252
ninjalj about 13 years

As I said, if it's real Latin1. Windows CP1252 (sometimes incorrectly called Latin1) has additional characters (in a range reserved in ISO-8859 for control characters), most notably, versions of opening and closing quotes.
ninjalj about 13 years

Oh, and there's no below on SO ;-P
dan04 about 13 years

(ch & 0xc0) >> 6 is redundant. You can just write ch >> 6.
Jason about 13 years

@dan04: can't ever hurt to be explicit.
spakai almost 7 years

I really can't understand the table on the Wikipedia link. So if I have Latin-1 Ç, that falls under 11 bits, but how does the above formula work?
spakai almost 7 years

ok this demo shines some light - codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/‌…
MaestroMaus over 6 years

This solution does seem to work for me on Unix systems but somehow does not seem to work on Windows with Visual Studio. Does anyone have any ideas?
MaestroMaus over 6 years

I tried this solution. It failed on special characters (such as ë) sadly. Nice theory though.