How to convert (char *) from ISO-8859-1 to UTF-8 in C++ in a cross-platform way?
Solution 1
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
    uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
    if (ch < 0x80) {
        append(ch);
    } else {
        append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8 bits */
        append(0x80 | (ch & 0x3f));
    }
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
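To make the bit manipulation concrete, here is a minimal sketch of the same rule applied to a single byte. The helper name `latin1_char_to_utf8` is hypothetical and not part of the original answer. For Ç (Latin-1 0xC7, binary 11000111), the top two bits give the lead byte 0xC0 | 0x03 = 0xC3 and the low six bits give the trail byte 0x80 | 0x07 = 0x87.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one ISO-8859-1 byte as a UTF-8 sequence.
std::string latin1_char_to_utf8(std::uint8_t ch)
{
    std::string out;
    if (ch < 0x80) {
        // ASCII range: the byte is already valid UTF-8.
        out.push_back(static_cast<char>(ch));
    } else {
        // 0x80..0xFF: lead byte 110000xx carries the top two bits,
        // trail byte 10xxxxxx carries the low six bits.
        out.push_back(static_cast<char>(0xc0 | (ch >> 6)));
        out.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
    }
    return out;
}
```

Calling `latin1_char_to_utf8(0xC7)` should yield the two bytes 0xC3 0x87, and ë (0xEB) should yield 0xC3 0xAB.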
Solution 2
For C++ I use this:
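#include <cstdint>
#include <string>

// Convert an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        // Go through an unsigned type so bytes >= 0x80 are not sign-extended.
        std::uint8_t ch = static_cast<std::uint8_t>(*it);
        if (ch < 0x80) {
            // ASCII range: copied unchanged.
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80..0xFF: two bytes, 110000xx then 10xxxxxx.
            strOut.push_back(static_cast<char>(0xc0 | ch >> 6));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}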
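The question also asks for the conversion to work back and forth. The following is a rough sketch of the reverse direction, not something taken from the answer above: the function name `utf8_to_iso_8859_1` and the choice to throw on input that does not fit in Latin-1 are assumptions. Only the lead bytes 0xC2 and 0xC3 can produce code points in the range U+0080 to U+00FF, so anything else above 0x7F is rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical inverse of iso_8859_1_to_utf8 (assumption, not from the answer).
std::string utf8_to_iso_8859_1(const std::string &str)
{
    std::string out;
    out.reserve(str.size());
    for (std::size_t i = 0; i < str.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(str[i]);
        if (ch < 0x80) {
            // ASCII range: copied unchanged.
            out.push_back(static_cast<char>(ch));
        } else if ((ch == 0xc2 || ch == 0xc3) && i + 1 < str.size()
                   && (static_cast<unsigned char>(str[i + 1]) & 0xc0) == 0x80) {
            // Two-byte sequence for U+0080..U+00FF: rebuild the single byte.
            const unsigned char next = static_cast<unsigned char>(str[++i]);
            out.push_back(static_cast<char>(((ch & 0x03) << 6) | (next & 0x3f)));
        } else {
            // Malformed UTF-8, or a code point above U+00FF.
            throw std::invalid_argument("input not representable in ISO-8859-1");
        }
    }
    return out;
}
```

Depending on the application, replacing unrepresentable characters with a placeholder such as '?' may be preferable to throwing.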
Solution 3
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and write your own translation routine. Depending on what you use for string storage (char buffers, std::string, or something else) the code will look somewhat different, but the idea is the same: walk through the source string and replace each character whose code is above 127 with its UTF-8 counterpart sequence. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the required target string size, and pass two performs the translation.
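The table-driven, two-pass idea could be sketched roughly as follows. The names `Utf8Pair`, `build_table` and `latin1_to_utf8_table` are hypothetical, and the sketch assumes true ISO-8859-1 input in a char buffer with a known length. The table is generated at first use rather than typed out by hand, since every byte from 0x80 to 0xFF maps to exactly two UTF-8 bytes.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical table entry: the two UTF-8 bytes for one Latin-1 byte >= 0x80.
struct Utf8Pair { char lead; char trail; };

// Build the 128-entry table for bytes 0x80..0xFF.
static std::array<Utf8Pair, 128> build_table()
{
    std::array<Utf8Pair, 128> table{};
    for (unsigned i = 0; i < 128; ++i) {
        const unsigned ch = 0x80 + i;
        table[i].lead = static_cast<char>(0xc0 | (ch >> 6));
        table[i].trail = static_cast<char>(0x80 | (ch & 0x3f));
    }
    return table;
}

std::string latin1_to_utf8_table(const char *src, std::size_t len)
{
    static const std::array<Utf8Pair, 128> table = build_table();

    // Pass one: each byte above 127 needs one extra output byte.
    std::size_t out_len = len;
    for (std::size_t i = 0; i < len; ++i) {
        if (static_cast<unsigned char>(src[i]) >= 0x80) ++out_len;
    }

    // Pass two: translate into a buffer reserved at the exact size.
    std::string out;
    out.reserve(out_len);
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char ch = static_cast<unsigned char>(src[i]);
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else {
            out.push_back(table[ch - 0x80].lead);
            out.push_back(table[ch - 0x80].trail);
        }
    }
    return out;
}
```

For a fixed-size char buffer instead of std::string, the count from pass one would be used to allocate the destination before pass two writes into it.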
gabriel
Updated on June 05, 2022
Comments
-
gabriel almost 2 years
I'm changing a piece of software in C++, which processes text in ISO Latin 1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8. I want a way to convert the ISO Latin 1 characters to UTF-8 characters before storing them in the database. I need it to work on Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these 2 charsets.
How would I do that?
-
ninjalj about 13 years
If it's real Latin1, the translation table is trivial: Latin1 maps directly to the first 256 Unicode codepoints.
-
Mark Ransom about 13 years
@ninjalj, this answer doesn't propose translating to codepoints but to UTF-8 sequences. Each sequence will be either one or two bytes.
-
Seva Alekseyev about 13 years
Two translations instead of one?
-
Nemanja Trifunovic about 13 years
One is not a translation: the code units of ISO Latin 1 are exactly the same as the ones for UTF-16, just of a different size. That's why I say he can probably supply the Latin1 string directly to the utf16to8 function.
-
ninjalj about 13 years
@Mark Ransom: it's the same, it's trivial to generate the table without having to look at loads of character tables.
-
ninjalj about 13 years
@Mark: which, incidentally, you would have to do to translate from/to CP1252.
-
ninjalj about 13 years
As I said, if it's real Latin1. Windows CP1252 (sometimes incorrectly called Latin1) has additional characters (in a range reserved in ISO-8859 for control characters), most notably, versions of opening and closing quotes.
-
ninjalj about 13 years
Oh, and there's no below on SO ;-P
-
dan04 about 13 years
(ch & 0xc0) >> 6 is redundant. You can just write ch >> 6.
-
Jason about 13 years
@dan04: can't ever hurt to be explicit.
-
spakai almost 7 years
I really can't understand the table on the Wikipedia link. So if I have Latin-1 Ç, that falls under 11 bits, but how does the formula above work?
-
spakai almost 7 years
OK, this demo sheds some light: codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/…
-
MaestroMaus over 6 years
This solution does seem to work for me on Unix systems but somehow does not seem to work on Windows with Visual Studio. Does anyone have any ideas?
-
MaestroMaus over 6 years
I tried this solution. Sadly, it failed on special characters (such as ë). Nice theory though.