Convert character from UTF-8 to ISO-8859-1 manually

Solution 1

The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" gives you the UTF-8 encoding of the code point. Both simply list the Unicode value (the code point) of each character.

In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal, and 246 in decimal.

UTF-8 is an encoding of Unicode that uses a sequence of one to four bytes per code point. The transformation from Unicode code points (values up to U+10FFFF, often stored as 32-bit integers) to UTF-8 byte sequences is described in that article - it is pretty simple to do, once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.

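If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary. F6 is 1111 0110; padded to 11 bits that is 00011 110110. The first five bits go after the prefix 110 to give 1100 0011 (C3), and the last six bits go after the prefix 10 to give 1011 0110 (B6).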
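The pencil-and-paper rules can also be written down as code. What follows is a minimal sketch in C, not a hardened implementation: utf8_encode is a hypothetical name chosen for illustration, and the caller is assumed to supply a buffer of at least four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): encode one Unicode code
 * point as UTF-8. Assumes out has room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

Calling utf8_encode(0x00F6, buf) returns 2 and stores the bytes C3 B6 in buf, which matches the result worked out by hand.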

The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.

Once you have decoded the UTF-8 bytes back to a Unicode code point, getting the Latin-1 encoding is trivial: if the code point is U+00FF or below, the Latin-1 byte is simply the code point value. However, code points above U+00FF have no corresponding Latin-1 character, so not every UTF-8 sequence can be converted.
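The reverse direction can be sketched in the same spirit. In well-formed UTF-8, only the lead bytes C2 and C3 can start a character in the range U+0080 to U+00FF, so anything else above ASCII is rejected here. utf8_to_latin1 is a hypothetical helper name, the validation is deliberately minimal, and the destination buffer is assumed to be at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical helper (not from any library): convert len bytes of
 * UTF-8 in src to Latin-1 in dst. Assumes dst has room for len bytes.
 * Returns the number of Latin-1 bytes written, or (size_t)-1 if the
 * input is malformed or contains a code point above U+00FF. */
size_t utf8_to_latin1(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = src[i];

        if (b < 0x80) {
            /* ASCII: identical in UTF-8 and Latin-1 */
            dst[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3)
                   && i + 1 < len
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* 110000xx 10yyyyyy  ->  xxyyyyyy */
            dst[n++] = (unsigned char)(((b & 0x03) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            /* malformed input, or a character with no Latin-1 equivalent */
            return (size_t)-1;
        }
    }
    return n;
}
```

Given the bytes C3 B6, this writes the single byte F6. Given E2 82 AC (the euro sign, U+20AC), it reports failure, because that character does not exist in Latin-1.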

See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.

Solution 2

/* Encode one Latin-1 character (0x00 -- 0xff) as UTF-8.
** Returns the number of bytes written to dst (1 or 2). */
unsigned cha_latin2utf8(unsigned char *dst, unsigned cha)
{
    if (cha < 0x80) { *dst = cha; return 1; }
    /* all 11 bit codepoints (0x0 -- 0x7ff)
    ** fit within a 2byte utf8 char
    ** firstbyte = 110 +xxxxx := 0xc0 + (char>>6) MSB
    ** second    = 10 +xxxxxx := 0x80 + (char& 63) LSB
    */
    *dst++ = 0xc0 | ((cha >> 6) & 0x1f); /* prefix 110 + 5 bits */
    *dst++ = 0x80 | (cha & 0x3f);        /* prefix 10  + 6 bits */

    return 2; /* number of bytes produced */
}

To test it:

#include <stdio.h>
int main (void)
{
    unsigned char buff[12]; /* unsigned, to match the function's parameter type */

    cha_latin2utf8(buff, 0xf6);

    fprintf(stdout, "%02x %02x\n"
        , (unsigned) buff[0]
        , (unsigned) buff[1] );

    return 0;
}

The result:

c3 b6
Author by

testing

Updated on August 01, 2022

Comments

  • testing
    testing over 1 year

    I have the character "ö". If I look in this UTF-8 table I see it has the hex value F6. If I look in the Unicode table I see that "ö" has the indices E0 and 16. If I add both I get the hex value of the code point, F6. This is the binary value 1111 0110.

    1) How do I get from the hex value F6 to the indices E0 and 16?
    2) I don't know how to come from F6 to the two bytes C3 B6 ...

    Because I didn't get the results, I tried to go the other way. The UTF-8 bytes of "ö", read as ISO-8859-1, are displayed as "Ã¶". In the UTF-8 table I can see that "Ã" has the decimal value 195 and "¶" has the decimal value 182. Converted to bits this is 1100 0011 1011 0110.

    Process:

    1. Look in a table and get the unicode for the character "ö". Calculated from the indices E0 and 16 you get the Unicode U+00F6.

    2. According to the algorithm posted by wildplasser you can calculate the coded UTF-8 value C3 and B6.

    3. In the binary form you get 1100 0011 1011 0110 which corresponds to the decimal values 195 and 182.

    4. If these values are interpreted as ISO 8859-1 (only 1 byte per character) then you get "Ã¶".

    PS: I also found this link, which shows the values from step 2.
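Step 4 of the process can be reproduced in code. The sketch below treats every input byte as one Latin-1 character and re-encodes it as UTF-8, which is roughly what happens when UTF-8 data is mislabelled as ISO 8859-1 and then shown on a UTF-8 display. latin1_to_utf8 is a hypothetical helper name, and the destination buffer is assumed to hold up to twice the input length.

```c
#include <stddef.h>

/* Hypothetical helper (not from any library): treat each of the len
 * bytes in src as one Latin-1 character and write its UTF-8 encoding
 * to dst. Assumes dst has room for 2 * len bytes.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            dst[n++] = c;                                  /* ASCII stays as is */
        } else {
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));   /* 110000xx */
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F)); /* 10yyyyyy */
        }
    }
    return n;
}
```

Feeding it the two bytes C3 B6 produces the four bytes C3 83 C2 B6, which are the UTF-8 encodings of "Ã" (U+00C3) and "¶" (U+00B6).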