How to convert UTF-8 to ISO-8859-1 in Ruby 2.0?


The encode method does work.

Let's create a string with U+00FC (ü):

uuml_utf8 = "\u00FC"       #=> "ü"

Ruby encodes this string in UTF-8:

uuml_utf8.encoding         #=> #<Encoding:UTF-8>

In UTF-8, ü is represented as 195 188 (decimal):

uuml_utf8.bytes            #=> [195, 188]

Now let's convert the string to ISO-8859-1:

uuml_latin1 = uuml_utf8.encode("ISO-8859-1")

uuml_latin1.encoding       #=> #<Encoding:ISO-8859-1>

In ISO-8859-1, ü is represented as 252 (decimal):

uuml_latin1.bytes          #=> [252]

In UTF-8, however, the single byte 252 (0xFC) is not a valid sequence. That's why a terminal/console set to UTF-8 displays the replacement character "�" (U+FFFD) or no character at all.
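You can check this from Ruby itself. The following is a small sketch (the variable name as_utf8 is just illustrative): it reinterprets the Latin-1 byte as UTF-8 without converting it and asks Ruby whether the result is valid.

```ruby
uuml_latin1 = "\u00FC".encode("ISO-8859-1")

# Reinterpret the same byte as UTF-8 without transcoding it.
# dup is used because force_encoding modifies the string in place.
as_utf8 = uuml_latin1.dup.force_encoding("UTF-8")

as_utf8.bytes                 #=> [252]
as_utf8.valid_encoding?       #=> false
uuml_latin1.valid_encoding?   #=> true
```

The bytes are identical in both strings; only the encoding label differs, and that label decides whether the byte is meaningful.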

In order to display ISO-8859-1 encoded characters, you'll have to switch your terminal/console to that encoding, too.
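If switching the terminal is not practical, another option is to transcode back to UTF-8 just for display. This is only a sketch, assuming your terminal uses UTF-8; for_display is an illustrative name.

```ruby
uuml_latin1 = "\u00FC".encode("ISO-8859-1")

# Transcode to UTF-8 for output only; the Latin-1 string itself is unchanged.
for_display = uuml_latin1.encode("UTF-8")

for_display.bytes      #=> [195, 188]
uuml_latin1.bytes      #=> [252]
puts for_display       # shows ü on a UTF-8 terminal
```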

Author by

0x4a6f4672

You can find me via Twitter @jofr

Updated on June 04, 2022

Comments

  • 0x4a6f4672
    0x4a6f4672 almost 2 years

    Timezones for (date)-times and encodings for strings are no problem as long as you do not have to convert between them. In Ruby 1.9 and 2.0, encodings seem to be the new timezones from older Ruby versions: they cause nothing but trouble. Iconv has been replaced by the native encoding functions. How do you convert from the standard UTF-8 to ISO-8859-1, for example for use on Windows systems? In the Ruby 2.0 console the encode function does not seem to work, although it should be able to convert from a source encoding to a destination encoding via encode(dst_encoding, src_encoding) → str.

    >> "ABC äöüÄÖÜ".encoding
    => #<Encoding:UTF-8>
    >> "ABC äöüÄÖÜ".encode("UTF-8").encode("ISO-8859-1")
    => "ABC \xE4\xF6\xFC\xC4\xD6\xDC"
    >> "ABC äöüÄÖÜ".encode("ISO-8859-1","UTF-8")
    => "ABC \xE4\xF6\xFC\xC4\xD6\xDC"
    

    I am using Ruby 2.0.0 (Revision 41674) on a linux system.

  • 0x4a6f4672
    0x4a6f4672 over 10 years
    Yes, but in your example uuml_latin1 has the value "\xFC" and not the special character "ü". 'print uuml_latin1' gives �, while 'puts uuml_latin1' produces an empty string. Something seems to be wrong, or are the Ruby functions not able to display ISO-8859-1 encodings?
  • 0x4a6f4672
    0x4a6f4672 over 10 years
    0xFC is indeed the hex value for 252. This means Ruby 2.0 is not able to display strings with ISO-8859-1 encoding correctly, using the right characters? Why does it work with UTF-8 encoding, but not with ISO-8859-1 encoding?
  • Stefan
    Stefan over 10 years
    Ruby doesn't display the strings, your terminal does. Change it from UTF-8 to ISO-8859-1 and you'll see a ü.
  • 0x4a6f4672
    0x4a6f4672 over 10 years
    Ok, so the reason the encoding seems to be wrong is that the terminal/console/bash cannot display it, because it has the wrong locale/charset/character map/whatever.
  • Stefan
    Stefan over 10 years
    Exactly, 0xFC is not a valid UTF-8 sequence. It's like opening an ISO-8859-1 file in a UTF-8 editor.
  • 0x4a6f4672
    0x4a6f4672 over 10 years
    Can you update the answer accordingly? Then I can accept it. I finally managed to generate and export the right encoding. The encode function does work, but a) the terminal is not able to display ISO-8859-1 (Latin-1) or ISO-8859-15 (Latin-9) encodings because it uses UTF-8 by default, and b) encode has to be called in the right places. For instance, if you send the data with send_data, you have to call it there as well: send_data(csv_string.encode("ISO-8859-15"), :type => 'text/csv;charset=ISO-8859-15') stackoverflow.com/questions/9639153/…
  • Arup Rakshit
    Arup Rakshit over 10 years
    Very good explanation!! +1
  • Arup Rakshit
    Arup Rakshit over 10 years
    How do you know that 252 is an invalid sequence in UTF-8? Asking out of curiosity.
  • Stefan
    Stefan over 10 years
    @ArupRakshit See en.wikipedia.org/wiki/UTF-8#Codepage_layout: the bytes 192-193 and 245-255 (the red cells) are invalid.
  • Arup Rakshit
    Arup Rakshit over 10 years
    @Stefan Thanks!! Got it now..
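
The export point from the comments (call encode where the data leaves Ruby) can be tried without Rails. The following is a rough sketch under stated assumptions: the CSV content and the names csv_string, latin9 and exported_bytes are made up for illustration, and a temporary file stands in for whatever send_data would deliver.

```ruby
require "tempfile"

# Hypothetical CSV content in UTF-8, containing ü (U+00FC) and € (U+20AC).
csv_string = "name;price\nM\u00FCsli;3 \u20AC\n"

# ISO-8859-15 (Latin-9) contains the euro sign; ISO-8859-1 does not,
# so encoding this string to ISO-8859-1 would raise
# Encoding::UndefinedConversionError.
latin9 = csv_string.encode("ISO-8859-15")

# Write in binary mode so the encoded bytes reach the file unchanged.
file = Tempfile.new(["export", ".csv"])
file.binmode
file.write(latin9)
file.close

exported_bytes = File.binread(file.path).bytes
file.unlink

exported_bytes.include?(252)   #=> true (ü as one byte)
exported_bytes.include?(164)   #=> true (€ as one byte)
```

Every character in the sample takes exactly one byte in the exported file, which is what a Latin-9 consumer expects.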