How to convert UTF-8 to ISO-8859-1 in Ruby 2.0?
The encode method does work.
Let's create a string with U+00FC (ü):
uuml_utf8 = "\u00FC" #=> "ü"
Ruby encodes this string in UTF-8:
uuml_utf8.encoding #=> #<Encoding:UTF-8>
In UTF-8, ü is represented by the two bytes 195 188 (decimal):
uuml_utf8.bytes #=> [195, 188]
Now let's convert the string to ISO-8859-1:
uuml_latin1 = uuml_utf8.encode("ISO-8859-1")
uuml_latin1.encoding #=> #<Encoding:ISO-8859-1>
In ISO-8859-1, ü is represented by the single byte 252 (decimal):
uuml_latin1.bytes #=> [252]
In UTF-8, however, the byte 252 is an invalid sequence. That's why your terminal/console displays the replacement character "�" (U+FFFD) or no character at all.
In order to display ISO-8859-1 encoded characters, you'll have to switch your terminal/console to that encoding, too.
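Because the terminal output is misleading here, one way to confirm that the conversion worked is to inspect the string itself rather than what gets printed. A minimal sketch, reusing the variable names from above (`reinterpreted` and `round_trip` are names made up for this illustration):

```ruby
uuml_utf8   = "\u00FC"
uuml_latin1 = uuml_utf8.encode("ISO-8859-1")

# The Latin-1 string is valid in its own encoding ...
uuml_latin1.valid_encoding? #=> true

# ... but the same byte, reinterpreted as UTF-8, is not.
reinterpreted = uuml_latin1.dup.force_encoding("UTF-8")
reinterpreted.valid_encoding? #=> false

# Converting back to UTF-8 returns the original string.
round_trip = uuml_latin1.encode("UTF-8")
round_trip == uuml_utf8 #=> true
round_trip.bytes        #=> [195, 188]
```

Note that `force_encoding` only relabels the bytes, while `encode` actually transcodes them.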
Comments
-
0x4a6f4672 almost 2 years
Timezones for (date)-times and encodings for strings are no problem as long as you do not have to convert between them. In Ruby 1.9 and 2.0, encodings seem to be the new timezones from older Ruby versions: they cause nothing but trouble. Iconv has been replaced by the native encoding functions. How do you convert from the standard UTF-8 to ISO-8859-1, for example for use on Windows systems? In the Ruby 2.0 console the encode function does not work, although it should be able to convert from a source encoding to a destination encoding via
encode(dst_encoding, src_encoding) → str
>> "ABC äöüÄÖÜ".encoding
=> #<Encoding:UTF-8>
>> "ABC äöüÄÖÜ".encode("UTF-8").encode("ISO-8859-1")
=> "ABC \xE4\xF6\xFC\xC4\xD6\xDC"
>> "ABC äöüÄÖÜ".encode("ISO-8859-1","UTF-8")
=> "ABC \xE4\xF6\xFC\xC4\xD6\xDC"
I am using Ruby 2.0.0 (Revision 41674) on a linux system.
-
0x4a6f4672 over 10 years
Yes, but in your example uuml_latin1 has the value "\xFC" and not the special character "ü". 'print uuml_latin1' gives �, while 'puts uuml_latin1' produces an empty string. Something seems to be wrong, or are the Ruby functions not able to display ISO-8859-1 encodings?
-
0x4a6f4672 over 10 years
0xFC is indeed the hex value for 252. Does this mean Ruby 2.0 is not able to display strings with ISO-8859-1 encoding correctly, using the right characters? Why does it work with UTF-8 encoding, but not with ISO-8859-1 encoding?
-
Stefan over 10 years
Ruby doesn't display the strings, your terminal does. Change it from UTF-8 to ISO-8859-1 and you'll see a ü.
-
0x4a6f4672 over 10 years
Ok, so the reason the encoding seems to be wrong is that the terminal/console/bash cannot display it, because it has the wrong locale/charset/character map/whatever.
-
Stefan over 10 years
Exactly, 0xFC is not a valid UTF-8 sequence. It's like opening an ISO-8859-1 file in a UTF-8 editor.
-
0x4a6f4672 over 10 years
Can you update the answer accordingly? Then I can accept it. I finally managed to generate and export the right encoding. The encode function does work, but a) the terminal is not able to display ISO-8859-1 (Latin-1) or ISO-8859-15 (Latin-9) encodings because it uses UTF-8 by default, and b) it has to be used in the right places. For instance, if you send the data with send_data, encode has to be called there as well:
send_data(csv_string.encode("ISO-8859-15"), :type => 'text/csv;charset=ISO-8859-15')
stackoverflow.com/questions/9639153/…
-
Arup Rakshit over 10 years
Very good explanation! +1
-
Arup Rakshit over 10 years
How do you know that 252 is an invalid sequence in UTF-8? Asking out of curiosity.
-
Stefan over 10 years
@ArupRakshit See en.wikipedia.org/wiki/UTF-8#Codepage_layout: the byte values 192-193 and 245-255 (the red cells) are invalid.
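A rough way to check those ranges in Ruby itself: encode every Unicode code point as UTF-8 and record which byte values ever occur. This is only a sketch; the names `used` and `never_used` are made up for the illustration, and it relies on `Array#pack("U")`, which turns a code point into its UTF-8 bytes.

```ruby
# Track which of the 256 possible byte values show up in any UTF-8 sequence.
used = Array.new(256, false)

(0..0x10FFFF).each do |code_point|
  # Surrogates (U+D800..U+DFFF) are not valid in UTF-8, so skip them.
  next if (0xD800..0xDFFF).cover?(code_point)

  [code_point].pack("U").each_byte { |byte| used[byte] = true }
end

# Byte values that never occur in valid UTF-8.
never_used = (0..255).reject { |byte| used[byte] }
never_used #=> [192, 193, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255]
```

The loop covers about 1.1 million code points, so it takes a moment to run.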
-
Arup Rakshit over 10 years
@Stefan Thanks! Got it now.