Ruby method to remove accents from UTF-8 international characters
Solution 1
I generally use I18n to handle this:
1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"
Solution 2
The parameterize method (provided by ActiveSupport) could be a nice and simple way to remove special characters when the string is to be used as a human-readable identifier:
> "Françoise Isaïe".parameterize
=> "francoise-isaie"
Solution 3
So far the following is the only way I've been able to accomplish what I need:
str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")
But using this feels very 'hackish', and I would love to find a better way.
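One way to make the tr approach feel less hackish is to generate the two long strings from a small table, so the "from" and "to" sides cannot drift out of alignment when characters are added. This is only a sketch: the ACCENT_GROUPS table and the strip_accents_with_table method are hypothetical names, and only a handful of groups are shown.

```ruby
# Hypothetical table: each ASCII letter maps to the accented forms it replaces.
# Only a few groups are listed here; extend it with whatever characters you need.
ACCENT_GROUPS = {
  "A" => "ÀÁÂÃÄÅĀĂĄ",
  "a" => "àáâãäåāăą",
  "E" => "ÈÉÊËĒĔĖĘĚ",
  "e" => "èéêëēĕėęě",
  "S" => "ŚŜŞŠȘ",
  "s" => "śŝşšș",
  "T" => "ŢŤŦȚ",
  "t" => "ţťŧț"
}.freeze

# Build both tr arguments from the table so they always have the same length.
FROM = ACCENT_GROUPS.values.join
TO   = ACCENT_GROUPS.map { |ascii, accented| ascii * accented.length }.join

def strip_accents_with_table(str)
  str.tr(FROM, TO)
end

strip_accents_with_table("Hé les mecs!") # => "He les mecs!"
```

Adding a character then means appending it to one group instead of editing two long strings in lockstep.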
Solution 4
If you are using Rails:
"L'Oréal".parameterize(separator: ' ')
Gus Shortz
Updated on July 05, 2022
Comments
-
Gus Shortz almost 2 years
I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (i.e., accented letters), and I want to create a copy with the accents removed.
I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.
# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[à áâãäåÄÄ?]/u, 'a')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ÄÄ?]/u, 'd')
  n.gsub!(/[çÄ?ÄÄ?Ä?]/u, 'c')
  n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
  n.gsub!(/Æ?/u, 'f')
  n.gsub!(/[ÄÄ?Ä¡Ä£]/u, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ììÃîïīĩÄ]/u, 'i')
  n.gsub!(/[įıijĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[Å?ľĺļÅ?]/u, 'l')
  n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u, 'n')
  n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u, 'o')
  n.gsub!(/Å?/u, 'oe')
  n.gsub!(/Ä?/u, 'q')
  n.gsub!(/[Å?Å?Å?]/u, 'r')
  n.gsub!(/[Å?Å¡Å?ÅÈ?]/u, 's')
  n.gsub!(/[ťţŧÈ?]/u, 't')
  n.gsub!(/[ùúûüūůűÅũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end
Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.
I am not using Rails, nor do I plan on doing so.
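For readers on a newer Ruby, the same kind of normalization can be sketched with the standard library alone, with no plugin and no Rails. This is not the original method: it swaps the per-letter gsub! calls for Unicode decomposition, and it assumes Ruby 2.4 or newer (String#unicode_normalize plus Unicode-aware downcase). The names normalize_name and NO_DECOMPOSITION are hypothetical, and the fallback table is illustrative rather than complete.

```ruby
# Assumption: Ruby 2.4+ (String#unicode_normalize and Unicode-aware String#downcase).
# Letters that have no canonical decomposition need an explicit mapping.
# This table is illustrative, not exhaustive.
NO_DECOMPOSITION = {
  "æ" => "ae", "œ" => "oe", "ø" => "o",
  "đ" => "d",  "ł" => "l",  "ß" => "ss"
}.freeze

def normalize_name(str)
  n = str.strip.unicode_normalize(:nfd)  # split "é" into "e" + combining accent
  n = n.gsub(/\p{Mn}/, "")               # drop the combining marks
  n = n.downcase
  n = n.gsub(Regexp.union(NO_DECOMPOSITION.keys), NO_DECOMPOSITION)
  n = n.gsub(/\s+/, " ")                 # collapse runs of whitespace
  n.gsub(/[^\sa-z0-9_-]/, "")            # drop anything outside the allowed set
end

normalize_name("Françoise Isaïe") # => "francoise isaie"
```

Because the final step deletes everything outside a-z, 0-9, underscore, hyphen, and whitespace, characters from non-Latin scripts are removed rather than transliterated, which may or may not be what you want for duplicate detection.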
-
Huluk about 11 years
Which Ruby version are you using?
-
MurifoX about 11 years
Take a look at stackoverflow.com/questions/1268289/…
-
amalrik maia about 11 years
You could also look at github.com/norman/unidecoder
-
Gus Shortz about 11 years
I'm using Ruby 1.9.3. I'll take a look at both of those possible solutions. All I need is the above method's replacement of the listed characters, so if those solutions can do that, great, and thanks :)
-
Gus Shortz about 11 years
I finally found some references to the Unicode Hacks plugin (railslodge.com/plugins/316-unicode-hacks), which provides the chars method needed for the normalize method I mentioned. But it seems to no longer be supported.
-
Paul Fioravanti about 11 years
The documentation. Being able to set transliterations on a per-locale basis is also very powerful.
-
David about 10 years
This may not do what you expect on characters that don't have basic Latin mappings, for example Chinese characters. It just turns them into question marks.
(main)> I18n.transliterate("雙屬性集合之空間分群演算法-應用於地理資料")
=> "?????????????-???????"
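If silently losing characters like this is a concern, one option is to check which characters survive accent stripping before deciding what to do with them. The sketch below uses plain Ruby rather than I18n, assumes Ruby 2.2 or newer for String#unicode_normalize, and uses hypothetical method names (strip_marks, untransliterated_chars).

```ruby
# Assumption: Ruby 2.2+ for String#unicode_normalize.
# Decompose accented letters and remove the combining marks.
def strip_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Return the distinct non-ASCII characters that remain after stripping marks,
# i.e. the ones a Latin-only transliteration could not handle.
def untransliterated_chars(str)
  strip_marks(str).scan(/[^[:ascii:]]/).uniq
end

untransliterated_chars("Hé les mecs!") # => []
```

An empty result means the string reduced cleanly to ASCII; a non-empty one lists the characters that would need a separate mapping or a different strategy.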
-
pts over 9 years
This works only for ISO-8859-1. What makes you think it works for UTF-8?
-
Alter Lagos almost 9 years
Just a note for plain Ruby: if "I18n::InvalidLocale: :en is not a valid locale" is thrown, set I18n.available_locales = [:en] before calling I18n.transliterate.
-
CHawk about 8 years
Note: This does not work for everything. For example, "Bùi Viện" gets transliterated to "Bui Vi?n".
-
Michael almost 8 years
Didn't work for me:
(main)> I18n.transliterate "ŠKODA"
=> "ŠKODA"
-
user2398029 almost 8 years
Those cases should be reported as I18n bugs.
-
Alexander almost 7 years
This one works for UTF-8 and Ruby 2.2.3, and does exactly what I needed. It lacks some Romanian characters, though. I've added them:
string.tr( "ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšȘșſŢţŤťŦŧȚțÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž", "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSsSssTtTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")
-
Scarlet about 6 years
It depends too much on configuration, I think. It does not work for me either; I tried specifying different locales.
-
snowangel over 4 years
They're not using Rails, though.
-
duyetpt almost 3 years
Thanks, it worked. It lacks some Vietnamese characters. I've added them:
tr("ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôộỗổõöøŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ", "AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdDdDdEEEEeeeeeeEeEeEeEeEeeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiiiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOoooooooooOoOoOoooooooRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa")
-
Dorian over 2 years
-
Wordica about 2 years
Thanks man! Great thing xD
-
Julien about 2 years
Note that this changes periods '.' into dashes '-'.