Ruby method to remove accents from UTF-8 international characters


Solution 1

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
 => true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
 => "He les mecs!"

Solution 2

The parameterize method (part of ActiveSupport) is a simple way to strip special characters so that the string can be used as a human-readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"
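
For readers without ActiveSupport, the same idea can be sketched in plain Ruby. This is only an approximation under stated assumptions: slugify is a hypothetical helper name (not ActiveSupport's implementation), it needs Ruby 2.2 or later for String#unicode_normalize, and it only handles letters that decompose into a base letter plus combining marks.

```ruby
# Hypothetical helper, NOT ActiveSupport's parameterize.
# Assumes Ruby >= 2.2 (String#unicode_normalize) and UTF-8 input.
def slugify(text, separator: "-")
  # Decompose accented letters (NFD), then drop the combining marks.
  ascii = text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  # Lowercase and turn every run of other characters into the separator.
  slug = ascii.downcase.gsub(/[^a-z0-9_-]+/, separator)
  sep = Regexp.escape(separator)
  # Collapse repeated separators and trim them from both ends.
  slug.gsub(/(?:#{sep}){2,}/, separator).gsub(/\A#{sep}|#{sep}\z/, "")
end

puts slugify("Françoise Isaïe")
puts slugify("L'Oréal", separator: " ")
```

As with parameterize, punctuation such as periods and apostrophes is replaced by the separator, so this suits identifiers better than display names.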

Solution 3

So far the following is the only way I've been able to accomplish what I need:

str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.
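
One less hand-maintained alternative, offered as a sketch rather than a drop-in replacement, is to let Unicode normalization do most of the work: decompose each accented letter into its base letter plus combining marks, then delete the marks. The names strip_accents, FALLBACKS and FALLBACK_PATTERN are hypothetical, the fallback table is illustrative and incomplete, and String#unicode_normalize assumes Ruby 2.2 or later (so it does not apply to 1.9.3).

```ruby
# Letters with no canonical decomposition need explicit replacements.
# Illustrative table only; extend it for the alphabets you care about.
FALLBACKS = {
  "Æ" => "AE", "æ" => "ae", "Œ" => "OE", "œ" => "oe",
  "Ø" => "O",  "ø" => "o",  "Đ" => "D",  "đ" => "d",
  "Ð" => "D",  "ð" => "d",  "Ł" => "L",  "ł" => "l",
  "ß" => "ss", "ı" => "i"
}.freeze

FALLBACK_PATTERN = Regexp.union(FALLBACKS.keys)

# Hypothetical helper; assumes Ruby >= 2.2 and UTF-8 input.
def strip_accents(text)
  text.unicode_normalize(:nfd)            # "é" becomes "e" + combining acute
      .gsub(/\p{Mn}/, "")                 # drop the combining marks
      .gsub(FALLBACK_PATTERN, FALLBACKS)  # map letters that do not decompose
end

puts strip_accents("Hé les mecs!")
puts strip_accents("Bùi Viện")
```

Unlike tr, the hash-based gsub can map one character to several (for example "æ" to "ae"). Characters outside the table that carry no combining marks, such as Chinese text, pass through unchanged.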

Solution 4

If you are using Rails:

"L'Oréal".parameterize(separator: ' ')

Author: Gus Shortz

Updated on July 05, 2022

Comments

  • Gus Shortz
    Gus Shortz almost 2 years

    I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (i.e., accented letters), and I want to create a copy with the accents removed.

    I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

      # Utility method that returns an ASCIIfied, downcased, and sanitized string.
      # It relies on the Unicode Hacks plugin by means of String#chars. We assume
      # $KCODE is 'u' in environment.rb. By now we support a wide range of latin
      # accented letters, based on the Unicode Character Palette bundled in Macs.
      def self.normalize(str)
         n = str.chars.downcase.strip.to_s
         n.gsub!(/[à áâãäåÄÄ?]/u,    'a')
         n.gsub!(/æ/u,                  'ae')
         n.gsub!(/[ÄÄ?]/u,                'd')
         n.gsub!(/[çÄ?ÄÄ?Ä?]/u,          'c')
         n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
         n.gsub!(/Æ?/u,                   'f')
         n.gsub!(/[ÄÄ?Ä¡Ä£]/u,            'g')
         n.gsub!(/[ĥħ]/,                'h')
         n.gsub!(/[ììíîïīĩĭ]/u,     'i')
         n.gsub!(/[įıĳĵ]/u,            'j')
         n.gsub!(/[ķĸ]/u,               'k')
         n.gsub!(/[Å?ľĺļÅ?]/u,         'l')
         n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u,       'n')
         n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u,  'o')
         n.gsub!(/Å?/u,                  'oe')
         n.gsub!(/Ä?/u,                   'q')
         n.gsub!(/[Å?Å?Å?]/u,             'r')
         n.gsub!(/[Å?Å¡Å?ÅÈ?]/u,          's')
         n.gsub!(/[ťţŧÈ?]/u,           't')
         n.gsub!(/[ùúûüūůűŭũų]/u,'u')
         n.gsub!(/ŵ/u,                   'w')
         n.gsub!(/[ýÿŷ]/u,             'y')
         n.gsub!(/[žżź]/u,             'z')
         n.gsub!(/\s+/,                   ' ')
         n.gsub!(/[^\sa-z0-9_-]/,          '')
         n
      end
    

    Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

    I am not using Rails, nor do I plan on doing so.

    • Huluk
      Huluk about 11 years
      Which ruby version are you using?
    • MurifoX
      MurifoX about 11 years
    • amalrik maia
      amalrik maia about 11 years
      you could also look at: github.com/norman/unidecoder
    • Gus Shortz
      Gus Shortz about 11 years
      I'm using Ruby 1.9.3, I'll take a look at both of those possible solutions, all I need is the above method's replacement of the listed characters, so if those solutions can do that great and thanks :)
    • Gus Shortz
      Gus Shortz about 11 years
      I finally found some references to the Unicode Hack plugin (railslodge.com/plugins/316-unicode-hacks), that provides the chars method needed for the normalize method I mentioned. But it seems to no longer be supported
  • Paul Fioravanti
    Paul Fioravanti about 11 years
    The documentation. Being able to set transliterations on a per-locale basis is also very powerful.
  • David
    David about 10 years
    This may not do what you expect on characters that don't have basic Latin mappings--for example Chinese characters. It just turns them to question marks. (main)> I18n.transliterate("雙屬性集合之空間分群演算法-應用於地理資料") => "?????????????-???????"
  • pts
    pts over 9 years
    This works only for ISO-8859-1. What makes you think it works for UTF-8?
  • Alter Lagos
    Alter Lagos almost 9 years
    Just a note for plain Ruby: if "I18n::InvalidLocale: :en is not a valid locale" is raised, set I18n.available_locales = [:en] before calling I18n.transliterate.
  • CHawk
    CHawk about 8 years
    Note: This does not work for everything. Example "Bùi Viện" gets translated to "Bui Vi?n"
  • Michael
    Michael almost 8 years
    Didn't work for me: (main)> I18n.transliterate "ŠKODA" => "ŠKODA"
  • user2398029
    user2398029 almost 8 years
    Those cases should be reported as I18n bugs.
  • Alexander
    Alexander almost 7 years
    This one works for UTF-8 and Ruby 2.2.3, and does exactly what I needed. It lacks some Romanian characters, though, so I've added them: string.tr("ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšȘșſŢţŤťŦŧȚțÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž", "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSsSssTtTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")
  • Scarlet
    Scarlet about 6 years
    It depends too much on configuration, I think. It does not work for me either; I tried specifying different locales.
  • snowangel
    snowangel over 4 years
    They're not using Rails, though.
  • duyetpt
    duyetpt almost 3 years
    Thanks it worked. Lack some Vietnamese chars. I 've added them: tr("ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêểệễëĒēĔĕĖėĘęĚ‌​ěẹĜĝĞğĠġĢģĤĥĦħÌÍÎÏìí‌​îïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹ‌​ĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋ‌​ÒÓÔÕÖØòóôộỗổõöøŌōŎŏŐ‌​őọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠ‌​šſŢţŤťŦŧÙÚÛÜùúûüŨũŪū‌​ŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶ‌​ŷŸŹźŻżŽžứừửựữốồộỗổờó‌​ợỏỡếềễểệẩẫấầậỳỹýỷỵặẵ‌​ẳằắ", "AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdDdDdEEEEeeeeeeEeEeEeEeEeeGg‌​GgGgGgHhHhIIIIiiiiIi‌​IiIiIiIiiiJjKkkLlLlL‌​lLlLlNnNnNnNnnNnOOOO‌​OOoooooooooOoOoOoooo‌​oooRrRrRrSsSsSsSssTt‌​TtTtUUUUuuuuUuUuUuUu‌​UuUuuuuuuuWwYyyYyYZz‌​ZzZzuuuuuooooooooooe‌​eeeeaaaaayyyyyaaaaa"‌​)
  • Dorian
    Dorian over 2 years
    parameterize uses I18n.transliterate: github.com/rails/rails/blob/main/activesupport/lib/…
  • Wordica
    Wordica about 2 years
    thanks man! great thing xD
  • Julien
    Julien about 2 years
    note that this changes periods '.' into dashes '-'