Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)


Solution 1

it also removes spaces, dots, dashes, and who knows what else.

It shouldn't.

string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s

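You've mistyped: there should be a backslash before the x00 so that it refers to the NUL character. Without it, the class is read as the literal characters ‘x’ and ‘0’ plus the range from ‘0’ to \x7F, so the negated class matches every character below ‘0’, which includes spaces, dots, and dashes.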
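To see the difference, here is a minimal sketch in plain Ruby. It assumes Ruby 2.2 or later and uses `String#unicode_normalize(:nfkd)` as a stand-in for `mb_chars.normalize(:kd)`; the helper name `strip_non_ascii` and the constant `MISTYPED` are made up for illustration.

```ruby
# Hypothetical helper; unicode_normalize(:nfkd) stands in for mb_chars.normalize(:kd).
def strip_non_ascii(string)
  # NFKD decomposes "é" into "e" followed by a combining accent (U+0301).
  decomposed = string.unicode_normalize(:nfkd)
  # \x00-\x7F is the whole 7-bit ASCII range, so only non-ASCII characters are removed.
  decomposed.gsub(/[^\x00-\x7F]/, '')
end

# The mistyped class is read as "x", "0" and the range "0".."\x7F",
# so its negation also matches every ASCII character below "0".
MISTYPED = /[^x00-\x7F]/

puts 'a b-c.d'.gsub(MISTYPED, '')                 # space, dash and dot are lost
puts strip_non_ascii('Café au lait - déjà vu.')   # spaces and punctuation survive
```

With the backslash in place, spaces, dots, and dashes are plain ASCII and pass through untouched.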

/[^\-x00-\x7F]/n # So it would leave the dash alone

You've put the ‘-’ between the ‘\’ and the ‘x’, which will break the reference to the null character, and thus break the range.
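If the goal is to strip only the accents and keep other non-ASCII characters, a narrower pattern may be enough. The following is a sketch of a different technique from the regex above, not a fix to it: it assumes Ruby 2.2 or later, decomposes with `String#unicode_normalize(:nfd)`, and removes only nonspacing combining marks via `\p{Mn}`. The helper name `strip_accents` is hypothetical.

```ruby
# Hypothetical helper name, used only for illustration.
def strip_accents(string)
  string
    .unicode_normalize(:nfd)   # canonical decomposition: "é" becomes "e" + U+0301
    .gsub(/\p{Mn}/, '')        # drop nonspacing combining marks only
    .unicode_normalize(:nfc)   # recompose anything else that was decomposed
end

puts strip_accents('Café – “déjà vu” costs €5')
```

Letters that have no decomposition, such as ‘ø’ or ‘ß’, are left as they are, so this is not a full transliteration to ASCII.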

Solution 2

I'd use the transliterate method. See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate

Solution 3

It's not as neat as Iconv, but does what I think you want:

http://snippets.dzone.com/posts/show/2384

Author: Ivan

Programmer since 1995. Happy Railer for the past 9+ years.

Updated on July 18, 2022

Comments

  • Ivan, almost 2 years ago

    There is a very similar question already. One of the solutions uses code like this one:

    string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
    

    Which works wonders, until you notice it also removes spaces, dots, dashes, and who knows what else.

    I'm not really sure how the first code works, but could it be made to strip only accents? Or at the very least be given a list of chars to preserve? My knowledge of regexps is small, but I tried (to no avail):

    /[^\-x00-\x7F]/n # So it would leave the dash alone
    

    I'm about to do something like this:

    string.mb_chars.normalize(:kd).gsub('-', '__DASH__')
      .gsub(/[^x00-\x7F]/n, '').gsub('__DASH__', '-').to_s
    

    Atrocious? Yes...

    I've also tried:

    iconv = Iconv.new('UTF-8', 'US-ASCII//TRANSLIT') # Also tried ISO-8859-1
    iconv.iconv 'Café' # Throws an error: Iconv::IllegalSequence: "é"
    

    Help please?

  • Ivan, about 15 years ago
    Oh dear lord... please forgive me :) Thanks!
  • Mr_Nizzle, over 12 years ago
    What about the spaces? It's not preserving whitespace.