NSString - Convert to pure alphabet only (i.e. remove accents+punctuation)

Solution 1

NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];

Solution 2

Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e ‌́ (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.

The reason this is important is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you strip out all accented letters, as some of these solutions may do, you'll end up with “dn” for both, so those strings will compare as equal.

So, you should decompose them first, so that you can strip the accents and leave the letters.

*Example from German. Thanks to Joris Weimar for providing it.
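
To make the decompose-then-strip idea concrete, here is a minimal sketch. One caveat, as far as I can tell from the documentation: letterCharacterSet covers the Unicode mark categories as well as the letter categories, so the combining accents left behind by decomposition may survive a filter built on it. The sketch therefore keeps only the unaccented Latin letters A-Z and a-z, which is an assumption about what "pure alphabet" means for your data. The helper name ASCIILettersOnly is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decomposes accented letters, then keeps only A-Z and a-z.
static NSString *ASCIILettersOnly(NSString *input) {
    // Split e.g. é (U+00E9) into e (U+0065) + combining acute accent (U+0301).
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];

    // Assumption: only unaccented Latin letters should survive the filter.
    NSCharacterSet *keep = [NSCharacterSet characterSetWithCharactersInString:
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

    // Everything outside the kept set (accents, punctuation, digits) is dropped.
    return [[decomposed componentsSeparatedByCharactersInSet:[keep invertedSet]]
            componentsJoinedByString:@""];
}
```

With this, “dän” and “dün” stay distinguishable, because only the combining marks are removed and the base letters remain. Letters from other scripts are dropped entirely, so this is only suitable if that is what you want.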

Solution 3

On a similar question, Ole Begemann suggests using stringByFoldingWithOptions: and I believe this is the best solution here:

NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];

Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.
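
As a rough sketch of the fixed-locale variant: the locale identifier en_US_POSIX is my assumption for "a fixed locale", and the helper name FoldedForMatching is hypothetical. I also combine the diacritic option with NSCaseInsensitiveSearch, which the snippet above does not do, so that case is folded in the same pass.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: folds away diacritics and case using a fixed locale,
// so the result does not depend on the user's settings.
static NSString *FoldedForMatching(NSString *input) {
    // Assumption: en_US_POSIX is an acceptable fixed locale for your data.
    NSLocale *fixedLocale = [NSLocale localeWithLocaleIdentifier:@"en_US_POSIX"];
    NSStringCompareOptions options = NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    return [input stringByFoldingWithOptions:options locale:fixedLocale];
}
```

Note that folding leaves punctuation and spaces in place, so those still have to be stripped separately if you need them gone.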

Solution 4

If you are trying to compare strings, use one of these methods. Don't try to change data.

- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale

You NEED to consider the user's locale to do things right with strings, particularly things like names. In most languages, characters like ä and å have nothing in common beyond looking similar. They are inherently distinct characters with their own meanings, and the actual rules and semantics differ from locale to locale.

The correct way to compare and sort strings is by considering the user's locale. Anything else is naive, wrong and very 1990s. Stop doing it.
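
For illustration, a minimal sketch of comparing two names with the third method listed above, leaving both strings untouched. The helper name CompareNames and the choice of options are my assumptions. Whether two particular strings come out as equal is up to the locale you pass in.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compares two names without modifying either string.
// Pass [NSLocale currentLocale] to follow the user's settings.
static NSComparisonResult CompareNames(NSString *a, NSString *b, NSLocale *locale) {
    // Assumption: case and diacritics should be ignored for this comparison.
    NSStringCompareOptions options = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    return [a compare:b
              options:options
                range:NSMakeRange(0, [a length])
               locale:locale];
}
```

A typical call would be CompareNames(first, second, [NSLocale currentLocale]), checking the result against NSOrderedSame.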

If you are trying to pass data to a system that cannot support non-ASCII, well, this is just a wrong thing to do. Pass it as data blobs.

https://developer.apple.com/library/ios/documentation/cocoa/Conceptual/Strings/Articles/SearchingStrings.html

In addition, normalize your strings first (see Peter Hosey's post), either precomposing or decomposing. Basically, pick one normalized form and use it consistently.

- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
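
A small sketch of why picking one form matters: the same visible character can be stored precomposed (one code unit) or decomposed (base letter plus combining mark), and a literal equality check is only reliable once both sides are in the same form. The helper name EqualAfterNormalizing is hypothetical, and choosing the precomposed canonical form here is arbitrary.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: literal equality after both strings are brought
// into the same normalized form (precomposed, canonical mapping).
static BOOL EqualAfterNormalizing(NSString *a, NSString *b) {
    NSString *normalizedA = [a precomposedStringWithCanonicalMapping];
    NSString *normalizedB = [b precomposedStringWithCanonicalMapping];
    return [normalizedA isEqualToString:normalizedB];
}
```

For example, @"\u00E9" and @"e\u0301" both render as é, and they compare as equal once normalized this way.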

No, it's not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision making. (A bit of non-English language experience helps, too.)

Solution 5

One important clarification to BillyTheKid18756's answer (Luiz corrected it, but the explanation of the code did not make this obvious):

DO NOT USE stringWithCString as a second step to remove accents: it can add unwanted characters at the end of your string, because the NSData is not NULL-terminated as stringWithCString expects. Alternatively, use it but append a NULL byte to your NSData, as Luiz did in his code.

I think a simpler answer is to replace:

NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];

with:

NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

Starting from BillyTheKid18756's code, here is the complete corrected version:

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Defining what characters to accept
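// autorelease matches the manual reference counting used below and avoids leaking the set
NSMutableCharacterSet *acceptedCharacters = [[[NSMutableCharacterSet alloc] init] autorelease];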
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];

Author: Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm trying to compare names without any punctuation, spaces, accents etc. At the moment I am doing the following:

    -(NSString*) prepareString:(NSString*)a {
        //remove any accents and punctuation;
        a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];
    
        a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
        a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
        a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
        a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
        a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
        a=[a lowercaseString];
        return a;
    }
    

    However, I need to do this for hundreds of strings and I need to make this more efficient. Any ideas?

  • Quinn Taylor
    Quinn Taylor almost 15 years
    I think Peter is trying to demonstrate 2 words with the same letters and different accents. :-)
  • Daredzik
    Daredzik over 12 years
    Just a note that this does work, but with one minor tweak: dataUsingEncoding returns NSData, not NSMutableData, so you have to do [[[text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] mutableCopy] autorelease]
  • Mike Keskinov
    Mike Keskinov almost 12 years
    This also will remove all non-ASCII letters like in 'жопень'
  • DZenBot
    DZenBot over 11 years
    Awesome! You made my day man. Since stringWithCString is deprecated, you must use stringWithCString:encoding instead. I used NSASCIIStringEncoding as well and it worked fine!
  • Ilker Baltaci
    Ilker Baltaci over 11 years
    [sanitizedData increaseLengthBy:1]; is crashing the app
  • Daniel S.
    Daniel S. over 10 years
    Funny German example. :D It's not German (Danish is "dänisch" in German), but it's still a nice example for outlining the problem. dict.leo.org/#/search=Danish
  • uchuugaka
    uchuugaka over 10 years
    So the common misunderstanding in English is assuming that those are in fact the same letter with different accents. In English they are often perceived as such, but with the proper locale consideration those are different letters in other locales. That's the inherent problem with this question. It's a naive and wrong approach to sorting.
  • ElmerCat
    ElmerCat over 9 years
    I like your answer, but I adapted it to work a little differently, with a string of allowed characters instead of a disallowed character set.
  • Edgar Carvalho
    Edgar Carvalho over 9 years
    I totally agree. Simple replace or regex doesn't make sense if you know other languages. The code should never contain language specific characters directly like an array of characters to replace etc. if it is not natively supported, try to find a library. Fortunately, obj c comes with good support for localization.
  • uchuugaka
    uchuugaka over 9 years
    Some of the best language support in an API.