Removing duplicates in a string in Python


Solution 1

Use a hashtable to store the characters discovered so far (O(1) access) and loop through the string. If a character is already in the hashtable, discard it; if it isn't, add it to the hashtable and to a result string.

Overall: O(n) time (and space).

The naive solution is to search for the character in the result string as you process each one. That is O(n^2).
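
For contrast, a minimal Python sketch of the naive variant (the function name is illustrative; the O(n) set-based version appears as code in Solution 2 below):

def remove_duplicates_naive(s):
    res = ""
    for c in s:
        if c not in res:   # rescans the growing result string, hence O(n^2) overall
            res += c
    return res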

Solution 2

In Python

>>> ''.join(set("aaaabbbccdbdbcd"))
'acbd'

If the order needs to be preserved

>>> q="aaaabbbccdbdbcd"                    # this one is not
>>> ''.join(sorted(set(q),key=q.index))    # so efficient
'abcd'

or

>>> S=set()
>>> res=""
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   res+=c
...   S.add(c)
... 
>>> res
'abcd'

or

>>> S=set()
>>> L=[]
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   L.append(c)
...   S.add(c)
... 
>>> ''.join(L)
'abcd'

In Python 3.1

>>> from collections import OrderedDict
>>> ''.join(list(OrderedDict((c,0) for c in "aaaabbbccdbdbcd").keys()))
'abcd'
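
As a side note beyond the original answers: since Python 3.7, plain dicts preserve insertion order, so the same one-liner works without OrderedDict:

>>> ''.join(dict.fromkeys("aaaabbbccdbdbcd"))
'abcd'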

Solution 3

This is closely related to the question: Detecting repetition with infinite input.

The hashtable approach may not be optimal, depending on your input. Hashtables have a certain amount of overhead (buckets, entry objects), which is huge compared to the actual stored char. (If your target environment is Java it is even worse, as the HashMap is of type Map<Character,?>.) The worst-case runtime for a hashtable access is O(n) due to collisions.

You need only 8 KB to represent all 2-byte Unicode characters in a plain BitSet. This may be optimized if your input character set is more restricted, or by using a compressed BitSet (as long as it stays sparse). The runtime performance of a BitSet is favorable: it is O(1).
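
Python has no BitSet class, but as a rough sketch of the same idea (this substitution is mine, not part of the original answer), you can pack the "seen" flags into one arbitrary-precision integer used as a bitmask:

def remove_duplicates_bitmask(s):
    seen = 0                      # bit i is set once the character with code point i was seen
    out = []
    for c in s:
        bit = 1 << ord(c)
        if not seen & bit:
            seen |= bit
            out.append(c)
    return ''.join(out)

print(remove_duplicates_bitmask("aaaabbbccdbdbcd"))   # abcd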

Solution 4

Keep an array of 256 "seen" booleans, one for each possible character. Stream your string. If you haven't seen the character before, output it and set the "seen" flag for that character.
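
A minimal Python sketch of this approach, assuming the input is restricted to single-byte (e.g. ASCII/Latin-1) characters:

def remove_duplicates_seen_array(s):
    seen = [False] * 256          # one "seen" flag per possible character value
    out = []
    for c in s:
        i = ord(c)
        if not seen[i]:
            seen[i] = True
            out.append(c)
    return ''.join(out)

print(remove_duplicates_seen_array("aaaabbbccdbdbcd"))   # abcd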

Solution 5

PHP algorithm - O(n):

function remove_duplicate_chars($str) {
    if (2 > $len = strlen($str)) { // strings of length 0 or 1 have no duplicates
        return $str;
    }
    $flags = array_fill(0,256,false); // one "seen" flag per possible byte value
    $flags[ord($str[0])]=true;
    $j = 1; // write position for the next unique character
    for ($i=1; $i<$len; $i++) {
        $ord = ord($str[$i]);
        if (!$flags[$ord]) {
            $str[$j] = $str[$i]; // compact unique characters in place
            $j++;
            $flags[$ord] = true;
        }
    }
    if ($j<$i) { // if duplicates were removed
        $str = substr($str,0,$j);
    }
    return $str;
}

echo remove_duplicate_chars('aaaabbbccdbdbcd'); // result: 'abcd'

Comments

  • SuperString
    SuperString almost 2 years

    What is an efficient algorithm to remove all duplicates in a string?

    For example: aaaabbbccdbdbcd

    Required result: abcd

  • Adriaan Stander
    Adriaan Stander about 14 years
    +1. Or, if they have access to it, HashSet: msdn.microsoft.com/en-us/library/bb495294.aspx
  • Rob Fonseca-Ensor
    Rob Fonseca-Ensor about 14 years
    O(n^2) isn't very efficient (once the data set gets big enough). For small strings this is probably faster than a hashset-based lookup, though.
  • Admin
    Admin about 14 years
    It has not been said what encoding is used, though
  • Amgad Fahmi
    Amgad Fahmi about 14 years
    I agree, what about the new one?
  • Carson Myers
    Carson Myers about 14 years
    I knew set would be awesome for this, but I'm new to Python and was trying to figure out how to join them while you posted this... Now I know!
  • Ritsaert Hornstra
    Ritsaert Hornstra about 14 years
    If you have a large string compared to the number of possible character values (e.g. if it is ASCII), you might use an array of bools instead of a hashtable
  • Thomas Jung
    Thomas Jung about 14 years
    The best case to retrieve a value from a hashtable is O(1) and the worst case O(n). The overall worst-case complexity for the algorithm is O(n^2).
  • cjk
    cjk about 14 years
    String concatenation in a loop will be slower than searching for the character within the string...
  • jk.
    jk. about 14 years
    that is irrelevant in this case, as by definition of the algorithm you have either 0 or 1 items for each hash key
  • Thomas Jung
    Thomas Jung about 14 years
    @jk The hashtable always has 0 or 1 entries per key. The O(n) worst case is when all n values land in one bucket.
  • Matthieu M.
    Matthieu M. about 14 years
    @Thomas Jung: For this problem computing a perfect hash function is easy (typically the ASCII value, or at worst the Unicode code point), therefore you perform access in O(1).
  • Matthieu M.
    Matthieu M. about 14 years
    I am afraid you are somewhat mixing concepts and implementations. I view the fact that you are using a BitSet to implement your own hashtable as proof that the hashtable is a perfectly viable solution.
  • Thomas Jung
    Thomas Jung about 14 years
    @Matthieu Not exactly. Suppose you have a perfect hash function from char -> 2-byte int. This is easy. Now your hashtable size is smaller than 2^16, say 15. When you enter 2 values it is quite probable that you will have a collision (1/15 for the second value). The index is some complicated version of idx = hash % size. If you want absolutely no collisions you have to create a hashtable of size 2^16.
  • Thomas Jung
    Thomas Jung about 14 years
    @Matthieu Using a hashtable or a BitSet has certain trade-offs. The hashtable works best for small sets of characters. The BitSet works best when the number of characters is large or can be restricted to a known range. A BitSet is not a hashtable. The hashtable here is used as a set, as someone mentioned, and the BitSet is used analogously. That you can replace one with the other does not mean they are equally good solutions.
  • Thomas Jung
    Thomas Jung about 14 years
    @Matthieu I've realized that I've cut corners a bit. You can of course create a perfect hash function for hashtables with size < 2^16. Cuckoo hashing, for example, has O(1) worst-case access complexity but worst-case O(n) for puts. I suppose there is no hashtable that has worst-case complexity of O(1) for all operations.
  • Matthieu M.
    Matthieu M. about 14 years
    It depends on the input size: if you can manage to have an upper bound for the size of a bucket, then you can always pretend to be O(1), even though it could be daunting :x Here it seems easy enough for ASCII characters (256 of them) and of course a bit more difficult if you wish to take all the Unicode code points into account, yet with a sufficiently big bitset you could have good performance without too much memory (server-scale)
  • John La Rooy
    John La Rooy about 14 years
    @recursive, I added some order preserving options
  • letsc
    letsc over 12 years
    O(1) space??? I see a vector of bools... Isn't it the same as an array of 256 bools?
  • JoeG
    JoeG over 12 years
    @smartmuki: It's O(1) space because the size of the vector<bool> does not vary according to the size of the input - it's 256 bools no matter what the input is.
  • syntagma
    syntagma over 10 years
    What does arr[s[i]] == true mean?
  • TheMan
    TheMan over 10 years
    Just having arr[s[i]] there would do too. Is that what you meant?
  • syntagma
    syntagma over 10 years
    arr[s[i]] is what I'm interested in - do I understand correctly that you use char (which is an int) to index the array?
  • TheMan
    TheMan over 10 years
    Yes, that's right and it should work. What's the issue with that?