Algorithm: efficient way to remove duplicate integers from an array


Solution 1

How about:

void rmdup(int *array, int length)
{
    int *current, *end = array + length - 1;

    for ( current = array + 1; array < end; array++, current = array + 1 )
    {
        while ( current <= end )
        {
            if ( *current == *array )
            {
                /* duplicate: overwrite it with the last element
                   and shrink the array by one */
                *current = *end--;
            }
            else
            {
                current++;
            }
        }
    }
}

Should be O(n^2) time in the worst case, with O(1) extra space; note that element order is not preserved.
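
As written, the function gives the caller no way to know how many unique values remain. Following a suggestion in the comments below, here is a sketch of a variation that returns the count (the name rmdup_n is made up for this example):

int rmdup_n(int *array, int length)
{
    if (length <= 0)
        return 0;

    int *base = array;
    int *current, *end = array + length - 1;

    for ( current = array + 1; array < end; array++, current = array + 1 )
    {
        while ( current <= end )
        {
            if ( *current == *array )
                *current = *end--;   /* overwrite duplicate with last element */
            else
                current++;
        }
    }

    /* end now points at the last surviving element */
    return (int)(end - base) + 1;
}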

Solution 2

A solution suggested by my girlfriend is a variation of merge sort. The only modification is that during the merge step, duplicated values are simply disregarded. This solution is likewise O(n log n), with the sorting and duplicate removal combined into a single pass structure. I'm not sure whether that makes a practical difference, though. A sketch follows.
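
A minimal sketch of the idea in C, assuming an O(n) scratch buffer is acceptable (a true in-place merge is considerably harder; the function names here are made up):

#include <string.h>

/* Merge two sorted, internally duplicate-free runs into out, dropping
   values that repeat across the runs. Returns the count written. */
static int merge_unique(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < na || j < nb) {
        int v;
        if (j >= nb || (i < na && a[i] <= b[j]))
            v = a[i++];
        else
            v = b[j++];
        if (n == 0 || out[n - 1] != v)   /* disregard duplicates here */
            out[n++] = v;
    }
    return n;
}

/* Sorts array[0..length) and removes duplicates; returns the new length.
   scratch must have room for length ints. */
static int sort_unique(int *array, int length, int *scratch)
{
    if (length < 2)
        return length;

    int half = length / 2;
    int na = sort_unique(array, half, scratch);
    int nb = sort_unique(array + half, length - half, scratch);
    int n = merge_unique(array, na, array + half, nb, scratch);
    memcpy(array, scratch, n * sizeof *array);
    return n;
}

Because each merge emits a sorted, duplicate-free run, the duplicate check only has to look at the last value written.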

Solution 3

I've posted this once before on SO, but I'll reproduce it here because it's pretty cool. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call), and is typically O(N) in time complexity. The algorithm is as follows:

  1. Take the first element of the array; this will be the sentinel.
  2. Reorder the rest of the array, as much as possible, so that each element is in the position corresponding to its hash modulo the array size. As this step completes, duplicates (elements equal in value to one already placed) are discovered; set them equal to the sentinel.
  3. Move all elements whose index equals their hash to the beginning of the array.
  4. Move all elements that are equal to the sentinel, except the first element of the array, to the end of the array.
  5. What's left, between the properly hashed elements and the duplicates, are the elements that couldn't be placed at the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided the hashing has no pathological behaviour: even if there are no duplicates, approximately 2/3 of the elements are eliminated at each recursion, so the work per level forms a geometric series, roughly N + N/3 + N/9 + ... = O(N). Each level of recursion is O(n), where n is the number of elements left at that level. The only problem is that, in practice, it's slower than a quicksort when there are few duplicates, i.e. lots of collisions. When there are huge numbers of duplicates, however, it's amazingly fast.

Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

Here's an implementation in the D programming language:

import std.algorithm : swap;   // swap() used throughout comes from std.algorithm

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if(dataIn.length - start < 2)
        return;

    // Step 1: the first unprocessed element becomes the sentinel.
    invariant T sentinel = dataIn[start];
    T[] data = dataIn[start + 1..$];

    static hash_t getHash(T elem) {
        static if(is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if(__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    // Step 2: move each element toward the slot given by its hash modulo
    // the array length, marking discovered duplicates with the sentinel.
    for(size_t index = 0; index < data.length;) {
        if(data[index] == sentinel) {
            index++;
            continue;
        }

        auto hash = getHash(data[index]) % data.length;
        if(index == hash) {
            index++;
            continue;
        }

        if(data[index] == data[hash]) {
            data[index] = sentinel;
            index++;
            continue;
        }

        if(data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }

        auto hashHash = getHash(data[hash]) % data.length;
        if(hashHash != hash) {
            swap(data[index], data[hash]);
            if(hash < index)
                index++;
        } else {
            index++;
        }
    }


    // Step 3: compact the properly hashed elements to the front.
    size_t swapPos = 0;
    foreach(i; 0..data.length) {
        if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    // Step 4: swap the sentinel-valued slots (the duplicates) to the end.
    size_t sentinelPos = data.length;
    for(size_t i = swapPos; i < sentinelPos;) {
        if(data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    // Step 5: drop the duplicates at the tail and recurse on the collided
    // elements between swapPos and sentinelPos.
    dataIn = dataIn[0..sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}

Solution 4

One more implementation; it avoids sorting and uses no extra storage, though it is still O(n^2) in time:

/* Wrapped in a function here (name assumed) so the fragment compiles
   standalone. Removes duplicates in place, preserving the order of
   first occurrences, and returns the new length. */
int removeDuplicates(int array[], int Length)
{
    int i, j;

    /* new length of modified array */
    int NewLength = 1;

    if (Length == 0)
        return 0;

    for (i = 1; i < Length; i++) {

        /* compare array[i] against the unique prefix array[0..NewLength) */
        for (j = 0; j < NewLength; j++) {
            if (array[i] == array[j])
                break;
        }

        /* if none of the values in array[0..NewLength) is the same as
           array[i], copy the current value to the next free position */
        if (j == NewLength)
            array[NewLength++] = array[i];
    }

    return NewLength;
}

In this implementation there is no need to sort the array. Also, when a duplicate element is found, there is no need to shift all the elements after it by one position.

The output of this code is array[] with its first NewLength elements unique.

Here we start from the 2nd element of the array and compare it with every unique element kept before it. We hold an extra index variable 'NewLength' for modifying the input array; NewLength is initialized to 1, since array[0] is always unique.

The element array[1] is compared with array[0]. If they are different, the value at array[NewLength] is overwritten with array[1] and NewLength is incremented. If they are the same, NewLength is not modified.

So if we have an array [1 2 1 3 1], then:

In the first pass of the 'j' loop, array[1] (2) is compared with array[0] (1). They differ, so 2 is written to array[NewLength]; the array prefix is [1 2] and NewLength = 2.

In the second pass, array[2] (1) is compared with array[0]. They are equal, so the loop breaks and nothing is copied; the prefix stays [1 2] and NewLength = 2.

In the third pass, array[3] (3) differs from both 1 and 2, so it is appended, giving [1 2 3] and NewLength = 3. The final element, 1, is again a duplicate of array[0], so the result is [1 2 3] with NewLength = 3.
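
A quick sanity check of the walkthrough (this uses removeDuplicates, the function name assumed in the wrapped-up code above):

#include <stdio.h>

int main(void)
{
    int a[] = {1, 2, 1, 3, 1};
    int n = removeDuplicates(a, 5);

    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);   /* prints: 1 2 3 */
    return 0;
}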

Solution 5

If you are looking for the best asymptotic complexity, then sorting the array with an O(n log n) sort and then doing an O(n) traversal may be the best route. Without sorting, you are looking at O(n^2).

Edit: if you are just doing integers, then you can also use a radix sort to get O(n).
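
A minimal sketch of the sort-then-scan route, using the C library's qsort (this assumes sorting the input is acceptable, which the comments on the question debate; rmdup_sorted is a made-up name):

#include <stdlib.h>

/* Comparison callback for qsort(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort, then compact equal neighbours in one O(n) pass.
   Returns the number of unique elements; order is not preserved. */
int rmdup_sorted(int *array, int length)
{
    if (length == 0)
        return 0;

    qsort(array, length, sizeof *array, cmp_int);

    int n = 1;
    for (int i = 1; i < length; i++)
        if (array[i] != array[n - 1])
            array[n++] = array[i];
    return n;
}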


Comments

  • fujy
    fujy over 4 years

    I got this problem from an interview with Microsoft.

    Given an array of random integers, write an algorithm in C that removes duplicated numbers and return the unique numbers in the original array.

    E.g. Input: {4, 8, 4, 1, 1, 2, 9} Output: {4, 8, 1, 2, 9, ?, ?}

    One caveat is that the expected algorithm should not require the array to be sorted first. And when an element has been removed, the following elements must be shifted forward as well. Anyway, the values at the tail of the array, where elements were shifted forward, are negligible.

    Update: The result must be returned in the original array, and a helper data structure (e.g. a hashtable) should not be used. However, I guess order preservation is not necessary.

    Update 2: For those wondering why these impractical constraints exist: this was an interview question, and all these constraints were discussed during the thinking process to see how I could come up with different ideas.

  • Douglas Leeder
    Douglas Leeder over 14 years
    Indeed - an O(n) pass to find the largest element wouldn't increase the overall O() cost.
  • Douglas Leeder
    Douglas Leeder over 14 years
    It's not clear if the answer has to be in the original array.
  • ChrisW
    ChrisW over 14 years
    Jeff B's answer is merely O(n). Hash-sets and hash-dictionaries are the bee's knees.
  • Jeff B
    Jeff B over 14 years
    To do this without requiring a new array, you could simply replace the duplicate with an element popped off the end of the array, and redo the current loop, as the problem does not specify that order matters. This requires some extra bounds checking, but is very do-able.
  • sbi
    sbi over 14 years
    "One caveat is that the expected algorithm should not required the array to be sorted first."
  • Laurence Gonsalves
    Laurence Gonsalves over 14 years
    ChrisW: hash sets/dictionaries are only O(1) if you assume no collisions. (I'm not saying I wouldn't use them for this problem -- I probably would -- it's just a fallacy to claim that they're truly O(1).)
  • Greg Rogers
    Greg Rogers over 14 years
    It doesn't say you can't sort the array once you get it... Without using O(N) external memory, sorting is the only way to do it in O(N log N) or better.
  • Vitali
    Vitali over 14 years
    Actually, since you know the size of the array before-hand, you can guarantee O(1). Then you can trade off collisions vs how much additional memory you use.
  • fujy
    fujy over 14 years
    For the purpose of the problem, standard library utils should not be used. Regarding sorting, however, the more I think of it, the more unsure I am whether it is ok or not.
  • Mark Ransom
    Mark Ransom over 14 years
    You might want to rethink that downvote - newly posted conditions to the problem make Jeff B's solution invalid.
  • Steve Jessop
    Steve Jessop over 14 years
    Given that they're only integers, for simplicity you could assume 32bit and not bother looking for min/max: 2^32 bits is "only" 512MB, so finding the bounds is just a memory-use and O(1) time optimisation (granted, a hefty optimisation in the case of the example given). And if they're 64bit, it's irrelevant since you don't know that the min and max won't be further apart than the number of bits of memory you have. (A sketch of this bitmap approach appears after the end of this comment list.)
  • Mark Ransom
    Mark Ransom over 14 years
    You might want to elaborate on "traversal", since a naive erasure method might result in O(n^2) for large numbers of duplicates.
  • newpxsn
    newpxsn over 14 years
    Jeff's is still fine, you can do the comparison and motion in place. Take a look at my (sadly ignored) fully correct solution below.
  • Mark Ransom
    Mark Ransom over 14 years
    Great suggestion, but you'll need some bookkeeping to keep track of the end of each merge output. I actually did this once, and yes eliminating the duplicates as you merge makes it much faster.
  • Isabelle Wedin
    Isabelle Wedin over 14 years
    This was a good idea, until the question was edited. Your hashtable idea is apparently against the rules.
  • Mark Ransom
    Mark Ransom over 14 years
    In the modification to the original question, hash tables aren't allowed. Your two pointer approach is a nice way to compact the output once you've identified the duplicates, though.
  • LiraNuna
    LiraNuna over 14 years
    I don't get why this answer gets voted the most. It's written in Perl and uses vital features not available in C, which is what the question asks for.
  • mocj
    mocj over 14 years
    If you overwrite the duplicates you find with the value at the end of the array you can avoid the shifting of the whole array in your inner for() loop. That will bring you to O(n^2) from O(n^3). My C implementation is floating around here somewhere...
  • Dominik
    Dominik over 14 years
    I thought shifting was part of the requirement, but you are right of course.
  • Dominik
    Dominik over 14 years
    @mocj: I like your solution, looks very elegant. But I think it doesn't work if the last two elements are equal, because you stop checking for equality one before the last. (Commenting here because I have too little reputation to comment anywhere else :( )
  • mocj
    mocj over 14 years
    You're right, except that the original problem states that the values at the end of the array are negligible. Since you aren't returning the length of the modified array, the distinction between the last value and the second to last is unimportant when the two values are equal. Where does the caller interpret the end of the returned array to be?
  • LiraNuna
    LiraNuna over 14 years
    Theory aside, wouldn't allocating 512MB take more time than finding the min/max?
  • Steve Jessop
    Steve Jessop over 14 years
    Depends how much data there is, and what the min/max are. If you're looking at more than 512MB of input, then quite possibly it's faster to avoid that extra O(N) pass. Of course if you're looking at that much input, then it's less likely you have 512MB to spare. In cases where the min/max are close to 0/INT_MAX, then the optimisation doesn't help either. I'm just saying that although the first step obviously helps for small numbers, it can't avoid the fact that this algorithm uses UINT_MAX bits in the worst case, so you need to plan for that limitation.
  • Douglas Leeder
    Douglas Leeder over 14 years
    You may well be right - in any case clarification of the question means that using a bit-array is out. I'll leave this answer in case someone comes along later without the constraints and wants to view all possible answers.
  • Douglas Leeder
    Douglas Leeder over 14 years
    I think answers referring to C++ and C++ standard functions are useful, even if they don't answer the original question, as they provide a more rounded answer to people who find this question later.
  • Steve Jessop
    Steve Jessop over 14 years
    It's not clear whether O(N/2) extra space counts as the "helper data structure" banned in the question - I don't know whether the restriction is intended to stipulate O(1) extra space, or just to stipulate that the answer should not depend on a big ol' data structure implementation. Maybe a standard merge is fine. But if not, top tip: do not attempt to write an in-place merge sort in an interview, unless you really know what you're doing.
  • Kirk Broadhurst
    Kirk Broadhurst over 14 years
    This is the simple solution and is more than likely what the interview question is looking for.
  • Trevor Tippins
    Trevor Tippins over 14 years
    They might even be checking to see that you don't suffer from indulging in premature optimization unless they've given you runtime constraints too! :-)
  • IssEaPharma
    IssEaPharma over 14 years
    Nice one. I have a suggestion for improvement: the second nested loop can be changed to for(j=0; j < NewLength; j++) and the last if check can be changed to if (j == NewLength).
  • Byju
    Byju over 14 years
    That was a great suggestion. I have updated the code based on your comment.
  • Peter Recore
    Peter Recore over 14 years
    The question asked for C code, not Perl. Using Perl gets you hashtables and "push" for free. If I could do it in Scala you would just call input.removeDuplicates, but I doubt that would have been acceptable to the interviewers :)
  • fujy
    fujy over 14 years
    This solution does not need pre-sorting and is what I was looking for.
  • ziggystar
    ziggystar over 14 years
    Lol, though it's definitely faster to sort the array and work on the sorted one. Sorting should be provided by an API and is IMHO no premature optimization.
  • TMN
    TMN almost 14 years
    OTOH, the additional code required to implement a sorting tree might be less (memory) efficient than the simple solution, and is probably less efficient at run-time for small (say fewer than 100 elements) arrays.
  • Eastern Monk
    Eastern Monk over 12 years
    The 'seen' check is essentially a function of 'n', hence this algorithm is quadratic though it might appear to be linear.
  • dldnh
    dldnh about 12 years
    care to elaborate or provide a reference?
  • Blastfurnace
    Blastfurnace almost 12 years
    Test this with the input arrayInteger = {100,10,1};
  • David Gorsline
    David Gorsline over 11 years
    The OP specified that helper data structures be avoided, and that the results be returned in the original array. Your answer, unfortunately, does not meet these requirements.
  • Ankit Roy
    Ankit Roy over 11 years
    Extremely cool, underrated answer! I like the idea of using the element in position 1 as a sentinel value. If I could make a couple of small suggestions, it would be to change step 2 to include "each element is in the position corresponding to its hash modulo the array size", and perhaps clarify that the duplicates to be set to the sentinel are the elements that have the same value (as opposed to the same hash, or the same hash modulo array size).
  • Shail
    Shail about 11 years
    Shouldn't it be while ( current <= end ) instead of while ( current < end )?
  • Hardy Feng
    Hardy Feng almost 11 years
    Great idea. But it requires that the remaining data keep its original order.
  • Yuriy Chernyshov
    Yuriy Chernyshov almost 11 years
    Fails at least if we have the same value throughout the array, e.g. {1,1,1,1,1,1}. Useless code.
  • Yuriy Chernyshov
    Yuriy Chernyshov almost 11 years
    Fails at least with these inputs: {1,1,1,1,1,1,1} {0,0,0,0,0,1,1,1,1,1,1}
  • kumar
    kumar over 10 years
    This is not taking care of all inputs; e.g., 12223 will not be solved by this solution.
  • Admin
    Admin over 10 years
    @unkulunkulu o.O WAT??
  • SheetJS
    SheetJS over 10 years
    question is about C; this answer is C++
  • Pawan
    Pawan about 10 years
    Why was this accepted as the right answer? If order preservation is not necessary, then isn't it better to just use merge sort, O(n log n), and then remove the repeated elements in O(n)? Total complexity: O(n log n), which is much better than this solution.
  • laksbv
    laksbv about 10 years
    This solution is not working, and there is no way to find the length of the new array.
  • JavaSa
    JavaSa about 10 years
    Well, what is the complexity of this? Isn't it also O(n^2)?
  • fujy
    fujy almost 10 years
    The question does not allow extra data structure to be used.
  • Mike B
    Mike B over 9 years
    A paper that describes what your girlfriend suggested follows: dc-pubs.dbs.uni-leipzig.de/files/…
  • greybeard
    greybeard over 9 years
    Which question does this answer? How does it remove (possibly many) duplicates, in sublinear space, without hashing? Please elaborate do xor operation by traversing [...] elements.
  • Jonathan Leffler
    Jonathan Leffler over 9 years
    If you change the return type to int, and add int *base = array; at the top of the function and return current - base + 1; at the end, then you know how many entries exist in the compressed table. With that modification, the code in this seems to work correctly unless the array size is zero (but GIGO -- garbage in, garbage out; or add if (length <= 0 return 0; at the top).
  • Peregring-lk
    Peregring-lk about 9 years
    It's O( (n/2)^2) and Ω(n); and no idea about the average case :P
  • Paul Hankin
    Paul Hankin about 9 years
    So many upvotes, but this isn't efficient: it's O(n^2) when there are few duplicates.
  • greybeard
    greybeard about 9 years
    For all the bold glyphs, what did you make of helper data structure (e.g. hashtable) should not be used?
  • Sharief Muzammil
    Sharief Muzammil about 9 years
    Not necessarily needed. I just highlighted those for the purpose of understanding.
  • Mozein
    Mozein about 9 years
    This is the brute force solution. Trust me, Microsoft will not like this one.
  • Sathesh
    Sathesh almost 9 years
    Won't arr[i+1] throw an ArrayIndexOutOfBoundsException for the last element?
  • Eric Alford
    Eric Alford almost 7 years
    There are too many upvotes in this. As @Mozein said this is the brute force solution and is not the right answer. Microsoft will definitely turn someone away for presenting this as the solution.
  • GabrielBB
    GabrielBB almost 6 years
    @Sathesh No. Because of "< arr.length-1"
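
As a footnote to Steve Jessop's bit-array comments above, here is a hedged sketch of the 512 MB bitmap approach for 32-bit integers (the name rmdup_bitmap is made up, and the approach is ruled out by the question's later ban on helper data structures; it trades a very large allocation for a single O(n) pass):

#include <stdint.h>
#include <stdlib.h>

/* O(n) dedup using one bit per possible 32-bit value (2^32 bits = 512 MB).
   Keeps the first occurrence of each value, preserving order.
   Returns the new length, or -1 if the bitmap cannot be allocated. */
int rmdup_bitmap(int *array, int length)
{
    uint8_t *seen = calloc((size_t)1 << 29, 1);   /* 2^32 / 8 bytes */
    if (seen == NULL)
        return -1;

    int n = 0;
    for (int i = 0; i < length; i++) {
        uint32_t v = (uint32_t)array[i];
        if (!(seen[v >> 3] & (1u << (v & 7)))) {
            seen[v >> 3] |= (uint8_t)(1u << (v & 7));
            array[n++] = array[i];
        }
    }

    free(seen);
    return n;
}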