Algorithm: efficient way to remove duplicate integers from an array


Solution 1

How about:

void rmdup(int *array, int length)
{
    int *current, *end = array + length - 1;

    for ( current = array + 1; array < end; array++, current = array + 1 )
    {
        while ( current <= end )
        {
            if ( *current == *array )
            {
                /* duplicate: overwrite it with the last element
                   and shrink the array by one */
                *current = *end--;
            }
            else
            {
                current++;
            }
        }
    }
}

Should be O(n^2) time in the worst case, with O(1) extra space; note that element order is not preserved.
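
As written, the function gives the caller no way to know how many unique values remain. Following a suggestion in the comments below, here is a sketch of a variation that returns the count (the name rmdup_n is made up for this example):

int rmdup_n(int *array, int length)
{
    if (length <= 0)
        return 0;

    int *base = array;
    int *current, *end = array + length - 1;

    for ( current = array + 1; array < end; array++, current = array + 1 )
    {
        while ( current <= end )
        {
            if ( *current == *array )
                *current = *end--;   /* overwrite duplicate with last element */
            else
                current++;
        }
    }

    /* end now points at the last surviving element */
    return (int)(end - base) + 1;
}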

Solution 2

A solution suggested by my girlfriend is a variation of merge sort. The only modification is that during the merge step, duplicated values are simply disregarded. This solution is likewise O(n log n), with the sorting and duplicate removal combined into a single pass structure. I'm not sure whether that makes a practical difference, though. A sketch follows.
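
A minimal sketch of the idea in C, assuming an O(n) scratch buffer is acceptable (a true in-place merge is considerably harder; the function names here are made up):

#include <string.h>

/* Merge two sorted, internally duplicate-free runs into out, dropping
   values that repeat across the runs. Returns the count written. */
static int merge_unique(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < na || j < nb) {
        int v;
        if (j >= nb || (i < na && a[i] <= b[j]))
            v = a[i++];
        else
            v = b[j++];
        if (n == 0 || out[n - 1] != v)   /* disregard duplicates here */
            out[n++] = v;
    }
    return n;
}

/* Sorts array[0..length) and removes duplicates; returns the new length.
   scratch must have room for length ints. */
static int sort_unique(int *array, int length, int *scratch)
{
    if (length < 2)
        return length;

    int half = length / 2;
    int na = sort_unique(array, half, scratch);
    int nb = sort_unique(array + half, length - half, scratch);
    int n = merge_unique(array, na, array + half, nb, scratch);
    memcpy(array, scratch, n * sizeof *array);
    return n;
}

Because each merge emits a sorted, duplicate-free run, the duplicate check only has to look at the last value written.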

Solution 3

I've posted this once before on SO, but I'll reproduce it here because it's pretty cool. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call), and is typically O(N) in time complexity. The algorithm is as follows:

  1. Take the first element of the array; this will be the sentinel.
  2. Reorder the rest of the array, as much as possible, so that each element is in the position corresponding to its hash modulo the array size. As this step completes, duplicates (elements equal in value to one already placed) are discovered; set them equal to the sentinel.
  3. Move all elements whose index equals their hash to the beginning of the array.
  4. Move all elements that are equal to the sentinel, except the first element of the array, to the end of the array.
  5. What's left, between the properly hashed elements and the duplicates, are the elements that couldn't be placed at the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided the hashing has no pathological behaviour: even if there are no duplicates, approximately 2/3 of the elements are eliminated at each recursion, so the work per level forms a geometric series, roughly N + N/3 + N/9 + ... = O(N). Each level of recursion is O(n), where n is the number of elements left at that level. The only problem is that, in practice, it's slower than a quicksort when there are few duplicates, i.e. lots of collisions. When there are huge numbers of duplicates, however, it's amazingly fast.

Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

Here's an implementation in the D programming language:

import std.algorithm : swap;   // swap() used throughout comes from std.algorithm

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if(dataIn.length - start < 2)
        return;

    // Step 1: the first unprocessed element becomes the sentinel.
    invariant T sentinel = dataIn[start];
    T[] data = dataIn[start + 1..$];

    static hash_t getHash(T elem) {
        static if(is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if(__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    // Step 2: move each element toward the slot given by its hash modulo
    // the array length, marking discovered duplicates with the sentinel.
    for(size_t index = 0; index < data.length;) {
        if(data[index] == sentinel) {
            index++;
            continue;
        }

        auto hash = getHash(data[index]) % data.length;
        if(index == hash) {
            index++;
            continue;
        }

        if(data[index] == data[hash]) {
            data[index] = sentinel;
            index++;
            continue;
        }

        if(data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }

        auto hashHash = getHash(data[hash]) % data.length;
        if(hashHash != hash) {
            swap(data[index], data[hash]);
            if(hash < index)
                index++;
        } else {
            index++;
        }
    }


    // Step 3: compact the properly hashed elements to the front.
    size_t swapPos = 0;
    foreach(i; 0..data.length) {
        if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    // Step 4: swap the sentinel-valued slots (the duplicates) to the end.
    size_t sentinelPos = data.length;
    for(size_t i = swapPos; i < sentinelPos;) {
        if(data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    // Step 5: drop the duplicates at the tail and recurse on the collided
    // elements between swapPos and sentinelPos.
    dataIn = dataIn[0..sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}

Solution 4

One more implementation; it avoids sorting and uses no extra storage, though it is still O(n^2) in time:

/* Wrapped in a function here (name assumed) so the fragment compiles
   standalone. Removes duplicates in place, preserving the order of
   first occurrences, and returns the new length. */
int removeDuplicates(int array[], int Length)
{
    int i, j;

    /* new length of modified array */
    int NewLength = 1;

    if (Length == 0)
        return 0;

    for (i = 1; i < Length; i++) {

        /* compare array[i] against the unique prefix array[0..NewLength) */
        for (j = 0; j < NewLength; j++) {
            if (array[i] == array[j])
                break;
        }

        /* if none of the values in array[0..NewLength) is the same as
           array[i], copy the current value to the next free position */
        if (j == NewLength)
            array[NewLength++] = array[i];
    }

    return NewLength;
}

In this implementation there is no need to sort the array. Also, when a duplicate element is found, there is no need to shift all the elements after it by one position.

The output of this code is array[] with its first NewLength elements unique.

Here we start from the 2nd element of the array and compare it with every unique element kept before it. We hold an extra index variable 'NewLength' for modifying the input array; NewLength is initialized to 1, since array[0] is always unique.

The element array[1] is compared with array[0]. If they are different, the value at array[NewLength] is overwritten with array[1] and NewLength is incremented. If they are the same, NewLength is not modified.

So if we have an array [1 2 1 3 1], then:

In the first pass of the 'j' loop, array[1] (2) is compared with array[0] (1). They differ, so 2 is written to array[NewLength]; the array prefix is [1 2] and NewLength = 2.

In the second pass, array[2] (1) is compared with array[0]. They are equal, so the loop breaks and nothing is copied; the prefix stays [1 2] and NewLength = 2.

In the third pass, array[3] (3) differs from both 1 and 2, so it is appended, giving [1 2 3] and NewLength = 3. The final element, 1, is again a duplicate of array[0], so the result is [1 2 3] with NewLength = 3.
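
A quick sanity check of the walkthrough (this uses removeDuplicates, the function name assumed in the wrapped-up code above):

#include <stdio.h>

int main(void)
{
    int a[] = {1, 2, 1, 3, 1};
    int n = removeDuplicates(a, 5);

    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);   /* prints: 1 2 3 */
    return 0;
}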

Solution 5

If you are looking for the best asymptotic complexity, then sorting the array with an O(n log n) sort and then doing an O(n) traversal may be the best route. Without sorting, you are looking at O(n^2).

Edit: if you are just doing integers, then you can also use a radix sort to get O(n).
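
A minimal sketch of the sort-then-scan route, using the C library's qsort (this assumes sorting the input is acceptable, which the comments on the question debate; rmdup_sorted is a made-up name):

#include <stdlib.h>

/* Comparison callback for qsort(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort, then compact equal neighbours in one O(n) pass.
   Returns the number of unique elements; order is not preserved. */
int rmdup_sorted(int *array, int length)
{
    if (length == 0)
        return 0;

    qsort(array, length, sizeof *array, cmp_int);

    int n = 1;
    for (int i = 1; i < length; i++)
        if (array[i] != array[n - 1])
            array[n++] = array[i];
    return n;
}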


Comments

  • fujy
    fujy over 4 years

    I got this problem from an interview with Microsoft.

    Given an array of random integers, write an algorithm in C that removes duplicated numbers and return the unique numbers in the original array.

    E.g. Input: {4, 8, 4, 1, 1, 2, 9} Output: {4, 8, 1, 2, 9, ?, ?}

    One caveat is that the expected algorithm should not require the array to be sorted first. And when an element has been removed, the following elements must be shifted forward as well. Anyway, the values at the tail of the array, where elements were shifted forward, are negligible.

    Update: The result must be returned in the original array, and a helper data structure (e.g. a hashtable) should not be used. However, I guess order preservation is not necessary.

    Update 2: For those wondering why these impractical constraints exist: this was an interview question, and all these constraints were discussed during the thinking process to see how I could come up with different ideas.

  • Douglas Leeder
    Douglas Leeder over 14 years
    Indeed - an O(n) pass to find the largest element wouldn't increase the overall O() cost.
  • Douglas Leeder
    Douglas Leeder over 14 years
    It's not clear if the answer has to be in the original array.
  • ChrisW
    ChrisW over 14 years
    Jeff B's answer is merely O(n). Hash-sets and hash-dictionaries are the bee's knees.
  • Jeff B
    Jeff B over 14 years
    To do this without requiring a new array, you could simply replace the duplicate with an element popped off the end of the array, and redo the current loop, as the problem does not specify that order matters. This requires some extra bounds checking, but is very do-able.
  • sbi
    sbi over 14 years
    "One caveat is that the expected algorithm should not required the array to be sorted first."
  • Laurence Gonsalves
    Laurence Gonsalves over 14 years
    ChrisW: hash sets/dictionaries are only O(1) if you assume no collisions. (I'm not saying I wouldn't use them for this problem -- I probably would -- it's just a fallacy to claim that they're truly O(1).)
  • Greg Rogers
    Greg Rogers over 14 years
    It doesn't say you can't sort the array once you get it... Without using O(N) external memory, sorting is the only way to do it in O(N log N) or better.
  • Vitali
    Vitali over 14 years
    Actually, since you know the size of the array before-hand, you can guarantee O(1). Then you can trade off collisions vs how much additional memory you use.
  • fujy
    fujy over 14 years
    For the purpose of the problem, standard library utils should not be used. Regarding sorting, however, the more I think of it, the more unsure I am whether it is ok or not.
  • Mark Ransom
    Mark Ransom over 14 years
    You might want to rethink that downvote - newly posted conditions to the problem make Jeff B's solution invalid.
  • Steve Jessop
    Steve Jessop over 14 years
    Given that they're only integers, for simplicity you could assume 32bit and not bother looking for min/max: 2^32 bits is "only" 512MB, so finding the bounds is just a memory-use and O(1) time optimisation (granted, a hefty optimisation in the case of the example given). And if they're 64bit, it's irrelevant since you don't know that the min and max won't be further apart than the number of bits of memory you have. (A sketch of this bitmap approach appears after the end of this comment list.)
  • Mark Ransom
    Mark Ransom over 14 years
    You might want to elaborate on "traversal", since a naive erasure method might result in O(n^2) for large numbers of duplicates.
  • newpxsn
    newpxsn over 14 years
    Jeff's is still fine, you can do the comparison and motion in place. Take a look at my (sadly ignored) fully correct solution below.
  • Mark Ransom
    Mark Ransom over 14 years
    Great suggestion, but you'll need some bookkeeping to keep track of the end of each merge output. I actually did this once, and yes eliminating the duplicates as you merge makes it much faster.
  • Isabelle Wedin
    Isabelle Wedin over 14 years
    This was a good idea, until the question was edited. Your hashtable idea is apparently against the rules.
  • Mark Ransom
    Mark Ransom over 14 years
    In the modification to the original question, hash tables aren't allowed. Your two pointer approach is a nice way to compact the output once you've identified the duplicates, though.
  • LiraNuna
    LiraNuna over 14 years
    I don't get why this answer gets voted the most. It's written in Perl and uses vital features not available in C, which is what the question asks for.
  • mocj
    mocj over 14 years
    If you overwrite the duplicates you find with the value at the end of the array you can avoid the shifting of the whole array in your inner for() loop. That will bring you to O(n^2) from O(n^3). My C implementation is floating around here somewhere...
  • Dominik
    Dominik over 14 years
    I thought shifting was part of the requirement, but you are right of course.
  • Dominik
    Dominik over 14 years
    @mocj: I like your solution, looks very elegant. But I think it doesn't work if the last two elements are equal, because you stop checking for equality one before the last. (Commenting here because I have too little reputation to comment anywhere else :( )
  • mocj
    mocj over 14 years
    You're right, except that the original problem states that the values at the end of the array are negligible. Since you aren't returning the length of the modified array, the distinction between the last value and the second to last is unimportant when the two values are equal. Where does the caller interpret the end of the returned array to be?
  • LiraNuna
    LiraNuna over 14 years
    Theory aside, wouldn't allocating 512MB take more time than finding the min/max?
  • Steve Jessop
    Steve Jessop over 14 years
    Depends how much data there is, and what the min/max are. If you're looking at more than 512MB of input, then quite possibly it's faster to avoid that extra O(N) pass. Of course if you're looking at that much input, then it's less likely you have 512MB to spare. In cases where the min/max are close to 0/INT_MAX, then the optimisation doesn't help either. I'm just saying that although the first step obviously helps for small numbers, it can't avoid the fact that this algorithm uses UINT_MAX bits in the worst case, so you need to plan for that limitation.
  • Douglas Leeder
    Douglas Leeder over 14 years
    You may well be right - in any case clarification of the question means that using a bit-array is out. I'll leave this answer in case someone comes along later without the constraints and wants to view all possible answers.
  • Douglas Leeder
    Douglas Leeder over 14 years
    I think answers referring to C++ and C++ standard functions are useful, even if they don't answer the original question, as they provide a more rounded answer to people who find this question later.
  • Steve Jessop
    Steve Jessop over 14 years
    It's not clear whether O(N/2) extra space counts as the "helper data structure" banned in the question - I don't know whether the restriction is intended to stipulate O(1) extra space, or just to stipulate that the answer should not depend on a big ol' data structure implementation. Maybe a standard merge is fine. But if not, top tip: do not attempt to write an in-place merge sort in an interview, unless you really know what you're doing.
  • Kirk Broadhurst
    Kirk Broadhurst over 14 years
    This is the simple solution and is more than likely what the interview question is looking for.
  • Trevor Tippins
    Trevor Tippins over 14 years
    They might even be checking to see that you don't suffer from indulging in premature optimization unless they've given you runtime constraints too! :-)
  • IssEaPharma
    IssEaPharma over 14 years
    Nice one. I have a suggestion for improvement: the second nested loop can be changed to for(j=0; j < NewLength; j++) and the last if check can be changed to if (j == NewLength).
  • Byju
    Byju over 14 years
    That was a great suggestion. I have updated the code based on your comment.
  • Peter Recore
    Peter Recore over 14 years
    The question asked for C code, not Perl. Using Perl gets you hashtables and "push" for free. If I could do it in Scala you would just call input.removeDuplicates, but I doubt that would have been acceptable to the interviewers :)
  • fujy
    fujy over 14 years
    This solution does not need pre-sorting and is what I was looking for.
  • ziggystar
    ziggystar over 14 years
    Lol, though it's definitely faster to sort the array and work on the sorted one. Sorting should be provided by an API and is IMHO no premature optimization.
  • TMN
    TMN almost 14 years
    OTOH, the additional code required to implement a sorting tree might be less (memory) efficient than the simple solution, and is probably less efficient at run-time for small (say fewer than 100 elements) arrays.
  • Eastern Monk
    Eastern Monk over 12 years
    The 'seen' check is essentially a function of 'n', hence this algorithm is quadratic though it might appear to be linear.
  • dldnh
    dldnh about 12 years
    care to elaborate or provide a reference?
  • Blastfurnace
    Blastfurnace almost 12 years
    Test this with the input arrayInteger = {100,10,1};
  • David Gorsline
    David Gorsline over 11 years
    The OP specified that helper data structures be avoided, and that the results be returned in the original array. Your answer, unfortunately, does not meet these requirements.
  • Ankit Roy
    Ankit Roy over 11 years
    Extremely cool, underrated answer! I like the idea of using the element in position 1 as a sentinel value. If I could make a couple of small suggestions, it would be to change step 2 to include "each element is in the position corresponding to its hash modulo the array size", and perhaps clarify that the duplicates to be set to the sentinel are the elements that have the same value (as opposed to the same hash, or the same hash modulo array size).
  • Shail
    Shail about 11 years
    Shouldn't it be while ( current <= end ) instead of while ( current < end )?
  • Hardy Feng
    Hardy Feng almost 11 years
    Great idea. But it requires that the remaining data keep its original order.
  • Yuriy Chernyshov
    Yuriy Chernyshov almost 11 years
    Fails at least if we have the same value throughout the array, e.g. {1,1,1,1,1,1}. Useless code.
  • Yuriy Chernyshov
    Yuriy Chernyshov almost 11 years
    Fails at least with these inputs: {1,1,1,1,1,1,1} {0,0,0,0,0,1,1,1,1,1,1}
  • kumar
    kumar over 10 years
    This is not taking care of all inputs; e.g., 12223 will not be solved by this solution.
  • Admin
    Admin over 10 years
    @unkulunkulu o.O WAT??
  • SheetJS
    SheetJS over 10 years
    question is about C; this answer is C++
  • Pawan
    Pawan about 10 years
    Why was this accepted as the right answer? If order preservation is not necessary, then isn't it better to just use merge sort, O(n log n), and then remove the repeated elements in O(n)? Total complexity: O(n log n), which is much better than this solution.
  • laksbv
    laksbv about 10 years
    This solution is not working, and there is no way to find the length of the new array.
  • JavaSa
    JavaSa about 10 years
    Well, what is the complexity of this? Isn't it also O(n^2)?
  • fujy
    fujy almost 10 years
    The question does not allow extra data structure to be used.
  • Mike B
    Mike B over 9 years
    A paper that describes what your girlfriend suggested follows: dc-pubs.dbs.uni-leipzig.de/files/…
  • greybeard
    greybeard over 9 years
    Which question does this answer? How does it remove (possibly many) duplicates, in sublinear space, without hashing? Please elaborate do xor operation by traversing [...] elements.
  • Jonathan Leffler
    Jonathan Leffler over 9 years
    If you change the return type to int, and add int *base = array; at the top of the function and return current - base + 1; at the end, then you know how many entries exist in the compressed table. With that modification, the code in this seems to work correctly unless the array size is zero (but GIGO -- garbage in, garbage out; or add if (length <= 0 return 0; at the top).
  • Peregring-lk
    Peregring-lk about 9 years
    It's O( (n/2)^2) and Ω(n); and no idea about the average case :P
  • Paul Hankin
    Paul Hankin about 9 years
    So many upvotes, but this isn't efficient: it's O(n^2) when there are few duplicates.
  • greybeard
    greybeard about 9 years
    For all the bold glyphs, what did you make of helper data structure (e.g. hashtable) should not be used?
  • Sharief Muzammil
    Sharief Muzammil about 9 years
    Not necessarily needed. I just highlighted those for the purpose of understanding.
  • Mozein
    Mozein about 9 years
    This is the brute force solution. Trust me, Microsoft will not like this one.
  • Sathesh
    Sathesh almost 9 years
    Won't arr[i+1] throw an ArrayIndexOutOfBoundsException for the last element?
  • Eric Alford
    Eric Alford almost 7 years
    There are too many upvotes in this. As @Mozein said this is the brute force solution and is not the right answer. Microsoft will definitely turn someone away for presenting this as the solution.
  • GabrielBB
    GabrielBB almost 6 years
    @Sathesh No. Because of "< arr.length-1"
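
As a footnote to Steve Jessop's bit-array comments above, here is a hedged sketch of the 512 MB bitmap approach for 32-bit integers (the name rmdup_bitmap is made up, and the approach is ruled out by the question's later ban on helper data structures; it trades a very large allocation for a single O(n) pass):

#include <stdint.h>
#include <stdlib.h>

/* O(n) dedup using one bit per possible 32-bit value (2^32 bits = 512 MB).
   Keeps the first occurrence of each value, preserving order.
   Returns the new length, or -1 if the bitmap cannot be allocated. */
int rmdup_bitmap(int *array, int length)
{
    uint8_t *seen = calloc((size_t)1 << 29, 1);   /* 2^32 / 8 bytes */
    if (seen == NULL)
        return -1;

    int n = 0;
    for (int i = 0; i < length; i++) {
        uint32_t v = (uint32_t)array[i];
        if (!(seen[v >> 3] & (1u << (v & 7)))) {
            seen[v >> 3] |= (uint8_t)(1u << (v & 7));
            array[n++] = array[i];
        }
    }

    free(seen);
    return n;
}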