Removing duplicates from an array using divide and conquer in O(nlogn) time

10,830

A simple modification to merge sort will do the trick. Merge sort is O(n log n). It's also stable, meaning that items with equal keys will remain in their same relative order. The only difference here is that you want to eliminate duplicates. The only change to the code is in the comparison and copying of items. For example, the inner loop of a standard merge sort looks like this:

while (leftIndex < leftMax && rightIndex < rightMax)
{
    if (a[leftIndex] <= a[rightIndex])
    {
        output[outputIndex] = a[leftIndex];
        ++leftIndex;
    }
    else
    {
        output[outputIndex] = a[rightIndex];
        ++rightIndex;
    }
}

You would modify the code so you don't copy equal items:

while (leftIndex < leftMax && rightIndex < rightMax)
{
    if (a[leftIndex] < a[rightIndex])
    {
        output[outputIndex] = a[leftIndex];
        ++leftIndex;
    }
    else if (a[leftIndex] > a[rightIndex])
    {
        output[outputIndex] = a[rightIndex];
        ++rightIndex;
    }
    else
    {
        // items are equal.
        // increment the right, but don't copy anything
        ++rightIndex;
    }
}

You could also do this with a standard merge sort and then do a final pass over the sorted array to remove duplicates.

You could use Quicksort with a custom comparison function that compares the telephone number and the date/time. Then do a final pass over the sorted array to remove duplicates.

Both Quicksort and Mergesort are considered divide and conquer algorithms.

10,830

Author by

user2931097

Updated on November 25, 2022

Comments

user2931097 over 1 year

You are given an Array A of n requests for 2010 olympic tickets. The array is ordered by the time of the request so that A(1) is the first to arrive and A(2) is the second to arrive etc. Each request contains a ten digit telephone number. In order to try to be fair the olympic organizers have made a rule that there can only be one request from each telephone number. It has been noticed that array A contains more than one request from some telephone numbers. Write an O(nlogn) time divide and conquer algorithm to remove from A all requests from the same telephone number except the first received. The final output should be array A containing m<=n requests from a unique telephone number. Also the requests in A should remain in the same order as they were before the duplicates were removed.

I get how this can be done if the array was sorted by telephone number however I do not see how it is possible when the array is sorted by request time.
- Rozuur over 10 years
  
  modify merge sorts, merge routine to not include duplicates
- user2931097 over 10 years
  
  But as I merge elements, will I not have to compare the element being merged to all the already merged elements (O(n2)) to see if it is a duplicate? I am unclear on how to increase the merge sort pointers.
- Bernhard Barker over 10 years
  
  "I get how this can be done if the array was sorted by telephone number" - so sort it first, it only takes O(n log n)
- Abhishek Bansal over 10 years
  
  @Dukeling The initial order needs to be preserved. So the array will again have to be sorted as per the arrival time. Ofcourse that seems to be a valid solution if array entries are pairs<telephoneNo,time>.
- user2931097 over 10 years
  
  Yes I ran into the same problem when trying to use merge sort to sort the pairs by arrival time. I don't know how to avoid merging a duplicate without comparing it to all the other already merged values.
- czar x over 10 years
  
  I am not sure about a divide and conquer technique. But looks like pretty straight forward hash function computation, for instance we can just take unique phone numbers. once a collision occurs we just discard. This will take space as (all unique numbers) and one iteration over the list.