How do I remove duplicate strings from an array in C?

12,633

Solution 1

The typical efficient unique function is to:

  1. Sort the given array.
  2. Verify that consecutive runs of the same item are setup so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.remove(std::unique(src.begin(), src.end()), src.end);

I know you can't use C++, but the implementation should essentially be the same.

Because you need to save the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

Do your first sort with respect to string, remove unique sets of elements on the sorted set, then resort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order.

EDIT2: Simple C implementation example of std::unique:

tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  tempUniqueEntry *result=first;
  while (++first != last)
  {
    if (strcmp(result->string,first->string))
      *(++result)=*first;
  }
  return ++result;
}

Solution 2

I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

unsigned int i;
for (i = n; i > 0; i--)
{
    unsigned int j;

    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    for (j = i - 1; j > 0; j--)
    {
        if (strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).

Solution 3

Can you control the input as it is going into the array? If so, just do something like this:

int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const int toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    strncpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strncmp(strarray[i], toaddlen + 1) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.  
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[i] = strarray[k];

            strarray[strcount] = null;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}

You can always use the above function looping over the elements of an existing array, building a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure, and used a linked list instead. They are much more efficient for removing elements from a location other than the end.

Share:
12,633
Jerry Smith
Author by

Jerry Smith

Updated on June 04, 2022

Comments

  • Jerry Smith
    Jerry Smith almost 2 years

    I have an array of strings in C and an integer indicating how many strings are in the array.

    char *strarray[MAX];  
    int strcount;
    

    In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.

    I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

    For example:

    strarray[0] = "Line 1"; 
    strarray[1] = "Line 2"; 
    strarray[2] = "Line 3"; 
    strarray[3] = "Line 2"; 
    strarray[4] = "Line 4";
    

    would become:

    strarray[0] = "Line 1"; 
    strarray[1] = "Line 3"; 
    strarray[2] = "Line 2"; 
    strarray[3] = "Line 4";
    

    Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

    I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

    The algorithm presented below would be ran every time a new string is added to the strarray.

    For the interest of showing that I am trying, I will include my proposed algorithm below:

    1. Search entire strarray for match to str
    2. If no match, do nothing
    3. If match found, put str in strarray
    4. Now we have a strarray with a max of 1 duplicate entry
    5. Add highest index strarray string to lowest index of temporary string array
    6. Continue downwards into strarray and check each element
    7. If duplicate found, skip it
    8. If not, add it to the next highest index of the temporary string array
    9. Reverse temporary string array and copy to strarray

    Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.

    The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

    Thank you!

  • Jerry Smith
    Jerry Smith almost 14 years
    wouldn't sorting lose the order of the elements?
  • Billy ONeal
    Billy ONeal almost 14 years
    This works well; it's better than the OP's original solution. +1. But unfortunately the performance is still order n-squared :(
  • Jerry Smith
    Jerry Smith almost 14 years
    As I understand your solution, if it is in strarray already it does nothing. If it is not, it adds it. If I am correct in my understanding, this will not work. I can control the input as it is entering the array but this method would not produce the result I gave in my post. I need the surviving duplicate to be in the highest, not the lowest, index. If toadd already exists in strarray[1] it would not be added to strarray[N] where N > 1
  • Jerry Smith
    Jerry Smith almost 14 years
    Thank you for your edit! I am a little rusty on sorting but I can take it from here. I am going to try your idea and see how well it works. From what I understand, I need to iterate the strarray making a temp array of tempUniqueEntry in the process. Sort tempArray by string, remove duplicates, sort tempArray again by position, then reconstruct the strarray. Correct?
  • jdmichal
    jdmichal almost 14 years
    @Jerry Smith Your example is wrong then. It should read 1, 3, 2, 4. I'll correct my solution shortly... But that is a much more expensive operation, because it will require compacting the array each time.
  • jdmichal
    jdmichal almost 14 years
    @Jerry Smith Added removal and compacting. Please note my PS at the end.
  • Jerry Smith
    Jerry Smith almost 14 years
    @jdmichal you are right. My example was wrong. i have edited my post now to be correct. Sorry about that. I absolutely must have the array strings in their original order with the highest index kept and all lower index duplicates removed. However, I have the luxury of knowing my array will never have more than 10 elements in it so, in that regard, I am willing to go with a lower performing algorithm. I was just hoping something out there was better than my algorithm.
  • Jerry Smith
    Jerry Smith almost 14 years
    What I meant in step 5 is simply: // where 0 is the lowest index and 9 is the largest index available temparray[0] = strarray[9];
  • Yavor Angelov
    Yavor Angelov almost 14 years
    This is not C (see: new/delete)
  • Jerry Smith
    Jerry Smith almost 14 years
    I am doing this operation somewhat frequently but this is a temporary project so I am not too concerned with finding efficient methods. Just wanted something a little easier to understand and implement that might be a little faster.
  • Billy ONeal
    Billy ONeal almost 14 years
    @Jerry: Yes, that's correct. You don't need to implement your own sorting algorithm; the standard library's qsort can do that. You'll just have to define comparison functions. As for remove, you may wish to follow std::unique, (like this one: cplusplus.com/reference/algorithm/unique ) because it can make the entire array unique in linear time -- no constant resizing of the array will be required as things are removed. (Just re implement std::unique yourself, replacing the predicate function with a function pointer returning equals, and the iterators with pointers)
  • jdmichal
    jdmichal almost 14 years
    Cool. More down-votes for piddly crap. Corrected solution to use malloc and free instead of new and delete.
  • Jerry Smith
    Jerry Smith almost 14 years
    @jdmichal for what its worth, I wasn't looking for someone to give me full and complete tested code in C. I am smart enough to transform new/delete to malloc/free :) But apparently the stackoverflow community prefers answers to be complete when possible
  • Yavor Angelov
    Yavor Angelov almost 14 years
    @jdmichal: It's not 'piddly crap' - the code didn't even compile on a C compiler...
  • jdmichal
    jdmichal almost 14 years
    @Jerry Smith Thank you for understanding that some code typed into a webform in a time-pressured scenario may not compile without minor modifications, and instead understanding that the idea presented is the important part.