C: What is the best and fastest way to concatenate strings

26,744

Solution 1

Joel Spolsky, in his Back to Basics article, describes the problem of inefficient string concatenation with strcat as the Shlemiel the painter's algorithm (read the article, it's quite good). As an example of inefficient code, he gives this example, which runs in O(n2) time:

char bigString[1000];     /* I never know how much to allocate... */
bigString[0] = '\0';
strcat(bigString,"John, ");
strcat(bigString,"Paul, ");
strcat(bigString,"George, ");
strcat(bigString,"Joel ");

It's not really a problem to walk over the first string the first time; since we've already got to walk over the second string, the runtime of one strcat is linear in the length of the result. Multiple strcats is problematic though, because we walk over the previously concatenated results again and again. He provides this alternative:

How do we fix this? A few smart C programmers implemented their own mystrcat as follows:

char* mystrcat( char* dest, char* src )
{
     while (*dest) dest++;
     while (*dest++ = *src++);
     return --dest;
}

What have we done here? At very little extra cost we're returning a pointer to the end of the new, longer string. That way the code that calls this function can decide to append further without rescanning the string:

char bigString[1000];     /* I never know how much to allocate... */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

This is, of course, linear in performance, not n-squared, so it doesn't suffer from degradation when you have a lot of stuff to concatenate.

Of course, this is what you can do if you want to use standard C strings. The alternative that you're describing of caching the length of the string and using a special concatenation function (e.g., calling strcat with slightly different arguments) is sort of a variation on Pascal strings, which Joel also mentioned:

The designers of Pascal were aware of this problem and "fixed" it by storing a byte count in the first byte of the string. These are called Pascal Strings. They can contain zeros and are not null terminated. Because a byte can only store numbers between 0 and 255, Pascal strings are limited to 255 bytes in length, but because they are not null terminated they occupy the same amount of memory as ASCIZ strings. The great thing about Pascal strings is that you never have to have a loop just to figure out the length of your string. Finding the length of a string in Pascal is one assembly instruction instead of a whole loop. It is monumentally faster.

For a long time, if you wanted to put a Pascal string literal in your C code, you had to write:

char* str = "\006Hello!";

Yep, you had to count the bytes by hand, yourself, and hardcode it into the first byte of your string. Lazy programmers would do this, and have slow programs:

char* str = "*Hello!";
str[0] = strlen(str) - 1;

Solution 2

If you want it easy, fast, general, and safe, I suggest using the open_memstream() function (it's a part of the POSIX-2008 standard, unfortunately it did not make it into the C11 standard, thought). It works like this:

First, you hand it the address of a pointer and a size

char* result = NULL;
size_t resultSize = 0;
FILE* stream = open_memstream(&result, &resultSize);

the return value is a file stream just as if you had used fopen() to open a file. As such, you can use the entire arsenal of fprintf() & co. to stream anything you like to your memory buffer, which is automatically allocated and managed for you. Most importantly, it also keeps track of the size of the accumulated string, so it does not have to rescan it to compute its size.

for(int i = 0; i < 1000000; i++) {
    fprintf(stream, "current number is %d, or 0x%x\n", i, i);
}

Finally, you close the stream, which will update your result pointer and the size variable to reflect the actual amount of string data written.

fclose(stream);
//Now you have a zero terminated C-string in result, and also its size in resultSize.
//You can do with it whatever you like.
//Just remember to free it afterwards:
free(result);

Solution 3

To concatenate multiple strings, code may use strlen() and memcpy() which are both often well optimized functions.

With this approach, easy to add an inexpensive size limit.
Given the real chance the destination buffer may overflow otherwise, a size limit is essential.

Run time proportional to the sum of string lengths: O(len(S[0]) + len(S[1]) + len(S[2]) + ...)

char *strsncat(char *dest, size_t size, char * strs[], size_t n) {
  assert(size > 0);
  size--;
  char *p = dest;
  while (n-- > 0) {
    size_t len = strlen(*strs);
    if (len >= size) {
      len = size;
    }
    size -= len;
    memcpy(p, *strs, len);
    strs++;
    p += len;
  }
  *p = '\0';
  return dest;
}

void cat_test(void) {
  char dest[10];
  char *strs[]  = { "Red", "Green", "Blue" };
  printf("'%s'\n",strsncat(dest, sizeof dest, strs, sizeof strs/sizeof strs[0]));
  // 'RedGreenB'
}

Solution 4

This is a late answer, but I just have come across the same issue. To find a starting point, I decided to re-read the man pages for strcpy, strncpy, strlen, strnlen, strcat and strncat.

I have almost missed it, but fortunately ... there is an interesting passage in man strcpy on my development system (Debian stretch). Citing it (formatting mine):

strlcpy()

Some systems (the BSDs, Solaris, and others) provide the following function:

size_t strlcpy(char *dest, const char *src, size_t size);

This function is similar to strncpy(), but it copies at most size-1 bytes to dest, always adds a terminating null byte, and does not pad the target with (further) null bytes. This function fixes some of the problems of strcpy() and strncpy(), but the caller must still handle the possibility of data loss if size is too small. The return value of the function is the length of src, which allows truncation to be easily detected: if the return value is greater than or equal to size, truncation occurred. If loss of data matters, the caller must either check the arguments before the call, or test the function return value. strlcpy() is not present in glibc and is not standardized by POSIX, but is available on Linux via the libbsd library.

Yes, you are reading this right: The man page of a glibc function contains a hint to a non-standardized function in another library which does the job better. This might prove how important this issue is.

By the way, I'll never get why the designers of the str(n)cpy() functions didn't chose the number of bytes copied or a pointer to the new end of dest as return value. Returning just dest seems silly because those functions do not change that parameter, so in every case it is still known to the caller when the function has returned, and hence this choice does not make any sense. Did I miss something?

Until I knew about strlcpy(), I mostly have used my own string concatenation functions, something like @Joshua Taylor has shown in his answer. This idea has its own problems, though:

Scanning / copying strings byte by byte might be very inefficient. Depending on the target CPU, we should use the 32-bit or even 64-bit registers and copy multiple bytes at a time. Of course, this makes the function more complicated since we have to check if enough bytes remain to be copied, and if not, use the next smaller register size. To further improve performance, we should use assembly code to implement our function.

AFAIK, libraries like glibc and libbsd have it implemented that way. So it might be best to use the libbsd implementation. I have not done performance measurements, though.

Solution 5

Assume you have two string: s1 and s2 with length l1 and l2 Concatenation means that you should generate a new string s3 with length l1+l2. The time complexity of this operation is O(l1+l2). From this point of view, strcat() seems to be the best choice.

However, if you want to indicate the state that two strings are concatenated, then you only need to record their pointers, which is O(1). A simple example would be like this:

typedef struct ConcatStr {
    char* str1;
    char* str2;
} ConcatStr;
ConcatStr myStrcat( char* str1, char* str2 )
{
    ConcatStr cstr;
    cstr.str1 = str1;
    cstr.str2 = str2;
}
Share:
26,744
SomethingSomething
Author by

SomethingSomething

Updated on October 17, 2021

Comments

  • SomethingSomething
    SomethingSomething over 2 years

    I currently concatenate strings in c using the strcat() function from string.h library.

    I thought about it, and I got to a conclusion that it should be very expensive function, as before it starts to concatenate, it has to iterate over the char array until it finds the '\0' char.

    For example, if I concatenate the string "horses" 1000 times using strcat(), I'll have to pay (1 + 2 + 3 + ... + 1000) * strlen("horses") = (1000*1001)/2 * 6 = 3003000

    I thought about the non-standard way, of maintaining an integer with the string length, and then sending to strcat() the pointer to the end of the string:

    strcat(dest + dest_len, "string");
    

    In this case, I'll pay only 1000 * strlen("horses") = 1000 * 6 = 6000.

    6000 is 500x smaller than 3003000, so it can be very critical for performance if you make a lot of such concatenations.

    Is there some more standard way to do it, looking better than my solution?