Why null-terminated strings? Or: null-terminated vs. characters + length storage

11,349

Solution 1

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.

Solution 2

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

Solution 3

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper(*s);
    s++;
}

whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:

while (i < s.length)
{
    s.data[i] = toupper(s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper(*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

Solution 4

Lengths have their problems too.

  • The length takes extra storage (not such an issue now, but a big factor 30 years ago).

  • Every time you alter a string you have to update the length, so you get reduced performance across the board.

  • With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.

  • NUL-terminated strings are much simpler - The NUL terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.

Solution 5

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting with Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offseeted address there. When storing size in some other way you would need to construct a new string.

Share:
11,349
Imagist
Author by

Imagist

I am a computer science student and professional programmer. My passion is programming languages; learning them, designing them, and implementing them. I am also an amateur cook, martial artist, mathematician, musician, scientist, and writer.

Updated on June 17, 2022

Comments

  • Imagist
    Imagist almost 2 years

    I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

    struct String
    {
        char* characters;
        size_t length;
    };
    

    Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

    Bounds-checking is built-in if you use "length" instead of looking for a null.

    You have to traverse the entire string to find its length.

    You have to do extra stuff to handle a null character in the middle of a null-terminated string.

    Null-terminated strings deal poorly with Unicode.

    Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

    String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

    struct String slice(struct String in, size_t begin, size_t end)
    {
        struct String out;
        out.characters = in.characters + begin;
        out.length = end - begin;
    
        return out;
    }
    
    char* slice(char* in, size_t begin, size_t end)
    {
        char* out = malloc(end - begin + 1);
    
        for(int i = 0; i < end - begin; i++)
            out[i] = in[i + begin];
    
        out[end - begin] = '\0';
    
        return out;
    }
    

    After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

    So my question is: are there any benefits to null-termination that I'm missing?

  • weiqure
    weiqure over 14 years
    Can you give an example of a string of that you might want to print the end of the string?
  • Dan McGrath
    Dan McGrath over 14 years
    Take a front trim of white space. You can determine the first non-white space char and instead of actually changing the string, you can just pass the starting offset in, saving you either allocating new memory, or copying data.
  • Daniel Earwicker
    Daniel Earwicker over 14 years
    It's enough for 2^CHAR_BIT characters per string.
  • Jason Williams
    Jason Williams over 14 years
    You can truncate a string-with-length by just reducing the length. Usually this means you have two lengths - the current length of the string plus the amount of memory you have currently allocated for the string. This allows it to be dynamically resized without having to realloc on every modification.
  • Kirill V. Lyadvinsky
    Kirill V. Lyadvinsky over 14 years
    It is only 255 characters. Too little.
  • caf
    caf over 14 years
    No, it's enough for 2^CHAR_BIT - 1 characters per string, and if you've never had to deal with strings longer than 255 characters then you've only had to program for a very limited problem domain. However C does say concrete things about other integral types - for example int has to have a range of at least -32767 to +32767. In particular C says that size_t must be able to hold the size of any object, so that would be fine as a standardised string length.
  • RBerteig
    RBerteig over 14 years
    This is what Lua does. It makes interfacing to C for normal use cases very clean, and still supports arbitrary length binary buffers.
  • Daniel Earwicker
    Daniel Earwicker over 14 years
    It's what most things do! You don't even have to maintain the null terminator all the time - just do str[len] = '\0' whenever you need to. This is what `std::string::c_str`` usually does in C++.
  • AProgrammer
    AProgrammer over 14 years
    Denis Ritchie opinion is somewhat different. BCPL had a length+content representation, with the length contained in one byte. B switched to terminated string "partially to avoid the limitation on the length of a string caused by holding the count in an 8 or 9 bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." (The Development of the C Language, cm.bell-labs.com/cm/cs/who/dmr/chist.pdf)
  • Daniel Earwicker
    Daniel Earwicker over 14 years
    By most things, I mean most string classes, and most interpreter string representations. One widely-used example on Windows is the BSTR type.
  • user2507101
    user2507101 over 14 years
    Extra storage can still be a big issue for embedded systems, which is one reason to stress keeping the language light-weight.
  • user2507101
    user2507101 over 14 years
    "When CPU registers were a scarce resource" - registers are still a scarce resource on x86 and x64.
  • Dan McGrath
    Dan McGrath over 14 years
    Indeed, that is the way to go; I had based my answer on the op's string struct, which would enable you to reduce the length, but never to utilise that space again. Ropes are another interesting way to handle strings. en.wikipedia.org/wiki/Rope_(computer_science)
  • Imagist
    Imagist over 14 years
    This is exactly why I asked this question; I thought I might be missing some solution. It seems obvious now but I didn't think of this!
  • Daniel Earwicker
    Daniel Earwicker over 14 years
    @caf, you aren't thinking creatively enough if you can't see how to represent that extra unit of character length. :)
  • Daniel Earwicker
    Daniel Earwicker over 14 years
    Cool! Well, you see that green check icon next to my answer...?
  • Imagist
    Imagist over 14 years
    @Jimmy My issue there is: on such embedded systems, why are you using strings? I don't think I even ever used a char back when I was doing robotics programming. The only example I can think of is if you're programming for an LED display (like those scrolling text things or the ones on soft drink machines) but the functionality there is so simple that I still have a difficult time imagining the extra 3 bytes being a problem (4 byte int - 1 byte since you don't have to store the null character).
  • Imagist
    Imagist over 14 years
    @Jla3ep Did you look at my code sample? I used size_t to store the length for a reason.
  • Imagist
    Imagist over 14 years
    @caf I think Earwicker means that you can have the characters pointer point to NULL.
  • Imagist
    Imagist over 14 years
    ...for a zero-length string (I hate that you can't edit comments!).
  • Imagist
    Imagist over 14 years
    A few questions: assuming a byte is 8 bits, shouldn't a 32-bit system have sizeof(size_t) == 4 and sizeof(char*) == 4? And with my method you don't have to use 8 characters for the first method. So I'm getting: 4 + 4 + 7 = 15 for my method, and 4 + 8 = 12 for the null-terminating method. I'm not contesting your point, just the math that leads to your point.
  • Imagist
    Imagist over 14 years
    @Dan McG (comment) My strings will be stored in a garbage-collected system, which will allow me to reclaim the space. And my interpreter uses ropes internally; the garbage collector will use idle cycles to flatten ropes into strings.
  • Imagist
    Imagist over 14 years
    That's not all that different for what I do with my struct: struct String new; new.characters = old.characters + offset; new.length = old.length - offset; It's a bit of bookkeeping but comes out to what, 5 instructions? This seems trivial compared to the difference if you needed to do something to the beginning of the string instead of the end.
  • Kirill V. Lyadvinsky
    Kirill V. Lyadvinsky over 14 years
    @Imagist, you asked why C uses null-terminate strings. I believe that size of size_t was equal to 1 in the beginning.
  • sharptooth
    sharptooth over 14 years
    What you suggest is not that trivial btw. Who will take ownership of the buffer? Will the newly created temporary string do that? I doubt you want this and then you need a way to have such non-owning strings to avoid copying.
  • caf
    caf over 14 years
    The point is that during a processing loop like the above, length-based strings like yours end up using two registers for string book-keeping whereas sentinel-based strings like idiomatic C strings only use one (the other one is obtained "for free" because the character values are being loaded in order to process them anyway).
  • Imagist
    Imagist over 14 years
    @Earwicker I was planning on checking that, I just like to wait a few days. There are few things more annoying to me than people who accept an answer within minutes of asking the question.
  • Mike Dunlavey
    Mike Dunlavey over 14 years
    It makes it really easy to do things like recursive string matching, spelling correction, etc., if you can treat the string like you would a list in Lisp.
  • Nick Johnson
    Nick Johnson over 14 years
    Most high level languages have immutable strings.
  • Samie Bencherif
    Samie Bencherif about 5 years
    @NickJohnson most low level languages have immutable strings. and most high level languages have mutable strings.
  • Carl Smith
    Carl Smith over 3 years
    Which high level languages have mutable strings?