Does a string's length equal the byte size?


Solution 1

Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi ShortString) has an extra byte for the length. And Unicode strings can have more than one byte per character.

With Unicode it depends on the encoding. It could be a fixed 4 bytes per character (UTF-32), 2 or 4 bytes (UTF-16), or anywhere from 1 to 4 bytes (UTF-8).
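As a rough illustration of the first two cases, here is a minimal Python sketch that builds both layouts by hand. The variable names are made up for this example, and it only mimics the byte layout; it is not how C or Delphi actually allocate strings.

```python
text = "hello"
payload = text.encode("ascii")  # 5 bytes of content

# C-style: the content followed by a single zero byte as terminator
c_string = payload + b"\x00"

# Pascal-style short string: one length byte followed by the content
pascal_string = bytes([len(payload)]) + payload

print(len(text), len(c_string), len(pascal_string))  # 5 6 6
```

In both layouts the storage is one byte larger than the character count, even for plain ASCII.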

Solution 2

It entirely depends on the platform and representation.

For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, characters in the range U+10000 to U+10FFFF need a surrogate pair, that is, two UTF-16 code units (four bytes) for a single Unicode character. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.
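To make the surrogate pair point concrete, here is a small sketch in Python rather than .NET. The helper `utf16_code_units` is a hypothetical name for this example; it counts UTF-16 code units, which is the unit .NET's `string.Length` reports, while Python's `len` counts code points.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units (two bytes each, no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside the BMP, needs a surrogate pair

print(len(bmp), utf16_code_units(bmp))        # 1 1
print(len(astral), utf16_code_units(astral))  # 1 2
```

So even "length" itself can differ between platforms for the same text, before bytes come into it.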

Now, when you write a string out to disk (or the network, etc.) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc.); UTF-8 gives the full Unicode range with a variable-length encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character. The list goes on.
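The effect of the encoding choice is easy to check. The following Python sketch (Python standing in for .NET here, with a sample word chosen for illustration) encodes the same five-character string several ways and compares the byte counts.

```python
text = "na\u00efve"  # "naïve": 5 characters, one of them outside ASCII

# Encoded size in bytes for each encoding (the "-le" variants omit the BOM)
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(len(text), sizes)
# 5 {'utf-8': 6, 'utf-16-le': 10, 'utf-32-le': 20}

# ASCII cannot represent the accented character at all
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```

Only for pure ASCII text encoded as ASCII or UTF-8 does the byte count match the character count.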

As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.

Solution 3

It depends on what you mean by "length". If you mean "number of characters", then no: many languages and encodings use more than one byte per character.

Solution 4

Not always; it depends on the encoding.

Solution 5

There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)

Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)

Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.

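Unicode strings in many languages (for example Java, C# and JavaScript) are stored as UTF-16, so each char takes two bytes, and characters outside the Basic Multilingual Plane take two chars.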

Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.
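The in-memory overhead can be observed directly in some runtimes. As one example, this sketch uses `sys.getsizeof` in CPython; the exact numbers depend on the interpreter version and platform, so none are shown here.

```python
import sys

empty = ""
short = "abc"

# Even an empty string object occupies memory for its header and bookkeeping
overhead = sys.getsizeof(empty)
print("empty string object:", overhead, "bytes")

# The full object is always larger than the character count alone
print("3-character string object:", sys.getsizeof(short), "bytes")
```

This measures the string object as CPython stores it, which is a different question from how many bytes the text takes once encoded to a file or database column.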

Author by


I am a website and web application developer in Calgary, Alberta. I have been doing backend web development in PHP and frontend in HTML/CSS/JavaScript for over 20 years. My specialties are Symfony, Vue, Event Sourcing & CQRS, Craft CMS, WordPress. I've built everything from basic brochure-style websites to heavily trafficked eCommerce sites and social platforms to internal applications.

Updated on January 04, 2020


  • penetra, over 4 years ago

    Exactly that: Does a string's length equal the byte size? Does it depend on the language?

    I think it does, but I just want to make sure.

    Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.

    As the answer is no, that's all I need to know.