How to get byte size of multibyte string

c string character-encoding size multibyte

10,694

Solution 1

According to MSDN, _tcslen corresponds to strlen when _MBCS is defined, and strlen returns the number of bytes in the string (not counting the terminating null). If you use _tcsclen instead, that corresponds to _mbslen, which returns the number of multibyte characters.

Also, multibyte strings do not (AFAIK) contain embedded nulls, no.

I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.

Solution 2

Let's see if I can clear this up:

"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically means "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.

Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:

text: t  h  é     \0
mem:  74 68 c3 a9 00

This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:

struct my_string
{
    size_t length;
    char *data;
};

... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
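To make the idea concrete, here is a minimal sketch of two such helper functions. The names my_string_init and my_string_free are made up for illustration, and this is only one way to do it, not how std::string is actually implemented:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct my_string
{
    size_t length;  /* number of bytes in data, not counting the extra terminator */
    char *data;
};

/* Hypothetical helper: copy `length` bytes from `bytes` into a new buffer.
   Embedded 0 bytes are preserved because the size is stored explicitly.
   Returns 0 on success, -1 if the allocation fails. */
int my_string_init(struct my_string *s, const char *bytes, size_t length)
{
    s->data = malloc(length + 1);
    if (s->data == NULL) {
        s->length = 0;
        return -1;
    }
    memcpy(s->data, bytes, length);
    s->data[length] = '\0';  /* convenience terminator, not part of length */
    s->length = length;
    return 0;
}

/* Hypothetical helper: release the buffer and reset the fields. */
void my_string_free(struct my_string *s)
{
    free(s->data);
    s->data = NULL;
    s->length = 0;
}
```

Initialising one of these from the 5 bytes "ab\0cd" gives length == 5, while strlen() on the same data stops at the embedded null and reports 2.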

For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
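A small sketch of the difference, using the UTF-8 "thé" bytes from the table above. The function utf8_count_code_points is a hypothetical helper written for this example, not a standard library call, and it assumes the input is valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 code points in a null-terminated string.
   Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
   starts a new code point. Assumes the input is valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* "thé" in UTF-8, spelled with explicit byte escapes: 74 68 c3 a9 00 */
static const char the_utf8[] = "th\xc3\xa9";
```

Here strlen(the_utf8) is 4 (bytes before the terminator), sizeof the_utf8 is 5 (terminator included), and utf8_count_code_points(the_utf8) is 3.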


text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00

That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
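As a rough, portable illustration (wchar_t is 2 bytes on Windows but often 4 elsewhere, so this sketch uses uint16_t instead), the two helpers below are hypothetical stand-ins: utf16_count_units does what wcslen does on Windows, and utf16_count_code_points shows why the number of 2-byte units is still not the number of characters. It assumes well-formed UTF-16:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units before the
   terminating 0x0000 unit (what wcslen reports on Windows). */
size_t utf16_count_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* Hypothetical helper: number of code points. Every unit counts except
   low (trailing) surrogates in 0xDC00..0xDFFF, which are the second half
   of a surrogate pair. Assumes well-formed UTF-16. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t count = 0;
    for (; *s != 0; ++s) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            ++count;
    }
    return count;
}

/* "thé": three units, three code points, six bytes before the terminator. */
static const uint16_t the_utf16[] = { 0x0074, 0x0068, 0x00E9, 0x0000 };

/* U+1F600, outside the 16-bit range: one code point stored as the
   surrogate pair D83D DE00, i.e. two units and four bytes. */
static const uint16_t pair_utf16[] = { 0xD83D, 0xDE00, 0x0000 };
```

The byte size without the terminator is utf16_count_units(s) * sizeof(uint16_t): 6 for the_utf16 and 4 for pair_utf16, even though pair_utf16 holds a single character.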

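Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE is defined. (The Windows headers key TCHAR off UNICODE, while the C runtime's tchar.h macros such as _tcslen key off _UNICODE, so the two are normally defined together.) _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.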
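To show the mechanism without depending on the Windows headers, here is a sketch that imitates the mapping with made-up names: my_tchar, MY_T, my_tcslen and the MY_UNICODE switch are all hypothetical stand-ins for TCHAR, _T, _tcslen and UNICODE/_UNICODE. The real definitions in tchar.h are more involved:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, _T() and _tcslen.
   Define MY_UNICODE to switch everything to the wide variants. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define MY_T(s) s
#define my_tcslen strlen
#endif

/* Size in bytes of a null-terminated my_tchar string, excluding the
   terminator: element count times element size. */
size_t my_tcs_byte_size(const my_tchar *s)
{
    return my_tcslen(s) * sizeof(my_tchar);
}
```

my_tcs_byte_size(MY_T("TCHAR string")) comes out as 12 * sizeof(my_tchar) in either configuration; add one more sizeof(my_tchar) if the terminator has to be counted, for example when sizing a buffer.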


Author by

flacs

Updated on July 24, 2022

Comments

flacs almost 2 years
How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?

Or, more general, how do I get the right byte size of a TCHAR string?

Solution:
```
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
```
EDIT:
I was talking about null-terminated strings only.
Steve Jessop almost 14 years

UTF-8 strings don't contain embedded nulls (specifically: the only place a 0 byte ever occurs is representing the 0 code point, so if that's your terminator then you can search for it byte-wise). I'm not sure whether UTF-16 is considered a "multibyte encoding" in this context, but it can certainly contain 0 bytes, just not 0 double-bytes. I think SHIFT-JIS doesn't use 0 bytes except when encoding 0. Lots of encodings in the world, but I'm not sure what's possible within Windows locales...
Thanatos almost 14 years

That's a bit muddled: UTF-8 strings can contain nulls, if you're storing the size in something other than a null terminator. Null terminated strings cannot contain nulls, because they're null terminated. A null terminated UTF-8 string cannot contain nulls for the same reason. That said, I cannot think of any useful purpose to putting a null in a UTF-8 string other than to terminate it.
flacs almost 14 years

"(Also: the number of bytes / 2 != the number of characters)" How so?
Thanatos almost 14 years

@Tilka: That's the way UTF-16 encodes characters. UTF-16 can encode more than 65,536 different characters, so it should be clear that 2 bytes are not enough. UTF-16 encodes many characters as just 2 bytes, but must use 4 for some, in a form known as "Surrogate pairs" (See Wikipedia's article on UTF-16.)
flacs almost 14 years

Ah yes, I confused it with UCS-2. Nice explanation btw, but the other answer was straight to the point.