Length of a C++ std::string in bytes

30,785

Solution 1

When dealing with non-char instantiations of std::basic_string<>, sure, length may not equal number of bytes. This is particularly evident with std::wstring:

std::wstring ws = L"hi";
std::cout << ws.length();     // <-- 2, not the byte count (2 * sizeof(wchar_t))

But std::string is about char characters; as far as std::string is concerned there is no such thing as a multi-byte character, whether you crammed one in at a higher level or not. So std::string::length() is always the number of bytes in the string. Note that if you store multi-byte "characters" in a std::string, your definition of "character" is at odds with that of the container and of the standard.
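A minimal sketch of that point, assuming the text is UTF-8 encoded. The function names are made up for illustration, and the bytes of "é" are written as escapes so the result does not depend on the source file's encoding:

```cpp
#include <cassert>
#include <string>

// "é" in UTF-8 is the two bytes 0xC3 0xA9.
inline std::string utf8_cafe() {
    return std::string("caf\xC3\xA9");
}

// Hypothetical demo: length() counts char elements, i.e. bytes.
inline void demo_length_counts_bytes() {
    const std::string s = utf8_cafe();
    // Four user-visible characters, but five char elements.
    assert(s.length() == 5);
    // size() and length() are synonyms.
    assert(s.size() == s.length());
}
```

Calling demo_length_counts_bytes() passes both assertions: the container sees five chars and has no notion that two of them form one character.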

Solution 2

If we are talking specifically about std::string, then length() does return the number of bytes.

This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.

Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.

EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT which says how many bits are in a byte.
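These guarantees can be written down as compile-time checks. This is only a sketch; bits_in is a hypothetical helper, not something the Standard provides:

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <string>

// sizeof(char) is 1 by definition on every conforming implementation.
static_assert(sizeof(char) == 1, "a char is exactly one byte");

// The Standard only guarantees a lower bound on the width of a byte.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by the string's characters.
inline std::size_t bits_in(const std::string& s) {
    return s.length() * CHAR_BIT;
}
```

On most current platforms CHAR_BIT is 8, so bits_in("hi") would be 16, but only the lower bound is portable.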

By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.

Solution 3

A std::string is std::basic_string<char>, so s.length() * sizeof(char) = byte length. Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.
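The same formula generalizes to any std::basic_string instantiation by multiplying by the size of the element type. A small sketch, with byte_length as a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the characters of any basic_string,
// excluding the terminating null and any unused capacity.
template <class CharT, class Traits, class Alloc>
std::size_t byte_length(const std::basic_string<CharT, Traits, Alloc>& s) {
    return s.length() * sizeof(CharT);
}
```

For a std::string this is simply s.length(); for a std::wstring it is s.length() * sizeof(wchar_t), which varies between platforms.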

If you have UTF-8 data in a std::string, you'll need to use something else such as ICU to get the "real" length.
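If only the number of Unicode code points is needed, and the data is known to be valid UTF-8, a rough count is possible without a library by skipping continuation bytes. This is a sketch under those assumptions, with utf8_code_points as a hypothetical name. It does no validation, and a code point is still not always what a user perceives as one character (combining marks, for example), which is where a library such as ICU comes in:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes valid UTF-8 input; malformed input gives a meaningless count.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the five-byte string "caf\xC3\xA9" this returns 4, while length() returns 5.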

Author by

ComicSansMS

Updated on July 09, 2022

Comments

  • ComicSansMS almost 2 years

    I'm having some trouble figuring out the exact semantics of std::string.length(). The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.

    In particular, is this only relevant to non-char instantiations of std::basic_string<> or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF8-aware?

  • ComicSansMS over 12 years
    That makes perfect sense. I simply got confused by the wording in the documentation here. Thanks for clearing things up.
  • Lightness Races in Orbit over 12 years
    @ComicSansMS: Not a problem :)
  • Lightness Races in Orbit over 12 years
    Indeed, "byte" is not necessarily synonymous with "octet".
  • Mike Seymour over 12 years
    The standard does define CHAR_BIT, the number of bits in a byte.
  • John Dibling over 12 years
    @Mike: True, but the Standard doesn't say what that's defined to. When I said "doesn't say how many bits are in a byte" I meant in a precise, unambiguous sense. But I'll clarify my post with an edit, thanks for pointing this out.
  • Adrian Ratnapala over 12 years
    "But std::string is about char characters". So the definition of "character" in C++ is "element of some string type", rather than "what a human sees, encoded" or "a Unicode code point, encoded somehow". This sounds believable, but can anyone quote chapter-and-verse on this?
  • Lightness Races in Orbit over 12 years
    @AdrianRatnapala: It's less that the standard says it doesn't care about encodings, and more that it doesn't say that it does. Still, 2.3/1 might be of interest: it defines the "basic character set". And 2.3/3 says: "The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific."
  • Adrian Ratnapala over 12 years
    Well, I guess that's what I get for asking for chapter-and-verse.
  • Lightness Races in Orbit over 12 years
    @AdrianRatnapala: Yes, when asking for chapter-and-verse, you get chapter-and-verse. Anything else I can help you with? :)