Does a wchar string end with a single null byte or two of them?

10,259

Solution 1

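Since a wide string is an array of wide characters, it cannot end in a one-byte NUL. The terminator is a full-width null wide character: two bytes when wchar_t is 16 bits wide, four bytes when it is 32 bits wide. (Arrays in C/C++ can only hold members of the same type, and therefore of the same size.)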

Also, for standard ASCII characters, each wide character always contains one or three zero bytes (depending on whether wchar_t is 16 or 32 bits wide), because only non-ASCII characters have a non-zero high byte. For simplicity, I assume 16-bit and little-endian:

HELLO is 72 00 69 00 76 00 76 00 79 00 00 00 (byte values in decimal; in hexadecimal that is 48 00 45 00 4C 00 4C 00 4F 00 00 00)
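The layout above can be checked with a small sketch. The helper name `trailing_zero_bytes` is hypothetical, not part of any library; it simply counts how many zero bytes sit at the end of a wide string's storage, terminator included. No particular width or byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts the zero bytes at the very end of a wide
   string's storage, including the bytes of the terminator itself. */
size_t trailing_zero_bytes(const wchar_t *s)
{
    assert(s != NULL);

    /* Total size in bytes, terminator included. */
    size_t total = (wcslen(s) + 1) * sizeof(wchar_t);

    /* Inspecting an object through unsigned char is always permitted. */
    const unsigned char *bytes = (const unsigned char *)s;

    size_t count = 0;
    while (count < total && bytes[total - 1 - count] == 0)
        count++;
    return count;
}
```

For L"HELLO" the result is at least sizeof(wchar_t). On a little-endian system it is larger, because the zero high byte(s) of the final 'O' run directly into the terminator. Code that scans raw bytes therefore cannot assume that the run of zeros at the end is exactly one terminator long.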

Solution 2

Here you can read a bit more about wide characters: http://en.wikipedia.org/wiki/Wide_character#Size_of_a_wide_character

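The terminator is L'\0', a null wide character. Where wchar_t is 16 bits wide, that is a 16-bit null, so it is like two 8-bit null chars.

Remember that "009A" is a single wchar_t with a non-zero value, so it is not a null wide character even though one of its bytes is zero.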

Solution 3

In C (quoting the N1570 draft, section 7.1.1):

A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character.

where a "wide character" is a value of type wchar_t, which is defined in <stddef.h> as an integer type.

I can't find a definition of "wide string" in the N3337 draft of the C++ standard, but it should be similar. One minor difference is that wchar_t is a typedef in C, and a built-in type (whose name is a keyword) in C++. But since C++ shares most of the C library, including functions that act on wide strings, it's safe to assume that the C and C++ definitions are compatible. (If someone can find something more concrete in the C++ standard, please comment or edit this paragraph.)

In both C and C++, the size of a wchar_t is implementation-defined. It's typically either 2 or 4 bytes (16 or 32 bits, unless you're on a very exotic system with bytes bigger than 8 bits). A wide string is a sequence of wide characters (wchar_t values), terminated by a null wide character. The terminating wide character will have the same size as any other wide character, typically either 2 or 4 bytes.

In particular, given that wchar_t is bigger than char, a single null byte does not terminate a wide string.
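As a rough illustration, the sketch below compares what a byte-oriented function sees with what a wide-character function sees. The two helper names are hypothetical. The exact value returned by strlen() depends on the size and byte order of wchar_t, so no specific number is claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: how many bytes strlen() walks over before it
   hits the first zero byte in the wide string's storage. */
size_t bytes_seen_by_strlen(const wchar_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}

/* Hypothetical helper: size of the wide string's contents in bytes,
   not counting the terminating null wide character. */
size_t bytes_in_wide_string(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s) * sizeof(wchar_t);
}
```

For L"HELLO", wcslen() reports 5 wide characters, while strlen() stops at the first zero byte, which lies inside the first wide character whenever wchar_t is wider than char. The zero bytes inside the string do not end it; only a whole wchar_t equal to zero does.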

It's also worth noting that byte order is implementation-defined. A wide character with the value 0x1234, when viewed as a sequence of 8-bit bytes, might appear as any of:

  • 0x12, 0x34
  • 0x34, 0x12
  • 0x00, 0x00, 0x12, 0x34
  • 0x34, 0x12, 0x00, 0x00

And those aren't the only possibilities.
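To see which of these layouts a given implementation uses, one option is to copy a wide character's object representation into a byte array and inspect it. This is a minimal sketch; `wide_char_bytes` is a hypothetical helper name, and the caller is assumed to provide a buffer of at least sizeof(wchar_t) bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: copies the in-memory bytes of one wide character
   into out[] and returns how many bytes were written.
   out must have room for sizeof(wchar_t) bytes. */
size_t wide_char_bytes(wchar_t wc, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &wc, sizeof wc);  /* byte order is whatever the platform uses */
    return sizeof wc;
}
```

Calling it with (wchar_t)0x1234 and printing out[0] through out[n - 1] shows the layout on the current system. The same call with L'\0' yields n zero bytes, which is exactly what the terminator looks like in memory.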

Solution 4

If you declare

WCHAR tempWchar[BUFFER_SIZE];

you can set every element to the null wide character (use L'\0' rather than NULL, which is a pointer constant):

for (int i = 0; i < BUFFER_SIZE; i++)
    tempWchar[i] = L'\0';
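A portable variant of the same idea, sketched with standard wchar_t instead of the Windows WCHAR typedef. The function name `clear_wide_buffer` and the value of BUFFER_SIZE are assumptions made for illustration; wmemset() is the standard <wchar.h> counterpart of memset() that counts in wide characters rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define BUFFER_SIZE 64  /* hypothetical size, chosen for illustration */

/* Hypothetical helper: fills count wide characters with the null wide
   character. count is a number of elements, not a number of bytes. */
void clear_wide_buffer(wchar_t *buf, size_t count)
{
    assert(buf != NULL);
    wmemset(buf, L'\0', count);
}
```

When the buffer is declared and cleared in one place, an initializer such as `wchar_t buf[BUFFER_SIZE] = {0};` achieves the same result without a loop. In every case each element is zeroed as a whole wchar_t, so all of its bytes become zero.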
Author by

Kosmo零

Updated on June 05, 2022

Comments

  • Kosmo零
    Kosmo零 almost 2 years

    I just don't understand how a wchar string ends, and I can't find much information about it.

    If it ends with a single null byte, how does the program know that the string has not ended yet when something like "009A" represents a Unicode symbol?

    If it ends with two null bytes? I am not sure about that and need confirmation.

    • Kosmo零
      Kosmo零 over 11 years
      In C++. I didn't know wchar existed anywhere else.
    • j.w.r
      j.w.r over 11 years
      Somewhat related: Making a WCHAR null terminated. Might be some hints in there as to how to approach this.
    • Keith Thompson
      Keith Thompson over 11 years
      In C++, wchar_t (not wchar) is a predefined type. In C, wchar_t is a typedef defined in <stddef.h>. In both cases, the size is implementation-defined; on my system its size is 4 bytes (32 bits).
  • Kosmo零
    Kosmo零 over 11 years
    Err, so if I access an array of wchar like this: arr[0] = 0; will it set both the first and the second byte to zero automatically?
  • Admin
    Admin over 11 years
    @Kosmos (If this is not yet clear, I suggest you read a good tutorial on C pointers and arrays!)
  • Kosmo零
    Kosmo零 over 11 years
    Is there any way that wchar can be converted to char? I am reverse-engineering a Chinese app, and as far as I can see they are using char* for text manipulation. Could it just be a wchar array converted to a char* of double the size?
  • Admin
    Admin over 11 years
    @Kosmos There are libraries with which you can convert UTF-16 (wide strings) to UTF-8.
  • Keith Thompson
    Keith Thompson over 11 years
    @H2CO3: On my system, sizeof (wchar_t) == 4. You also seem to be making assumptions about endianness.
  • Admin
    Admin over 11 years
    @KeithThompson yup, that sizeof is perfectly fine. And no, I am not making assumptions about endianness - whether it be little or big endian, it's easier to conceive the essentials if I write all this using big endian notation...
  • Kosmo零
    Kosmo零 over 11 years
    I am trying to scan a Chinese exe for text strings, and for that I need to know how many null bytes are at the end: two or four.
  • Keith Thompson
    Keith Thompson over 11 years
    @H2CO3: "only extended characters start by a non-zero first byte" -- that assumes big-endian (with your recent edit, you've made the assumption explicit).
  • Admin
    Admin over 11 years
    @KeithThompson yes, sorry, you're correct - modern processor architectures that count use the counterintuitive little-endian notation, so that's why I was confusing them...
  • Mooing Duck
    Mooing Duck over 11 years
    Since this question is about the double-byte null at the end of the string, it's very strange that your sample string doesn't demonstrate that.
  • jcsahnwaldt Reinstate Monica
    jcsahnwaldt Reinstate Monica over 6 years
    HELLO is 72 00 69 00 76 00 76 00 79 00 in little-endian byte order. The "end" in "endian" actually means the "front end" of the sequence: "In big-endian format, the most significant byte is stored first (has the lowest address) or sent first, then the following bytes are stored or sent in decreasing significance order, with the least significant byte stored last (having the highest address) or sent last." en.wikipedia.org/wiki/Endianness