In C# string/character encoding, what is the difference between GetBytes(), GetString() and Convert()?

After a much troubled and confusing morning, we found the answer to this problem.

The key point we were missing, which made this so confusing, is that a .NET string is always held in memory as 16-bit (2-byte) UTF-16 code units, which is what the framework calls Unicode. This means that when we call GetString() on the bytes, they are decoded straight back into that UTF-16 form behind the scenes, and we are no better off than when we started.
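
To make that concrete, here is a minimal sketch of the round trip. The class and method names (RoundTripDemo, ViaUtf8) are made up for illustration; only Encoding.UTF8.GetBytes() and Encoding.UTF8.GetString() are framework calls.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: encode to UTF-8 bytes, then decode straight back.
    public static string ViaUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "Convert: \u10A0";
        string roundTripped = ViaUtf8(original);

        // The decoded string holds the same UTF-16 code units as the original,
        // so nothing about it is "UTF-8" any more.
        Console.WriteLine(string.Equals(original, roundTripped, StringComparison.Ordinal)); // True
    }
}
```

In other words, the string that comes back from GetString() is indistinguishable from the one we started with.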

When we started seeing character errors and double-byte data at the other end, we knew something was wrong, but at a glance we couldn't see anything wrong with the code. Once we understood the point above, we realised that we needed to send the byte array itself if we wanted to preserve the encoding. Luckily, MicrosoftFunc() had an overload that takes a byte array instead of a string. This meant we could convert the Unicode string to an encoding of our choice and send it off exactly as expected. The code changed to:

// Convert from a Unicode string to an array of bytes (encoded as UTF-8).
byte[] source = Encoding.UTF8.GetBytes(unicode);

// Send the encoded byte array directly! Do not send it as a Unicode string.
MicrosoftFunc(source);
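
As a rough check of what actually goes over the wire, the sketch below compares the UTF-8 and UTF-16 bytes for the sample string from the question. WireBytesDemo and its helpers are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

public static class WireBytesDemo
{
    // Hypothetical helpers wrapping the two framework encodings.
    public static byte[] ToUtf8(string text) => Encoding.UTF8.GetBytes(text);
    public static byte[] ToUtf16(string text) => Encoding.Unicode.GetBytes(text);

    public static void Main()
    {
        string unicode = "Convert: \u10A0";

        // Nine ASCII characters take one byte each; U+10A0 takes three (E1 82 A0).
        Console.WriteLine(ToUtf8(unicode).Length);  // 12

        // Every character here is a single UTF-16 code unit, so two bytes each.
        Console.WriteLine(ToUtf16(unicode).Length); // 20

        // Hex dump of the UTF-8 bytes, for comparing with the receiving end.
        Console.WriteLine(BitConverter.ToString(ToUtf8(unicode)));
    }
}
```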

Summary:

From the above we can see that:

  • GetBytes(), amongst other things, does the equivalent of an Encoding.Convert() from Unicode (because strings are always Unicode) to the encoding the function was called on, and returns an array of encoded bytes.
  • GetString(), amongst other things, does the equivalent of an Encoding.Convert() from the encoding the function was called on to Unicode (because strings are always Unicode), and returns the result as a string object.
  • Convert() converts a byte array in one encoding to a byte array in another encoding. Strings cannot be passed to it directly (because strings are always Unicode).
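
The relationship between the three can be sketched as follows, assuming valid input text. ConvertDemo, ConvertByHand and SameAsConvert are hypothetical names used only for this example; the framework calls are Encoding.Convert(), GetString() and GetBytes().

```csharp
using System;
using System.Linq;
using System.Text;

public static class ConvertDemo
{
    // Decode with the source encoding, then encode with the destination one:
    // a two-step stand-in for Encoding.Convert().
    public static byte[] ConvertByHand(Encoding src, Encoding dst, byte[] bytes)
    {
        string text = src.GetString(bytes);
        return dst.GetBytes(text);
    }

    // Checks that Convert(), the two-step version, and a direct GetBytes() agree.
    public static bool SameAsConvert(string text)
    {
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        byte[] viaConvert = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
        byte[] byHand = ConvertByHand(Encoding.Unicode, Encoding.UTF8, utf16);
        byte[] direct = Encoding.UTF8.GetBytes(text);
        return viaConvert.SequenceEqual(byHand) && viaConvert.SequenceEqual(direct);
    }

    public static void Main()
    {
        Console.WriteLine(SameAsConvert("Convert: \u10A0")); // True
    }
}
```

So when you already have a string, a single GetBytes() call on the target encoding gives the same bytes as going through Convert().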

Author: Ryall

Updated on April 17, 2022

Comments

  • Ryall about 2 years

    We are having trouble getting a Unicode string to convert to a UTF-8 string to send over the wire:

    // Start with our unicode string.
    string unicode = "Convert: \u10A0";
    
    // Get an array of bytes representing the unicode string, two for each character.
    byte[] source = Encoding.Unicode.GetBytes(unicode);
    
    // Convert the Unicode bytes to UTF-8 representation.
    byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);
    
    // Now that we have converted the bytes, save them to a new string.
    string utf8 = Encoding.UTF8.GetString(converted);
    
    // Send the converted string using a Microsoft function.
    MicrosoftFunc(utf8);
    

    Although we have converted the string to UTF-8, it's not arriving as UTF-8.

  • Christoffer Hammarström almost 13 years
    There is some confusion here. There is no encoding called Unicode. Unicode is the name of a character set, which can be encoded in bytes using an encoding, for example UTF-8 or UTF-16. Thus Encoding.Unicode is severely misnamed, since it implements little-endian UTF-16 encoding. It should really have been called Encoding.UTF16LE. Strings are sequences of characters, and what encoding they're stored as in the underlying platform is irrelevant. It's an implementation detail that they happen to be stored as UTF-16.
  • thnee over 7 years
    There is nothing wrong with calling it Encoding.Unicode; at some level Unicode is an encoding. The fact that a platform chooses to use UTF-16 or UTF-8 is just an implementation detail. When you use the string, it doesn't really matter what encoding it has internally. As long as the platform provides methods to encode in and out, you don't necessarily even have to know what the internal encoding is at all. Some languages, Python for example, don't mention any encoding at all in the string API; they just call it "a string" and you encode to and decode from that, which is an even cleaner approach.
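
For anyone wanting to check the naming point raised in the comments, here is a small sketch showing that Encoding.Unicode produces little-endian UTF-16 bytes. Utf16Demo and Hex are illustrative names only.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    // Hypothetical helper: hex dump of a string's bytes in the given encoding.
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));

    public static void Main()
    {
        Console.WriteLine(Hex(Encoding.Unicode, "A"));          // 41-00 (low byte first)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, "A")); // 00-41 (high byte first)
        Console.WriteLine(Hex(Encoding.UTF8, "A"));             // 41
    }
}
```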