Converting string to byte array in C#

Solution 1

If you already have a byte array, you need to know which encoding was used to produce it.

For example, if the byte array was created like this:

byte[] bytes = Encoding.ASCII.GetBytes(someString);

You will need to turn it back into a string like this:

string someString = Encoding.ASCII.GetString(bytes);

If you can find the encoding that was used to create the byte array in the code you inherited, you should be set.
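To see why the encoding has to match on both sides, here is a minimal sketch. The class and method names (RoundTripDemo, Transcode) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: encodes text with one encoding and decodes it with another.
    public static string Transcode(string text, Encoding encodeWith, Encoding decodeWith)
    {
        byte[] bytes = encodeWith.GetBytes(text);
        return decodeWith.GetString(bytes);
    }

    public static void Main()
    {
        string original = "na\u00EFve"; // "naive" with a diaeresis on the i

        // Same encoding on both sides: the text survives the round trip.
        Console.WriteLine(Transcode(original, Encoding.UTF8, Encoding.UTF8) == original);    // True

        // UTF-8 bytes decoded as UTF-16LE: the result is a different string.
        Console.WriteLine(Transcode(original, Encoding.UTF8, Encoding.Unicode) == original); // False
    }
}
```

If the original encoding is unknown, guessing is the only option, and a wrong guess usually goes unnoticed until non-ASCII text shows up.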

Solution 2

First of all, add the System.Text namespace

using System.Text;

Then use this code

string input = "some text"; 
byte[] array = Encoding.ASCII.GetBytes(input);

Hope this fixes it!

Solution 3

Encoding.Default should not be used...

Some answers use Encoding.Default; however, Microsoft warns against it:

Different computers can use different encodings as the default, and the default encoding can change on a single computer. If you use the Default encoding to encode and decode data streamed between computers or retrieved at different times on the same computer, it may translate that data incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback [i.e. the encoding is totally screwed up, so you can't reencode it back] to map unsupported characters to characters supported by the code page. For these reasons, using the default encoding is not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding. You could also use a higher-level protocol to ensure that the same format is used for encoding and decoding.

To check what the default encoding is, use Encoding.Default.WindowsCodePage (1250 in my case - and sadly, there is no predefined class of CP1250 encoding, but the object could be retrieved as Encoding.GetEncoding(1250)).
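A small sketch of that check follows. The class name CodePageDemo is made up. The printed values depend on the machine and the runtime, so no output is shown. As far as I know, on .NET Core and later the legacy code pages are only available after a provider such as CodePagesEncodingProvider.Instance has been registered with Encoding.RegisterProvider, which is why the lookup is wrapped in a try/catch here.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    public static void Main()
    {
        // What this machine and runtime consider the "default" encoding.
        Console.WriteLine(Encoding.Default.WebName);
        Console.WriteLine(Encoding.Default.WindowsCodePage);

        try
        {
            // Code page 1250 has no predefined property, so it is looked up by number.
            Encoding cp1250 = Encoding.GetEncoding(1250);
            Console.WriteLine(cp1250.WebName);
        }
        catch (Exception ex) when (ex is NotSupportedException || ex is ArgumentException)
        {
            // Assumption: the runtime has no data for this code page
            // unless a code page provider was registered first.
            Console.WriteLine("Code page 1250 is not available: " + ex.Message);
        }
    }
}
```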

...UTF-8/UTF-16LE encoding should be used instead...

Encoding.ASCII, used in the highest-scoring answer, is 7-bit, so it doesn't work either in my case:

byte[] pass = Encoding.ASCII.GetBytes("šarže");
Console.WriteLine(Encoding.ASCII.GetString(pass)); // ?ar?e

Following Microsoft's recommendation:

var utf8 = new UTF8Encoding();
byte[] pass = utf8.GetBytes("šarže");
Console.WriteLine(utf8.GetString(pass)); // šarže

Encoding.UTF8, recommended by others, is an instance of the UTF-8 encoding and can also be used directly or as

var utf8 = Encoding.UTF8 as UTF8Encoding;

Encoding.Unicode (UTF-16LE) is popular for representing strings in memory because every char is a fixed 2-byte code unit, so one can jump to the n-th char in constant time at the cost of more memory. Note that characters outside the Basic Multilingual Plane take two chars (a surrogate pair). In Visual C#, the *.cs files are saved as UTF-8 with BOM by default and the string constants in them are converted to UTF-16LE at compile time (see @OwnageIsMagic's comment), but UTF-16LE is NOT defined as the default: many classes, such as StreamWriter, use UTF-8 by default.
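The size trade-off between the two encodings can be checked with GetByteCount. This is only an illustrative sketch; the class name ByteCountDemo and the sample strings are my own choice.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    public static void Main()
    {
        string ascii = "abc";
        string czech = "\u0161ar\u017Ee"; // the word used in the example above
        string emoji = "\U0001F600";      // one code point outside the BMP

        foreach (string s in new[] { ascii, czech, emoji })
        {
            // Length counts UTF-16 code units (chars), not characters.
            Console.WriteLine("{0} chars, {1} UTF-8 bytes, {2} UTF-16 bytes",
                s.Length,
                Encoding.UTF8.GetByteCount(s),
                Encoding.Unicode.GetByteCount(s));
        }
        // 3 chars, 3 UTF-8 bytes, 6 UTF-16 bytes
        // 5 chars, 7 UTF-8 bytes, 10 UTF-16 bytes
        // 2 chars, 4 UTF-8 bytes, 4 UTF-16 bytes
    }
}
```

The last line shows the surrogate pair: a single character occupies two chars, so indexing by char is constant time but indexing by character is not.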

...but it is not always used

The default encoding is misleading: .NET uses UTF-8 everywhere (including strings hardcoded in the source code) and UTF-16LE (Encoding.Unicode) to store strings in memory, but Windows actually uses two other non-UTF-8 defaults: the ANSI code page (for GUI apps before .NET) and the OEM code page (aka the DOS standard). These differ from country to country (for instance, the Czech edition of Windows uses CP1250 and CP852) and are often hardcoded in Windows API libraries. So if you just set the console to UTF-8 with chcp 65001 (as .NET implicitly does, pretending it is the default) and run some localized command (like ping), it works in the English version, but you get tofu text in the Czech Republic.

Let me share my real-world experience: I created a WinForms application that customizes git scripts for teachers. The output is obtained asynchronously in the background by a process that Microsoft describes as follows (bold text added by me):

The word "shell" in this context (UseShellExecute) refers to a graphical shell (ANSI CP) (similar to the Windows shell) rather than command shells (for example, bash or sh) (OEM CP) and lets users launch graphical applications or open documents (with messed output in non-US environment).

So effectively the GUI defaults to UTF-8, the process defaults to CP1250, and the console defaults to CP852. The output is therefore CP852 interpreted as UTF-8 interpreted as CP1250. I got tofu text from which I could not deduce the original code page because of the double conversion. I was pulling my hair out for a week before I figured out that I had to explicitly set UTF-8 for the process script and convert the output from CP1250 to UTF-8 in the main thread. Now it works here in Eastern Europe, but Western European Windows uses CP1252. The ANSI code page is not easy to determine, as many commands like systeminfo are also localized and other methods differ from version to version: in such an environment, displaying national characters reliably is almost unfeasible.
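The two steps described above (an explicit encoding for the child process and a conversion between code pages) could look roughly like this. It is a sketch, not my original application code: the names ProcessEncodingDemo, CreateStartInfo and Reencode are made up, and ISO-8859-1 stands in for CP1250 because, to my knowledge, it is available on every runtime without registering a code page provider.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ProcessEncodingDemo
{
    // Hypothetical helper: start info whose redirected output is decoded
    // with an explicitly chosen encoding instead of a default code page.
    public static ProcessStartInfo CreateStartInfo(string fileName, string arguments, Encoding outputEncoding)
    {
        return new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,        // redirection requires this to be false
            RedirectStandardOutput = true,
            StandardOutputEncoding = outputEncoding,
            CreateNoWindow = true
        };
    }

    // Hypothetical helper: re-encodes raw bytes from one encoding to another.
    public static byte[] Reencode(byte[] bytes, Encoding from, Encoding to)
    {
        return Encoding.Convert(from, to, bytes);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 is used here only as a stand-in for a legacy code page.
        Encoding legacy = Encoding.GetEncoding("iso-8859-1");
        byte[] legacyBytes = { 0xE9 }; // U+00E9 in ISO-8859-1
        byte[] utf8Bytes = Reencode(legacyBytes, legacy, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C3-A9
    }
}
```

The start info would then be passed to Process.Start, for example CreateStartInfo("git", "status", Encoding.UTF8). Whether the child process actually writes UTF-8 still has to be arranged on its side.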

So until the middle of the 21st century, please DO NOT use any "default code page"; set the encoding explicitly (to UTF-8 or UTF-16LE if possible).

Solution 4

var result = System.Text.Encoding.Unicode.GetBytes(text);
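For reference, this is what the resulting bytes look like for a short ASCII-only string. The sketch below is just an illustration; the class name Utf16LayoutDemo and the sample text are assumptions.

```csharp
using System;
using System.Text;

public static class Utf16LayoutDemo
{
    public static void Main()
    {
        string text = "Hi";

        // Encoding.Unicode is UTF-16 little-endian: low byte first.
        byte[] littleEndian = Encoding.Unicode.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(littleEndian)); // 48-00-69-00

        // Encoding.BigEndianUnicode writes the same code units high byte first.
        byte[] bigEndian = Encoding.BigEndianUnicode.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(bigEndian));    // 00-48-00-69

        // Decoding with the same encoding restores the string.
        Console.WriteLine(Encoding.Unicode.GetString(littleEndian)); // Hi
    }
}
```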

Solution 5

Also you can use an Extension Method to add a method to the string type as below:

static class Helper
{
   public static byte[] ToByteArray(this string str)
   {
      return System.Text.Encoding.ASCII.GetBytes(str);
   }
}

And use it like below:

string foo = "bla bla";
byte[] result = foo.ToByteArray();
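A possible variation, not part of the answer above, lets the caller choose the encoding so that ASCII is not baked in silently. The names StringByteExtensions and Program are made up, and falling back to UTF-8 is my assumption about a sensible default. The parameter defaults to null because Encoding.UTF8 is not a compile-time constant.

```csharp
using System;
using System.Text;

public static class StringByteExtensions
{
    // Hypothetical variant: uses UTF-8 when no encoding is passed.
    public static byte[] ToByteArray(this string str, Encoding encoding = null)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        return (encoding ?? Encoding.UTF8).GetBytes(str);
    }
}

public static class Program
{
    public static void Main()
    {
        string foo = "bla bla";
        byte[] utf8 = foo.ToByteArray();                 // UTF-8 by default
        byte[] utf16 = foo.ToByteArray(Encoding.Unicode); // explicit choice
        Console.WriteLine(utf8.Length);  // 7
        Console.WriteLine(utf16.Length); // 14
    }
}
```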
Author by

nouptime

Updated on January 15, 2022

Comments

  • nouptime
    nouptime over 2 years

    I'm converting something from VB into C#. Having a problem with the syntax of this statement:

    if ((searchResult.Properties["user"].Count > 0))
    {
        profile.User = System.Text.Encoding.UTF8.GetString(searchResult.Properties["user"][0]);
    }
    

    I then see the following errors:

    Argument 1: cannot convert from 'object' to 'byte[]'

    The best overloaded method match for 'System.Text.Encoding.GetString(byte[])' has some invalid arguments

    I tried to fix the code based on this post, but still no success

    string User = Encoding.UTF8.GetString("user", 0);
    

    Any suggestions?

  • nouptime
    nouptime about 11 years
    Timothy, I've looked through the VB code and I can't seem to find a byte array as you have mentioned.
  • Timothy Randall
    Timothy Randall about 11 years
    On your search result, what is the type of the Properties property?
  • nouptime
    nouptime about 11 years
    All I can see is that there are a number of items attached to Properties as a string. I'm not sure if that's what you were asking me though.
  • Tom Blodget
    Tom Blodget over 8 years
    char and string are UTF-16 by definition.
  • Mandar Sudame
    Mandar Sudame over 8 years
    Yes the default is UTF-16. I am not making any assumptions on Encoding of the input string.
  • Tom Blodget
    Tom Blodget over 8 years
    There is no text but encoded text. Your input is type string and is therefore UTF-16. UTF-16 is not the default; there is no choice about it. You then split into char[], UTF-16 code units. You then call Convert.ToByte(Char), which just happens to convert U+0000 to U+00FF to ISO-8859-1, and mangles any other codepoints.
  • Mandar Sudame
    Mandar Sudame over 8 years
    Makes sense. Thanks for the clarification. Updating my answer.
  • Tom Blodget
    Tom Blodget over 8 years
    I think you are still missing several essential points. Focus on char being 16 bits and Convert.ToByte() throwing half of them away.
  • Mandar Sudame
    Mandar Sudame over 8 years
    Thanks for catching that. My solution will work only if the chars can be represented by 1 byte (ASCII)
  • Andi AR
    Andi AR over 7 years
    This solution doesn't work with this string "㯪", but #Eran Yogev's solution works.
  • OzBob
    OzBob over 7 years
    @AndiAR try Encoding.UTF8.GetBytes(somestring)
  • Gerard ONeill
    Gerard ONeill over 7 years
    This will fail for characters that fall into the surrogate pair range.. GetBytes will have a byte array that misses one normal char per surrogate pair off the end. The GetString will have empty chars at the end. The only way it would work is if microsoft's default were UTF32, or if characters in the surrogate pair range were not allowed. Or is there something I'm not seeing? The proper way is to 'encode' the string into bytes.
  • Eran Yogev
    Eran Yogev over 7 years
    Correct, for a wider range you can use something similar to #Timothy Randall's solution:

    using System;
    using System.Text;

    namespace Example
    {
        public class Program
        {
            public static void Main(string[] args)
            {
                string s1 = "Hello World";
                string s2 = "שלום עולם";
                string s3 = "你好,世界!";
                Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s1)));
                Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s2)));
                Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s3)));
            }
        }
    }
  • Jacklynn
    Jacklynn almost 7 years
    Convert.ToByte(char) doesn't work like you think it would. The character '2' is converted to the byte 2, not the byte that represents the character '2'. Use mystring.Select(x => (byte)x).ToArray() instead.
  • T Blank
    T Blank almost 7 years
    I'd rename that method to include the fact that it's using ASCII encoding. Something like ToASCIIByteArray. I hate when I find out some library I'm using uses ASCII and I'm assuming it's using UTF-8 or something more modern.
  • royalTS
    royalTS over 6 years
    If you do not care about the encoding, you could use Encoding.Default.GetBytes()
  • Jeff
    Jeff about 6 years
    For my situation I found that Encoding.Unicode.GetBytes worked (but ASCII didn't)
  • Aisah Hamzah
    Aisah Hamzah over 5 years
    This should be the accepted answer, as the other answers suggest ASCII, but the encoding is either Unicode (which is UTF-16) or UTF-8.
  • Douglas Gaskell
    Douglas Gaskell about 5 years
    Note that using Encoding encoding = Encoding.Default results in a compile time error: CS1736 Default parameter value for 'encoding' must be a compile-time constant
  • Welcor
    Welcor over 4 years
    that only works when your string only contains a-z, A-Z, 0-9, +, /. No other characters are allowed de.wikipedia.org/wiki/Base64
  • astef
    astef almost 4 years
    @EranYogev why it should fail? I have tested it for the whole range of System.Int32 and it was correct. Can you please explain here or in this question: stackoverflow.com/questions/64077979/…
  • Faither
    Faither over 3 years
    Indeed, @Abel. The C# currently uses UTF-16 as default and encoding such makes sense more than ASCII. Depends of project of course, but this is default.
  • Elikill58
    Elikill58 almost 3 years
    Are you sure to be in the right place ? Why don't comment instead of posting a new answer that just add precision to another answer ?
  • Admin
    Admin almost 3 years
    Please provide additional details in your answer. As it's currently written, it's hard to understand your solution.
  • OwnageIsMagic
    OwnageIsMagic over 2 years
    Actually .Net and Windows use UTF-16 internally for strings. Win32 API can also accept strings encoded in Active Code Page (ACP) which it converts to UTF-16. OEM codepages are only used for console I/O.
  • Jan Turoň
    Jan Turoň over 2 years
    @OwnageIsMagic UTF-16LE is sometimes used internally for strings, but .NET interface uses UTF-8 as default, I added a note about Encoding.Unicode in the answer.
  • OwnageIsMagic
    OwnageIsMagic over 2 years
    How about some fact checking? github.com/dotnet/runtime/blob/… it uses WCHAR type which means 16 bits per character (UTF-16). Also sizeof(char) in C# is 2
  • OwnageIsMagic
    OwnageIsMagic over 2 years
    I'm quite sure that even the CLR specification enforces the use of UTF-16 for System.String encoding. And as for the encoding of the source file: it is completely irrelevant. The compiler converts the source file encoding to UTF-16 during compilation.
  • OwnageIsMagic
    OwnageIsMagic over 2 years
    You can specify source file encoding with docs.microsoft.com/en-us/dotnet/csharp/language-reference/… -codepage flag to Roslyn compiler. The compiler will first attempt to interpret all source files as UTF-8. If your source code files are in an encoding other than UTF-8 and use characters other than 7-bit ASCII characters, use the CodePage option to specify which code page should be used.
  • Jan Turoň
    Jan Turoň over 2 years
    @OwnageIsMagic thanks for the links, I attributed this piece to you in the answer.
  • Timothy C. Quinn
    Timothy C. Quinn about 2 years
    The question does not relate to Base64 Strings which have unique character restrictions.
  • Nick Turner
    Nick Turner about 2 years
    This answers it the best in a clean concise way. Could be in the other answer, but too much theory in that one.
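
Regarding the compile error quoted in the question ("cannot convert from 'object' to 'byte[]'"): the indexer returns object, so the value has to be cast or type-checked before it reaches GetString. Below is a hedged sketch. The helper name PropertyValueDemo.ToText is made up, and it assumes the property value is either a string or a UTF-8 byte array, which the thread does not confirm.

```csharp
using System;
using System.Text;

public static class PropertyValueDemo
{
    // Hypothetical helper: turns a loosely typed property value into text.
    public static string ToText(object value)
    {
        switch (value)
        {
            case null:
                return null;
            case string s:
                return s;                              // already text, nothing to decode
            case byte[] bytes:
                return Encoding.UTF8.GetString(bytes); // assumption: the bytes are UTF-8
            default:
                return value.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ToText(Encoding.UTF8.GetBytes("alice"))); // alice
        Console.WriteLine(ToText("bob"));                           // bob
    }
}
```

In the question's code this would be used as profile.User = PropertyValueDemo.ToText(searchResult.Properties["user"][0]);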