Convert.FromBase64String returns unicode sometimes, or UTF-8

Solution 1

First of all, I want to debunk the title of the question:

Convert.FromBase64String() returns Unicode sometimes, or UTF-8

That is not the case. Given the same input, valid base64-encoded text, Convert.FromBase64String() always returns the same output.

Moving on, you cannot determine definitively, just by examining the payload, the encoding used for a string. You attempt to do this with

if (b64[1] == 0)
    // encoding must be UTF-16

This test does not work. The overwhelming majority of UTF-16 character elements fail it, because their second byte is non-zero. No matter how you try to write such a test, it is doomed to fail, because there exist byte arrays that are well-defined strings when interpreted in different encodings. In other words, it is possible to construct byte arrays that are valid when considered as either UTF-8 or UTF-16.
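
To make the ambiguity concrete, here is a minimal sketch. The class name AmbiguityDemo and the helper LooksLikeUtf16 are made up for illustration; the helper simply restates the b64[1] == 0 check from the question. It shows one UTF-16 string that fails the check and one byte array that decodes cleanly under both encodings.

```csharp
using System;
using System.Text;

public static class AmbiguityDemo
{
    // Hypothetical helper: the heuristic from the question, restated.
    public static bool LooksLikeUtf16(byte[] bytes)
    {
        return bytes.Length > 1 && bytes[1] == 0;
    }

    public static void Main()
    {
        // The euro sign (U+20AC) is encoded as AC 20 in UTF-16LE,
        // so the second byte is not zero and the heuristic rejects it.
        byte[] euro = Encoding.Unicode.GetBytes("\u20AC");
        Console.WriteLine(LooksLikeUtf16(euro)); // False

        // The bytes 61 62 are "ab" in UTF-8 and the single
        // code unit U+6261 in UTF-16LE. Both readings are valid.
        byte[] ambiguous = { 0x61, 0x62 };
        Console.WriteLine(Encoding.UTF8.GetString(ambiguous)); // ab
        string asUtf16 = Encoding.Unicode.GetString(ambiguous);
        Console.WriteLine("U+{0:X4}", (int)asUtf16[0]); // U+6261
    }
}
```

Nothing in the two bytes themselves says which reading was intended, which is why the encoding has to be known from elsewhere.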

So, you have to know a priori whether the payload is encoded as UTF-16, UTF-8 or indeed some other encoding.

The solution will be to keep track of the original encoding, before the base64 encoding. Pass that information along with the base64 encoded payload. Then when you decode, you can determine which Encoding to use to decode back to a string.
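
One way to carry the encoding along with the payload is sketched below. The "name:payload" format and the TaggedBase64 class are assumptions made up for this example, not an established convention; any scheme that stores the encoding name next to the base64 text would do. The colon is a safe separator here because it appears neither in the base64 alphabet nor in the encoding names used.

```csharp
using System;
using System.Text;

public static class TaggedBase64
{
    // Hypothetical wire format: "<encoding name>:<base64 payload>".
    public static string Encode(string text, Encoding encoding)
    {
        return encoding.WebName + ":" + Convert.ToBase64String(encoding.GetBytes(text));
    }

    public static string Decode(string tagged)
    {
        int separator = tagged.IndexOf(':');
        if (separator < 0)
        {
            throw new FormatException("Missing encoding tag.");
        }

        // Look the encoding up by the name stored in front of the payload.
        Encoding encoding = Encoding.GetEncoding(tagged.Substring(0, separator));
        byte[] bytes = Convert.FromBase64String(tagged.Substring(separator + 1));
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        string viaUtf8 = Encode("password1", Encoding.UTF8);
        string viaUtf16 = Encode("password1", Encoding.Unicode);
        Console.WriteLine(viaUtf8);          // utf-8:cGFzc3dvcmQx
        Console.WriteLine(viaUtf16);         // utf-16:cABhAHMAcwB3AG8AcgBkADEA
        Console.WriteLine(Decode(viaUtf8));  // password1
        Console.WriteLine(Decode(viaUtf16)); // password1
    }
}
```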

It looks very much as though your strings all come from UTF-16 .NET strings. In that case you will never have UTF-8 strings, and you should always decode with UTF-16. That is, use Encoding.Unicode.GetString().

Also, the GetBytes method in your code is poor. It should be:

public static byte[] GetBytes(this string str)
{
    return Encoding.Unicode.GetBytes(str);
}

Another oddity:

String corrected = new string(input.ToCharArray());

This is a no-op.

Finally, it is quite likely that your text will be more compact when encoded as UTF-8. So perhaps you should consider doing that before applying the base64 encoding.
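
As a rough illustration of the size difference, the sketch below encodes the same text both ways. The SizeComparison class and its ToBase64 helper are hypothetical names for this example. The saving applies to text that is mostly ASCII; characters that need three bytes in UTF-8 but two in UTF-16 tip the balance the other way.

```csharp
using System;
using System.Text;

public static class SizeComparison
{
    // Hypothetical helper: base64 of the text's bytes in the given encoding.
    public static string ToBase64(string text, Encoding encoding)
    {
        return Convert.ToBase64String(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string utf16 = ToBase64("password1", Encoding.Unicode);
        string utf8 = ToBase64("password1", Encoding.UTF8);

        // 18 bytes of UTF-16 become 24 base64 characters.
        Console.WriteLine("{0} ({1} chars)", utf16, utf16.Length); // cABhAHMAcwB3AG8AcgBkADEA (24 chars)

        // 9 bytes of UTF-8 become 12 base64 characters.
        Console.WriteLine("{0} ({1} chars)", utf8, utf8.Length);   // cGFzc3dvcmQx (12 chars)
    }
}
```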


Regarding your update, what you state is incorrect. This code:

string str = Encoding.Unicode.GetString(
    Convert.FromBase64String("cABhAHMAcwB3AG8AcgBkADEA"));

assigns password1 to str wherever it is run.

Solution 2

Try revising the code to make it a little more readable and accurate. As mentioned in my comment and in David Hefferman's answer, you're trying to do things that either:

A) do nothing

or

B) demonstrate flawed logic

The following code based upon yours works fine:

class Program
{
    static void Main(string[] args)
    {
        string original = "password1";
        string encoded = original.ToBase64();
        string decoded = encoded.FromBase64();
        Console.WriteLine("Original: {0}", original);
        Console.WriteLine("Encoded: {0}", encoded);
        Console.WriteLine("Decoded: {0}", decoded);
    }
}

public static class Extensions
{
    public static string FromBase64(this string input)
    {
        return System.Text.Encoding.Unicode.GetString(Convert.FromBase64String(input));
    }

    public static string ToBase64(this string input)
    {
        return Convert.ToBase64String(input.GetBytes());
    }

    public static byte[] GetBytes(this string str)
    {
        return System.Text.Encoding.Unicode.GetBytes(str);
    }
}

Author: david.tanner

Updated on January 22, 2020

Comments

  • david.tanner
    david.tanner over 4 years

    Sometimes the byte array b64 is UTF-8, and other times it is UTF-16. I keep reading online that C# strings are always UTF-16, but that is not the case for me here. Why is this happening, and how do I fix it?

    I have a simple method for converting a base64 string to a normal string:

    public static string FromBase64(this string input)
    {
        String corrected = new string(input.ToCharArray());
        byte[] b64 = Convert.FromBase64String(corrected);
        if (b64[1] == 0)
        {
            return System.Text.Encoding.Unicode.GetString(b64);
        }
        else
        {
            return System.Text.Encoding.UTF8.GetString(b64);
        }
    
    }
    

    The same thing is happening to my base 64 encoder:

    public static string ToBase64(this string input)
    {
        String b64 = Convert.ToBase64String(input.GetBytes());
        return b64;
    }
    
    public static byte[] GetBytes(this string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }
    

    Example: On my computer, "cABhAHMAcwB3AG8AcgBkADEA" decodes to:

    'p','\0','a','\0','s','\0','s','\0','w','\0','o','\0','r','\0','d','\0','1','\0'
    

    But on my coworker's computer it is:

    'p','a','s','s','w','o','r','d','1'
    

    Edit:

    I know that the string I create comes from a textbox, and that the file where I am saving it to is always going to be UTF-8, so everything is pointing to the Convert method causing my encoding switch.

    Update:

    After digging in further, it appears that my coworker had a very important line commented out in his version of the code: the one that saves the value read from the file to the hashtable. The default value I was using is a UTF-8 base64 value, so I am going to correct the default to a UTF-16 value. Then I can clean up the code by removing any UTF-8 references.

    Also, I had been naively using a UTF-8 base64 encoding I had retrieved from a website, not realizing what I was getting myself into. The funny part is that I would never have found this out if my coworker hadn't commented out the line that saves the values from the file.

    Final version of the code:

    public static string FromBase64(this string input)
    {
        byte[] b64 = Convert.FromBase64String(input);
        return System.Text.Encoding.Unicode.GetString(b64);
    }
    
    public static string ToBase64(this string input)
    {
        String b64 = Convert.ToBase64String(input.GetBytes());
        return b64;
    }
    
    public static byte[] GetBytes(this string str)
    {
        return System.Text.Encoding.Unicode.GetBytes(str);
    }