How to convert a UTF-8 string into Unicode?

137,069

Solution 1

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy, however it would be best to find the root cause; the location where someone is copying UTF-8 code units into 16 bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding. E.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).


Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding(1252), Encoding.UTF8);

Solution 2

I have string that displays UTF-8 encoded characters

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.

Solution 3

If you have a UTF-8 string, where every byte is correct ('Ö' -> [195, 0] , [150, 0]), you can use the following:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 bytes per char                                    *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the russian '?']                    *
     *                                                             *
     * UTF-16 = 2 bytes per char                                   *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the russian '?']                           *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the russian '?']              *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, than the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' which is 150 was converted to the UTF-16 '–' which is 8211. If you have this case too, you can use the following instead:

public static string Utf8ToUtf16(string utf8String)
{
    // Get UTF-8 bytes by reading each byte with ANSI encoding
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}

Or the Native-Method:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.

Solution 4

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is US Windows default. Here's how to reverse, assuming no other loss. One loss not immediately apparent is the non-breaking space (U+00A0) at the end of your string that is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "déjÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

déjà
Share:
137,069
remio
Author by

remio

Updated on October 14, 2020

Comments

  • remio
    remio over 3 years

    I have string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

    For now, my implementation is the following:

    public static string DecodeFromUtf8(this string utf8String)
    {
        // read the string as UTF-8 bytes.
        byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
    
        // convert them into unicode bytes.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
    
        // builds the converted string.
        return Encoding.Unicode.GetString(encodedBytes);
    }
    

    I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "déjÃ".

    Unfortunately, with this implementation the string just remains the same.

    Where am I wrong?