Converting UTF-16 to UTF-8

Your string appears to have been encoded using utf-16be:

In [8]: import binascii
In [9]: s = "2DF5".encode("utf-16be")
In [11]: print binascii.hexlify(s)
0032004400460035

So, to convert it to UTF-8, first decode the bytes into a unicode string, then encode that string as UTF-8:

In [14]: uni = s.decode("utf-16be")
In [15]: uni
Out[15]: u'2DF5'

In [16]: utf = uni.encode("utf-8")
In [17]: utf
Out[17]: '2DF5'

or, in one step:

In [13]: s.decode("utf-16be").encode("utf-8")
Out[13]: '2DF5'
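
The session above is Python 2. As a rough sketch of the same decode-then-encode round trip in Python 3 (where `bytes` and `str` are distinct types), assuming the file contents are big-endian UTF-16 without a BOM, as the hex dump in the question suggests; the names `raw`, `text`, and `utf8` are only illustrative:

```python
import binascii

# Raw bytes as they might be read from the file
# (assumption: UTF-16BE with no byte order mark).
raw = bytes.fromhex("0032004400460035")

text = raw.decode("utf-16-be")  # bytes -> str
utf8 = text.encode("utf-8")     # str -> UTF-8 encoded bytes

print(text)                                    # 2DF5
print(binascii.hexlify(utf8).decode("ascii"))  # 32444635
```

In Python 3 the intermediate `str` is what you would normally keep and work with; encoding back to bytes is only needed when writing the data out again.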
Author by

Juicy

Updated on July 04, 2020

Comments

  • Juicy
    Juicy almost 4 years

    I'm loading a string from a file. When I print out the string with:

    print my_string
    print binascii.hexlify(my_string)
    

    I get:

    2DF5
    0032004400460035
    

    Meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:

    2DF5
    32444635
    

    I've tried:

    my_string.decode('utf-8')
    

    Which output:

    32004400460035
    

    EDIT:

    Here's a quick sample:

        import binascii
        hello = 'hello'.encode('utf-16')
        print hello
        print binascii.hexlify(hello)
    
        hello = hello[2:].decode('utf-8')
        print hello
        print binascii.hexlify(hello)
    

    Which produces this output:

    ��hello
    fffe680065006c006c006f00
    hello
    680065006c006c006f00
    

    Expected output would be:

    ��hello
    fffe680065006c006c006f00
    hello
    68656c6c6f
    
  • Martijn Pieters
    Martijn Pieters almost 9 years
    But take into account there could be a BOM in the actual, real-world data.
  • Juicy
    Juicy almost 9 years
    Thank you, I was not aware of UTF-16be and that was the issue!
  • Martijn Pieters
    Martijn Pieters almost 9 years
    @Juicy: Note that you have a BOM in your actual data; there is no need to pick be or le when you have a BOM, just decode as UTF-16 and the BOM is then not part of the decoded value.
  • Juicy
    Juicy almost 9 years
    @MartijnPieters Thanks, TBH I don't script things like this very often and didn't even know what a BOM is. I'll read up on it for the future!
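
The BOM point raised in the comments can be sketched roughly as follows (Python 3). The helper name `utf16_to_utf8` and its `default` parameter are hypothetical, and falling back to big-endian when no BOM is present is an assumption based on the data in the question, not a general rule:

```python
import binascii
import codecs


def utf16_to_utf8(raw, default="utf-16-be"):
    """Hypothetical helper: re-encode UTF-16 bytes as UTF-8 bytes."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM to pick the byte order
        # and does not include the BOM in the decoded text.
        return raw.decode("utf-16").encode("utf-8")
    # No BOM: the byte order has to be assumed (here: `default`).
    return raw.decode(default).encode("utf-8")


with_bom = "hello".encode("utf-16")       # BOM followed by native byte order
without_bom = "hello".encode("utf-16-be") # explicit byte order, no BOM

print(binascii.hexlify(utf16_to_utf8(with_bom)).decode("ascii"))     # 68656c6c6f
print(binascii.hexlify(utf16_to_utf8(without_bom)).decode("ascii"))  # 68656c6c6f
```

This avoids slicing the BOM off by hand with `hello[2:]`, which removes the marker but still leaves UTF-16 bytes that were never actually decoded.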