Converting UTF-16 to UTF-8

python encoding utf-8 utf-16

21,725

Your string appears to have been encoded using utf-16be:

In [8]: import binascii
In [9]: s = "2DF5".encode("utf-16be")
In [11]: print binascii.hexlify(s)
0032004400460035

So, in order to convert it to utf-8, you first need to decode it, then encode it:

In [14]: uni = s.decode("utf-16be")
In [15]: uni
Out[15]: u'2DF5'

In [16]: utf = uni.encode("utf-8")
In [17]: utf
Out[17]: '2DF5'

or, in one step:

In [13]: s.decode("utf-16be").encode("utf-8")
Out[13]: '2DF5'
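
On Python 3 the same decode-then-encode round trip applies, but bytes and str are distinct types, so the Python 2 session above needs small changes. A minimal sketch, assuming the raw bytes are exactly those shown in the hex dump (the variable names are illustrative):

```python
import binascii

# Raw UTF-16 big-endian bytes, taken from the hex dump above.
raw = bytes.fromhex("0032004400460035")

# Decode the bytes to text, then encode that text as UTF-8.
text = raw.decode("utf-16-be")
utf8 = text.encode("utf-8")

print(text)                    # 2DF5
print(binascii.hexlify(utf8))  # b'32444635'
```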

Author by

Juicy

Updated on July 04, 2020

Comments

Juicy almost 4 years

I'm loading a string from a file. When I print out the string with:

print my_string
print binascii.hexlify(my_string)

I get:

2DF5
0032004400460035

Meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:

2DF5
32444635

I've tried:

my_string.decode('utf-8')

Which outputs:

32004400460035

EDIT:

Here's a quick sample:

    import binascii

    hello = 'hello'.encode('utf-16')
    print hello
    print binascii.hexlify(hello)

    hello = hello[2:].decode('utf-8')
    print hello
    print binascii.hexlify(hello)

Which produces this output:

��hello
fffe680065006c006c006f00
hello
680065006c006c006f00

Expected output would be:

��hello
fffe680065006c006c006f00
hello
68656c6c6f

Martijn Pieters almost 9 years

But take into account there could be a BOM in the actual, real-world data.

Juicy almost 9 years

Thank you, I was not aware of UTF-16be and that was the issue!

Martijn Pieters almost 9 years

@Juicy: Note that you have a BOM in your actual data; there is no need to pick be or le when you have a BOM, just decode as UTF-16 and the BOM is then not part of the decoded value.

Juicy almost 9 years

@MartijnPieters Thanks, TBH I don't script things like this very often and didn't even know what a BOM is. I'll read up on it for the future!
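
The BOM point raised in the comments can be sketched as follows (Python 3 syntax; the byte string is the hex dump from the question's sample, and the variable names are illustrative). The plain utf-16 codec reads the BOM to pick the byte order and drops it from the result, while an explicit utf-16-le decode keeps it as a leading U+FEFF character:

```python
import binascii

# Bytes from the question's sample: BOM (ff fe) followed by "hello" in UTF-16LE.
data = bytes.fromhex("fffe680065006c006c006f00")

# The generic codec consumes the BOM and uses it to choose the byte order.
bom_aware = data.decode("utf-16")

# An explicit byte order leaves the BOM in the text as U+FEFF.
explicit_le = data.decode("utf-16-le")

print(repr(bom_aware))    # 'hello'
print(repr(explicit_le))  # '\ufeffhello'
print(binascii.hexlify(bom_aware.encode("utf-8")))  # b'68656c6c6f'
```

Slicing off the first two bytes by hand, as in the question's sample, is therefore only needed when decoding with an explicit byte order.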