How do I convert a unicode to a string at the Python level?

Solution 1

You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.
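For example (a quick check; this assumes a console that can display accented characters):

>>> print u'Andr\xe9'
André
>>> len(u'Andr\xe9'), len(u'Andr\xc3\xa9')
(5, 6)

The second result shows that the string in the question holds six characters, not five: the é has been replaced by the two-character sequence Ã©.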

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'

Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'    

Now it is in the correct format.
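You can verify the round trip (assuming your console encoding can display the character):

>>> print ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
André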

However, instead of doing this, if possible you should try to work out why the data was incorrectly decoded in the first place, and fix that problem at its source.

Solution 2

You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""

In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:

Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.

Initial state: you have a unicode object that you have named u1. It contains e-acute:

>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'

You encode u1 as UTF-8 and name the result s:

>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'

You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.

>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>

Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 of its codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
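To make the "lucky" and "unlucky" cases concrete (continuing with s from above; the exact exception text may vary slightly between Python versions):

>>> s.decode('ascii')     # lucky: the mismatch raises an exception
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> s.decode('latin1')    # unlucky: silent gibberish, never an error
u'\xc3\xa9'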

Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
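Continuing the session above, that reversal recovers the original (u3 is just a name introduced here for illustration):

>>> u3 = u2.encode('latin1').decode('utf8')
>>> u3 == u1
True
>>> ucd.name(u3)
'LATIN SMALL LETTER E WITH ACUTE'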

Solution 3

value_uni.encode('utf8') or whatever encoding you need.

See http://docs.python.org/library/stdtypes.html#str.encode
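For a correctly formed unicode object this does what you would expect, but note that applying it to the mojibake string from the question double-encodes it, which is why the other answers reach for latin1 first (a quick illustration):

>>> u'Andr\xe9'.encode('utf8')        # a correctly formed unicode object
'Andr\xc3\xa9'
>>> u'Andr\xc3\xa9'.encode('utf8')    # the string from the question: double-encoded
'Andr\xc3\x83\xc2\xa9'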

Solution 4

If you have u'Andr\xc3\xa9', that is a Unicode string that was decoded from a byte string with the wrong encoding. The correct encoding is UTF-8. To convert it back to a byte string so you can decode it correctly, you can use the trick you discovered. The first 256 code points of Unicode are a 1:1 mapping with ISO-8859-1 (alias latin1) encoding. So:

>>> u'Andr\xc3\xa9'.encode('latin1')
'Andr\xc3\xa9'

Now it is a byte string that can be decoded correctly with utf8:

>>> 'Andr\xc3\xa9'.decode('utf8')
u'Andr\xe9'
>>> print 'Andr\xc3\xa9'.decode('utf8')
André

In one step:

>>> print u'Andr\xc3\xa9'.encode('latin1').decode('utf8')
André

Solution 5

The OP is not converting to ascii or utf-8, which is why the suggested encode methods won't work. Try this:

v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)), v))

The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application) and converts it to a single byte, and the ''.join call is an idiom that joins the resulting one-character strings back into an ordinary string. No doubt there is a more elegant way.
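One more elegant way is the latin1 trick from Solution 4, which produces the same byte string in a single call (assuming, as above, that every code point fits in one byte):

>>> v = u'Andr\xc3\xa9'
>>> v.encode('latin1') == ''.join(map(lambda x: chr(ord(x)), v))
True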

Comments

  • Thierry Lam almost 4 years

    The following unicode and string can exist on their own if defined explicitly:

    >>> value_str='Andr\xc3\xa9'
    >>> value_uni=u'Andr\xc3\xa9'
    

    If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?

    EDIT:

    I did the following:

    >>> value_uni.encode('latin-1')
    'Andr\xc3\xa9'
    

    which fixes my issue. Can someone explain to me what exactly is happening?