How do I convert a unicode string with cp1252 characters into UTF-8 with Python?

It seems your string was decoded with latin1: it is of type unicode, yet it still contains \x92, which is the cp1252 byte for ’ rather than a printable Unicode character.
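
As a rough illustration of how such a string can arise (an assumption about what happens upstream, not something the question confirms): if the raw bytes are cp1252 but get decoded as latin1, byte 0x92 becomes the control character U+0092 instead of the right single quotation mark U+2019. The variable names here are made up for the example.

```python
# Hypothetical raw bytes, as a cp1252-encoded source might send them.
raw = b'There\x92s thirty days in June'

# Decoding with the wrong codec (latin1) maps byte 0x92 straight to U+0092,
# an invisible C1 control character.
wrong = raw.decode('latin1')

# Decoding with the intended codec (cp1252) maps byte 0x92 to U+2019.
right = raw.decode('cp1252')
```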

  1. To convert it back to the bytes it originally was, encode it using that same encoding (latin1).
  2. To get text (unicode) back, decode those bytes using the proper codec (cp1252).
  3. Finally, if you want UTF-8 bytes, encode the result using the UTF-8 codec.

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the API you pass the result to takes text (unicode) or bytes, step 3 may not be necessary.
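
The three steps above could be wrapped in a small helper along these lines. This is only a sketch: `fix_cp1252_as_latin1` is a hypothetical name, and returning the input unchanged on failure is an assumption about what you would want, not part of the original answer. The fallback matters because characters outside latin1 cannot be encoded in step 1, and a few bytes (0x81, for example) are undefined in Python's cp1252 codec and cannot be decoded in step 2.

```python
def fix_cp1252_as_latin1(text):
    """Repair text whose cp1252 bytes were mistakenly decoded as latin1.

    Hypothetical helper for illustration; not a library function.
    """
    try:
        # Step 1: recover the original bytes. Step 2: decode them correctly.
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1, or it contains a
        # byte that cp1252 leaves undefined; return the input unchanged.
        return text


title = u'There\x92s thirty days in June'
fixed = fix_cp1252_as_latin1(title)   # text (unicode)
utf8_bytes = fixed.encode('utf-8')    # step 3, only if bytes are needed
```

If the receiving API accepts text, `fixed` can be passed along directly and the final `encode('utf-8')` call can be dropped.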

Author: ninapavlich

Updated on June 16, 2022

Comments

  • ninapavlich almost 2 years

    I am getting text through an API that returns characters with a Windows-encoded (cp1252) apostrophe (\x92):

    > python
    >>> title = u'There\x92s thirty days in June'
    >>> title
    u'There\x92s thirty days in June'
    >>> print title
    Theres thirty days in June
    >>> type(title)
    <type 'unicode'>
    

    I'm trying to convert this string to UTF-8 so that it instead returns: "There’s thirty days in June"

    When I try to decode or encode this unicode string, it throws an error:

    >>> title.decode('cp1252')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
        return codecs.charmap_decode(input,errors,decoding_table)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)
    
    >>> title.encode("cp1252").decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>
    

    If I initialize the string as a byte string (str) instead and then decode it, it works:

    >>> title = 'There\x92s thirty days in June'
    >>> type(title)
    <type 'str'>
    >>> print title.decode('cp1252')
    There’s thirty days in June
    

    My question is: how do I convert the unicode string that I'm getting into a byte string (str) so that I can decode it?