Python Unicode Encode Error

Solution 1

Most likely you parsed the XML fine, and the error occurs when you try to print its contents because they contain non-ASCII Unicode characters. Try encoding your Unicode string as ASCII first:

unicodeData.encode('ascii', 'ignore')

The 'ignore' argument tells encode() to skip any characters it cannot encode. From the Python docs:

>>> # Python 3 (on Python 2, use unichr() instead of chr())
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
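
Applied to the curly apostrophe from the question (U+2019), the same handlers behave roughly as sketched below on Python 3. The sample string is made up for illustration, and 'backslashreplace' is one more standard handler that the excerpt above does not show.

```python
# Illustrative sample text (not taken from the original XML) containing U+2019.
text = "Svoboda\u2019s text"

# Each call returns bytes; only the treatment of U+2019 differs.
ignored = text.encode("ascii", "ignore")            # character is dropped
replaced = text.encode("ascii", "replace")          # character becomes ?
escaped = text.encode("ascii", "backslashreplace")  # character becomes \u2019
as_utf8 = text.encode("utf-8")                      # nothing is lost

for label, data in [("ignore", ignored), ("replace", replaced),
                    ("backslashreplace", escaped), ("utf-8", as_utf8)]:
    print(label, data)
```

Only the UTF-8 result can be decoded back to the original text; the ASCII variants all discard or rewrite the apostrophe.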

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After reading it, you'll stop feeling like you're just guessing which commands to use (or at least that is what happened to me).

Solution 2

A better solution (Python 2, where str holds bytes and unicode holds text):

if isinstance(value, str):
    # Decode the byte string as UTF-8, dropping any bytes that are not
    # valid UTF-8 (for example, broken marker bytes).
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the object has a proper __unicode__() method.
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1
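
On Python 3 there is no unicode() built-in; the corresponding step is bytes.decode(), which accepts the same errors argument. A small sketch with a made-up byte string, assuming the input is meant to be UTF-8:

```python
# Made-up input: valid UTF-8 for "café " followed by one stray 0xff byte.
raw = b"caf\xc3\xa9 \xff!"

dropped = raw.decode("utf-8", errors="ignore")   # stray byte is removed
marked = raw.decode("utf-8", errors="replace")   # stray byte becomes U+FFFD

try:
    raw.decode("utf-8")                          # default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Using "replace" instead of "ignore" at least leaves a visible marker where data was lost.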

Solution 3

Don't hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file or a pipe, you can use the PYTHONIOENCODING environment variable to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is -- your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars -- set LANG to a utf-8 locale if necessary).
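
Whether print() succeeds depends on the encoding of the stream it writes to, not on the text itself. The sketch below simulates that on Python 3 with in-memory streams; the helper name try_print and the sample text are made up for illustration.

```python
import io


def try_print(text, encoding):
    """Print text to an in-memory stream that uses the given encoding.

    Returns the bytes written, or None if the text cannot be encoded.
    try_print is a hypothetical helper for this sketch.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return raw.getvalue()


text = "Svoboda\u2019s text"
print(try_print(text, "ascii"))  # None: an ASCII stream cannot hold U+2019
print(try_print(text, "utf-8"))  # the bytes a UTF-8 stream would receive
```

The same text fails on one stream and works on the other, which is why fixing the environment is preferable to stripping characters in the script.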

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.

Solution 4

Excellent post: http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-
# Python 2 helpers: str holds bytes, unicode holds text.

def __if_number_get_string(number):
    """Return str(number) for ints and floats; return anything else unchanged."""
    if isinstance(number, (int, float)):
        return str(number)
    return number


def get_unicode(strOrUnicode, encoding='utf-8'):
    """Return the value as unicode, dropping bytes that cannot be decoded."""
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    """Return the value as a byte string in the given encoding."""
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode
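
These helpers target Python 2. A rough Python 3 counterpart might look like the sketch below; get_text and get_bytes are hypothetical names, and the number handling is folded into str().

```python
def get_text(value, encoding="utf-8"):
    """Return value as str; undecodable bytes are dropped (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors="ignore")
    # Numbers and other objects are converted with str().
    return str(value)


def get_bytes(value, encoding="utf-8"):
    """Return value as bytes in the given encoding (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    return get_text(value).encode(encoding)


# Text survives a trip through bytes and back when both sides use UTF-8.
roundtrip = get_text(get_bytes("Svoboda\u2019s"))
```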

Author: Alex B

Updated on July 08, 2022

Comments

  • Alex B
    Alex B almost 2 years

    I'm reading and parsing an Amazon XML file, and while the XML file shows a ’ (right single quotation mark), when I try to print it I get the following error:

    'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 
    

    From what I've read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?

    • Nick Heiner
      Nick Heiner almost 14 years
      I was just coming to SO to post this question. Is there an easy way to sanitize a string for unicode()?
    • Ishbir
      Ishbir almost 14 years
      Please check also this answer to a related question: “Python UnicodeDecodeError - Am I misunderstanding encode?”
  • Alex B
    Alex B almost 14 years
    It's already encoded in UTF-8. The error is specifically: myStrings = deque([u'Dorf and Svoboda\u2019s text builds on the str... and Computer Engineering\u2019s subdisciplines.']) The string is in UTF-8 as you can see, but it gets mad about the internal '\u2019'
  • David Z
    David Z almost 14 years
    Oh, OK, I thought you were having a different problem.
  • Nick Heiner
    Nick Heiner almost 14 years
    I'm trying to make the following string safe: ' foo “bar bar” df'(note the curly quotes), but the above still fails for me.
  • sth
    sth almost 14 years
    @Alex B: No, the string is Unicode, not UTF-8. To encode it as UTF-8, use '...'.encode('utf-8')
  • Scott Stafford
    Scott Stafford almost 14 years
    @Rosarch: Fails how? same error? And which error-handling rule did you use?
  • Scott Stafford
    Scott Stafford almost 14 years
    @Rosarch, your problem is probably earlier. Try this code:

        # -*- coding: latin-1 -*-
        u = u' foo “bar bar” df'
        print(u.encode('ascii', 'ignore'))

    In your case, the error was probably thrown while converting your string INTO Unicode, using the encoding you specified for the Python script.
  • Nick Heiner
    Nick Heiner almost 14 years
    I went ahead and made my issue into its own question: stackoverflow.com/questions/3224427/…
  • jfs
    jfs almost 9 years
    It does not help with OP's issue: "can't encode character u'\u2019'". u'\u2019' is already Unicode.
  • jfs
    jfs almost 9 years
    .encode('ascii', 'ignore') loses data unnecessarily even if OP's environment may support non-ascii characters (most cases)
  • Sankalp
    Sankalp over 8 years
    This solved my problem after struggling with different things for one hour.