UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

353,522

Solution 1

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

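You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded byte string into a unicode string. In Python 2, calling .encode on a byte string makes Python first decode it implicitly with the default ASCII codec, which is where the error comes from.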

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If no encoding is given, Python 2 assumes ASCII, which fails on non-ASCII characters.
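To make the decode-then-encode direction concrete, here is a minimal sketch in Python 3 syntax. The sample bytes are a made-up stand-in for one cell of the CSV (the name "PEÑA" stored as Latin-1), not data from the question:

```python
# Hypothetical sample: the text 'PEÑA' as Latin-1 bytes (Ñ is the single byte 0xd1).
raw = b"PE\xd1A"

error_message = None
try:
    # Decoding with the ASCII codec fails on the byte 0xd1.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decode with the encoding the data was actually written in...
text = raw.decode("latin-1")

# ...and only then encode to UTF-8 if bytes are needed again.
utf8_bytes = text.encode("utf-8")

print(error_message)
print(text, utf8_bytes)
```

The rule of thumb is that bytes coming in get decoded once, with the encoding they were really written in, and text going out gets encoded once.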

Solution 2

Just add these lines to your code (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Solution 3

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    contents = f.read()  # or process the file here

It works with Flask too :)
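For a tab-delimited file like the one in the question, the same idea could be sketched with the csv module as follows. The helper name read_rows, the temporary file, and the sample row are assumptions made up for illustration:

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a tab-delimited file into a list of rows, decoding with `encoding`."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t", quotechar='"'))


# Demo: write a small Latin-1 file, then read it back with the matching encoding.
with tempfile.NamedTemporaryFile(
    "w", encoding="latin-1", newline="", suffix=".csv", delete=False
) as tmp:
    tmp.write("PEÑA ELEMENTARY\tSanta Fé\n")
    demo_path = tmp.name

rows = read_rows(demo_path, encoding="latin-1")
os.remove(demo_path)
print(rows)
```

Passing the encoding to open() means every field arrives as text already, so no per-field .decode() calls are needed.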

Solution 4

The main reason for the error is that the default encoding assumed by Python 2 is ASCII. Calling encode('utf8') on a byte string makes Python first decode it implicitly with that default. Hence, if the data contains a character outside the ASCII range, e.g. a string like 'hgvcj터파크387', Python raises the error because the bytes are not valid ASCII.

If you are using Python 2, one workaround is to set the default encoding assumed by Python to UTF-8. Note that this relies on reload(sys) and is generally regarded as a hack, so decode explicitly where you can:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way Python can implicitly decode byte strings that contain characters outside the ASCII range.

However, if you are using Python 3, reload() is no longer a builtin and sys.setdefaultencoding() does not exist, so you have to fix it by decoding explicitly (assuming school_name is a bytes object), e.g.

name = school_name.decode('utf8').encode('utf8')
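In Python 3 only bytes objects have .decode(), so the line above fails if school_name is already a str. A small helper along these lines (the name to_text is hypothetical) can cover both cases:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes are decoded with the given encoding; str values pass through unchanged.
from_utf8 = to_text(b"\xc3\x91")
from_latin1 = to_text(b"\xd1", encoding="latin-1")
already_text = to_text("Ñ")
print(from_utf8, from_latin1, already_text)
```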

Solution 5

My computer had the wrong locale set.

I first did

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don't provide an encoding. The output should be 'UTF-8', but in this case it's some variant of ASCII.

Then I ran the bash command locale and got this output

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, I was using the default Ubuntu locale, which causes Python to open files as ASCII instead of UTF-8. I had to set my locale to en_US.UTF-8

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

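If you can't change the locale system-wide, you can override the encoding of stdin, stdout, and stderr by invoking your Python code like this (note that PYTHONIOENCODING does not change the default used by open(); on Python 3.7 and later, setting PYTHONUTF8=1 enables UTF-8 mode, which does):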

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or do

export PYTHONIOENCODING="UTF-8"

to set it in the shell you run that in.
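To check from inside Python which encodings are in effect, a quick diagnostic along these lines may help. This is a sketch for Python 3.7 or later; the function name encoding_report is made up for illustration:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python currently uses for files, text, and stdout."""
    return {
        # Default for open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Default for str.encode() and bytes.decode().
        "default": sys.getdefaultencoding(),
        # Encoding of standard output (this is what PYTHONIOENCODING overrides).
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when UTF-8 mode is enabled (e.g. via PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If "preferred" shows an ASCII variant such as 'ANSI_X3.4-1968', files opened without an explicit encoding will fail on non-ASCII bytes, as described above.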

Author by jelkimantis

I am an educational researcher currently interested in studying social studies and the influence of digital history on the process of supporting students' ability to acquire historical thinking skills.

Updated on March 23, 2022

Comments

  • jelkimantis
    jelkimantis about 2 years

    I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

    I open the CSV using:

    ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
    

    Then, I attempt to encode it with:

    name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
    

    I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

    Traceback (most recent call last):
      File "push_into_db.py", line 80, in <module>
        main()
      File "push_into_db.py", line 74, in main
        district_map = buildDistrictSchoolMap()
      File "push_into_db.py", line 32, in buildDistrictSchoolMap
        county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
    

    I think I should tell you that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

    You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

  • agf
    agf almost 12 years
    The reason for the error is that Python is trying to automatically decode it from the default encoding, ASCII, so that it can then encode it as he specified, to UTF-8. Since the data isn't valid ASCII, it doesn't work.
  • ch3ka
    ch3ka almost 12 years
    Sure, but if it's UTF-8-encoded data (as I guess), then .decode('utf-8') should do the trick, no?
  • agf
    agf almost 12 years
    Sure, you're probably right. I was just explaining why you get that specific error in this situation.
  • jelkimantis
    jelkimantis almost 12 years
    Perfect! Thank you very much. So it turns out that it was .decode('latin-1') -- this makes sense because it was Ñ that was giving me the problem. Again! Thank you!
  • Vikash Mishra
    Vikash Mishra over 7 years
    Your solution works for some cases, but in case if I use this then I get another error 'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)
  • Yasin
    Yasin almost 7 years
    This is not the case always. The 2nd answer worked for me
  • khelili miliana
    khelili miliana almost 7 years
    what is the difference between your answer and mine
  • Temi Fakunle
    Temi Fakunle almost 7 years
    More detailed. People often find causal details helpful. And your code works btw, no derogation intended.
  • Meow
    Meow over 6 years
    reload is available in Python 3; you would just have to import it: from imp import reload
  • Skrmnghrd
    Skrmnghrd over 6 years
    It's the first time I helped someone through here. Feels good knowing I helped :)
  • skjerns
    skjerns about 6 years
    AttributeError: module 'sys' has no attribute 'setdefaultencoding'. This does not seem to work in Python 3.
  • George Chalhoub
    George Chalhoub about 6 years
    Woot woot! This helped me.
  • Yu Shen
    Yu Shen about 6 years
    It works for my Python 2.7, note, reload(sys) is needed, otherwise, setdefaultencoding would not be accessible.
  • Freedo
    Freedo almost 5 years
    That was the only thing that made it work for me out of many SO questions. Thanks so much!
  • user2194898
    user2194898 over 4 years
    And you helped me too :) All other answers did not work for file reading. Now I need to find out how to fix it for writing as well ;)
  • Skrmnghrd
    Skrmnghrd over 4 years
    can you send me the link of your code? I'll try to help
  • Davide
    Davide almost 4 years
    name 'reload' is not defined
  • Viorel Stoianov
    Viorel Stoianov almost 4 years
    W00t for proselint/tools.py as well.
  • Konst54
    Konst54 almost 4 years
    @Meow but there is no sys.setdefaultencoding in Python 3. So in context of compatibility py2\py3 some check will do, sys.getdefaultencoding() maybe. Would appreciate a piece of advice about that matter. stackoverflow.com/questions/28127513/…
  • zyd
    zyd about 3 years
    @Davide - from importlib import reload
  • Aman Jain
    Aman Jain almost 3 years
    For python3, see the @Skrmnghrd answer
  • Bilguun
    Bilguun almost 3 years
    Thanks! I forgot to include the 'encoding="utf-8"' part!