Reading a UTF8 CSV file with Python

Solution 1

The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
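
For the transcoding case, a minimal sketch of such a wrapper, reusing the unicode_csv_reader defined above (the ISO-8859-1 source encoding is only an assumed example):

def transcode_to_utf8(latin1_data):
    # re-encode each line so the csv module still sees plain UTF-8 byte-strings
    for line in latin1_data:
        yield line.decode('iso-8859-1').encode('utf-8')

reader = unicode_csv_reader(transcode_to_utf8(open('da.csv')))
for field1, field2, field3 in reader:
    print field1, field2, field3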

Solution 2

Python 2.X

There is a unicodecsv library which should solve your problems, with the added benefit of not having to write any new csv-related code.

Here is an example from their readme:

>>> import unicodecsv
>>> from cStringIO import StringIO
>>> f = StringIO()
>>> w = unicodecsv.writer(f, encoding='utf-8')
>>> w.writerow((u'é', u'ñ'))
>>> f.seek(0)
>>> r = unicodecsv.reader(f, encoding='utf-8')
>>> row = r.next()
>>> print row[0], row[1]
é ñ

Python 3.X

In Python 3 this is supported out of the box by the built-in csv module. See this example:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
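
The built-in module also covers the dict-style access that Solution 4 mentions; a minimal sketch, assuming the first row of some.csv holds the column headers:

import csv

# DictReader uses the header row as keys, so each row comes back as a dict
with open('some.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row)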

Solution 3

If you want to read a CSV file with UTF-8 encoding, a minimalistic approach that I recommend is to use something like this:

with open(file_name, encoding="utf8") as csv_file:

With that statement, you can later use a CSV reader to work with the file.
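
For example, a minimal sketch along those lines (some.csv is just a placeholder path, and newline='' is added as the csv docs recommend):

import csv

file_name = 'some.csv'  # placeholder; use your own path
with open(file_name, newline='', encoding="utf8") as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        print(row)  # each row is a list of strings decoded from UTF-8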

Solution 4

Also checkout the answer in this post: https://stackoverflow.com/a/9347871/1338557

It suggests using a library called ucsv.py, a short and simple replacement for the csv module written to address the UTF-8 encoding problem for Python 2.7. It also provides support for csv.DictReader.

Edit: Adding sample code that I used:

import ucsv as csv

# Read CSV file containing the right tags to produce
fileObj = open('awol_title_strings.csv', 'rb')
dictReader = csv.DictReader(fileObj, fieldnames=['titles', 'tags'], delimiter=',', quotechar='"')
# Build a dictionary from the CSV file -> {<title string>: <tags to produce>}
titleStringsDict = dict()
for row in dictReader:
    titleStringsDict.update({unicode(row['titles']): unicode(row['tags'])})

Solution 5

Using codecs.open as Alex Martelli suggested proved to be useful to me.

import codecs

delimiter = ';'
reader = codecs.open("your_filename.csv", 'r', encoding='utf-8')
for line in reader:
    row = line.split(delimiter)
    # do something with your row ...
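# note: line.split(delimiter) ignores quoting, so a quoted field that itself
# contains the delimiter (e.g. "Foo Bar; Baz") will be split incorrectly;
# the csv module's parser is needed for such rows (see the comments below).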
Author: Martin

Updated on December 18, 2021

Comments

  • Martin
    Martin over 2 years

    I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read the CSV file since the csvreader supports only ASCII.

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]
    
    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
    filename = 'output.csv'
    reader = unicode_csv_reader(open(filename))
    try:
        products = []
        for field1, field2, field3 in reader:
            ...
    

    Below is an extract of the CSV file I am trying to read:

    0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
    0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
    0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
    ...
    

    Even though I try to encode/decode to UTF-8, I am still getting the following exception:

    Traceback (most recent call last):
      File ".\Test.py", line 53, in <module>
        for field1, field2, field3 in reader:
      File ".\Test.py", line 40, in unicode_csv_reader
        for row in csv_reader:
      File ".\Test.py", line 46, in utf_8_encoder
        yield line.encode('utf-8', 'ignore')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
    

    How do I fix this?

    • Antti Haapala -- Слава Україні
      Antti Haapala -- Слава Україні about 8 years
      Martin, if you're around, would you consider switching the accepted answer away from Martelli's Python 2-only answer?
  • jb.
    jb. about 11 years
    It wouldn't work with all CSV, following is a valid csv row: "Foo Bar; Baz"; 231; 313; ";;;"; 1;
  • Anentropic
    Anentropic about 10 years
    Does this mean the example in the python docs (where OP copy & pasted from) is wrong? What is the point of the extra encoding step it does if it breaks when you give it a unicode csv?
  • Yaje
    Yaje almost 10 years
    You should put some details of that link in your answer, just in case the link breaks.
  • Atripavan
    Atripavan almost 10 years
    #Downvoter - Not sure why you thought it's of no use. The ucsv library worked just fine for me. It helped resolve the unicode error that I had been struggling with for 2 days. If you were looking for some sample code, here it goes in the edit. @Yaje - I have given some details, plus the sample code, and corrected the link as well, which was earlier pointing to some other post.
  • eis
    eis over 8 years
    I wonder which version of Python this would work in? I get errors with both 2.7 and 3.5: "ValueError: not enough values to unpack (expected 3, got 1)"
  • van
    van over 8 years
    @eis: I can imagine that on your system comma is not a default delimiter. Try to add delimiter=',' instead of dialect=csv.excel.
  • Christophe Roussy
    Christophe Roussy over 7 years
    You import the csv module but do not use it.
  • Codeguy007
    Codeguy007 over 5 years
    Any particular reason you are opening a text file as a binary? 'rb' is for opening binary files.
  • Zvika
    Zvika over 5 years
    Is it possible that this is Python 3 only? It fails for me, in Python 2. It doesn't accept the encoding in open
  • luca76
    luca76 over 4 years
    @Zvika yes, in python 3 this solution works: open('file.csv', 'r', encoding="ISO8859")
  • Jimmy Lee Jones
    Jimmy Lee Jones about 4 years
    I would also add open(file_name, "rt", encoding='utf-8'), that is, open file in "read text" mode
  • Bob Stein
    Bob Stein about 3 years
    encoding='utf-8-sig' helps if your CSV file has a BOM prefix U+FEFF. Opening the file with that encoding will automatically strip the BOM. Otherwise it confuses csv into thinking the first field name starts with the BOM character and it fails to strip the quotes, and so reader.fieldnames[0] can be '\ufeff"Date"' instead of 'Date'.
  • Louis Cottereau
    Louis Cottereau about 3 years
    @JimmyLeeJones 'r' and 'rt' are the same since by default open use "read text" mode