Reading a UTF8 CSV file with Python
Solution 1
The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:
import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
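To make the direction of the two calls concrete, here is a minimal sketch in Python 3 syntax (an assumption on my part; the answer's code targets Python 2, where calling .encode on a byte-string first attempts an implicit ASCII decode, and that is what raises the UnicodeDecodeError). The sample string is made up for illustration.

```python
# Python 3 sketch: encode goes text -> bytes, decode goes bytes -> text.
text = "numérique"                  # a Unicode (text) string
utf8_bytes = text.encode("utf-8")   # bytes; 'é' becomes the two bytes 0xC3 0xA9

# Decoding with the right codec round-trips the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding the same bytes as ASCII fails on the 0xC3 byte, which mirrors
# the failure Python 2 hits when .encode is called on a byte-string.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```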
PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8'); but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
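A rough sketch of that transcoding step follows. The helper name transcode_lines and the sample bytes are hypothetical, chosen only to illustrate the decode-then-encode form; the same expression works on byte-strings in Python 2 and on bytes in Python 3.

```python
def transcode_lines(byte_lines, source_encoding):
    """Yield each byte line re-encoded as UTF-8 (hypothetical helper)."""
    for line in byte_lines:
        yield line.decode(source_encoding).encode("utf-8")

# Made-up sample: "trépied" in ISO-8859-1, where 'é' is the single byte 0xE9.
latin1_lines = [b"SD1200IS,tr\xe9pied\n"]
utf8_lines = list(transcode_lines(latin1_lines, "iso-8859-1"))
```

After transcoding, 'é' is stored as the two UTF-8 bytes 0xC3 0xA9.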
Solution 2
Python 2.X
There is a unicode-csv library which should solve your problems, with the added benefit of not having to write any new csv-related code.
Here is an example from their readme:
>>> import unicodecsv
>>> from cStringIO import StringIO
>>> f = StringIO()
>>> w = unicodecsv.writer(f, encoding='utf-8')
>>> w.writerow((u'é', u'ñ'))
>>> f.seek(0)
>>> r = unicodecsv.reader(f, encoding='utf-8')
>>> row = r.next()
>>> print row[0], row[1]
é ñ
Python 3.X
In Python 3 this is supported out of the box by the built-in csv module. See this example:
import csv

with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Solution 3
If you want to read a CSV file encoded as UTF-8, a minimal approach that I recommend is to open it like this:
with open(file_name, encoding="utf8") as csv_file:
Inside that with block, you can then pass csv_file to a CSV reader and work with the rows.
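As a hedged illustration of the whole flow in Python 3, the sketch below first writes a small made-up sample file (the file name and contents are assumptions, included only so the example runs on its own) and then reads it back with csv.reader.

```python
import csv
import os
import tempfile

# Write a small UTF-8 sample file so the example is self-contained.
# The file name and contents are made up for illustration.
sample = "id,label\n1,numérique\n2,trépied\n"
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "sample.csv")
with open(file_name, "w", encoding="utf-8", newline="") as out:
    out.write(sample)

# Open with an explicit encoding; newline="" lets the csv module
# handle line endings itself, as the csv documentation recommends.
with open(file_name, encoding="utf-8", newline="") as csv_file:
    rows = list(csv.reader(csv_file))

# Clean up the temporary file and directory.
os.remove(file_name)
os.rmdir(tmp_dir)
```

Each row comes back as a list of str objects, so the accented characters need no further decoding.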
Solution 4
Also check out the answer in this post: https://stackoverflow.com/a/9347871/1338557
It suggests using a library called ucsv.py, a short and simple replacement for the csv module written to address the UTF-8 encoding problem in Python 2.7. It also provides support for csv.DictReader.
Edit: Adding the sample code that I used:
import ucsv as csv

# Read the CSV file containing the right tags to produce
fileObj = open('awol_title_strings.csv', 'rb')
dictReader = csv.DictReader(fileObj, fieldnames=['titles', 'tags'], delimiter=',', quotechar='"')
# Build a dictionary from the CSV file -> {<string>: <tags to produce>}
titleStringsDict = dict()
for row in dictReader:
    titleStringsDict.update({unicode(row['titles']): unicode(row['tags'])})
Solution 5
Using codecs.open as Alex Martelli suggested proved to be useful to me.
import codecs

delimiter = ';'
reader = codecs.open("your_filename.csv", 'r', encoding='utf-8')
for line in reader:
    row = line.split(delimiter)
    # do something with your row ...
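Splitting on the delimiter is only safe when no field is quoted. A minimal sketch of the difference, written for Python 3 and using a made-up sample line (not taken from any real file), compares a plain split with csv.reader configured for the same delimiter:

```python
import csv
import io

# Made-up sample line with quoted fields that contain the delimiter.
line = '"Foo Bar; Baz";231;313;";;;";1\n'

# Naive splitting cuts inside the quoted fields.
naive = line.rstrip("\n").split(";")

# csv.reader respects the quoting rules and keeps quoted fields whole.
parsed = next(csv.reader(io.StringIO(line), delimiter=";"))
```

Here the plain split produces nine fragments, while csv.reader returns the five intended fields.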
Martin
Updated on December 18, 2021
Comments
-
Martin over 2 years
I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read the CSV file since the csvreader supports only ASCII.
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data), dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'output.csv'
reader = unicode_csv_reader(open(filename))
try:
    products = []
    for field1, field2, field3 in reader:
        ...
Below is an extract of the CSV file I am trying to read:
0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
...
Even though I try to encode/decode to UTF-8, I am still getting the following exception:
Traceback (most recent call last):
  File ".\Test.py", line 53, in <module>
    for field1, field2, field3 in reader:
  File ".\Test.py", line 40, in unicode_csv_reader
    for row in csv_reader:
  File ".\Test.py", line 46, in utf_8_encoder
    yield line.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
How do I fix this?
-
Antti Haapala -- Слава Україні about 8 years
Martin, if you're around, would you consider switching the accepted answer from Martelli's Python 2 only answer.
-
jb. about 11 years
It wouldn't work with all CSV, following is a valid csv row: "Foo Bar; Baz"; 231; 313; ";;;"; 1;
-
Anentropic about 10 years
Does this mean the example in the python docs (where OP copy & pasted from) is wrong? What is the point of the extra encoding step it does if it breaks when you give it a unicode csv?
-
Yaje almost 10 years
You should put some details of that link in your answer, just in case the link breaks.
-
Atripavan almost 10 years
#Downvoter- Not sure why you thought it's of no use. The ucsv library worked just fine for me. It helped resolve the Unicode error that I had been struggling with for 2 days. If you were looking for some sample code, here it is in the edit. @Yaje- I have given some details and the sample code, and corrected the link as well, which was earlier pointing to some other post.
-
eis over 8 years
I wonder which version of python would this work in? I get errors with both 2.7 and 3.5. "ValueError: not enough values to unpack (expected 3, got 1)"
-
van over 8 years
@eis: I can imagine that on your system comma is not a default delimiter. Try to add delimiter=',' instead of dialect=csv.excel.
-
Christophe Roussy over 7 years
You import the csv module but do not use it.
-
Codeguy007 over 5 years
Any particular reason you are opening a text file as a binary? 'rb' is for opening binary files.
-
Zvika over 5 years
Is it possible that this is Python 3 only? It fails for me, in Python 2. It doesn't accept the encoding in open
-
luca76 over 4 years
@Zvika yes, in python 3 this solution works: open('file.csv', 'r', encoding="ISO8859")
-
Jimmy Lee Jones about 4 years
I would also add open(file_name, "rt", encoding='utf-8'), that is, open file in "read text" mode
-
Bob Stein about 3 years
encoding='utf-8-sig' helps if your CSV file has a BOM prefix U+FEFF. Opening the file with that encoding will automatically strip the BOM. Otherwise it confuses csv into thinking the first field name starts with the BOM character and it fails to strip the quotes, and so reader.fieldnames[0] can be '\ufeff"Date"' instead of 'Date'.
-
Louis Cottereau about 3 years
@JimmyLeeJones 'r' and 'rt' are the same, since by default open uses "read text" mode
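
The BOM behaviour described in the comments can be reproduced with a short Python 3 sketch. The sample bytes and the helper name read_fieldnames are made up for illustration; the only point is the contrast between decoding as 'utf-8' and as 'utf-8-sig'.

```python
import csv
import io

# Made-up CSV bytes that start with a UTF-8 BOM (EF BB BF).
data = b'\xef\xbb\xbf"Date","Value"\r\n"2021-12-18","1"\r\n'

def read_fieldnames(raw, encoding):
    """Decode raw bytes with the given encoding and return the header names."""
    text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return csv.DictReader(text_stream).fieldnames

with_bom = read_fieldnames(data, "utf-8")         # BOM stays glued to the first name
without_bom = read_fieldnames(data, "utf-8-sig")  # BOM is stripped while decoding
```

With plain 'utf-8' the first header name begins with the U+FEFF character, whereas 'utf-8-sig' yields the clean names.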