Reading a UTF8 CSV file with Python

Solution 1

The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
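
For the transcoding case, a minimal sketch of such a wrapper, reusing the unicode_csv_reader defined above (the ISO-8859-1 source encoding is only an assumed example):

def transcode_to_utf8(latin1_data):
    # re-encode each line so the csv module still sees plain UTF-8 byte-strings
    for line in latin1_data:
        yield line.decode('iso-8859-1').encode('utf-8')

reader = unicode_csv_reader(transcode_to_utf8(open('da.csv')))
for field1, field2, field3 in reader:
    print field1, field2, field3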

Solution 2

Python 2.X

There is a unicodecsv library which should solve your problems, with the added benefit of not having to write any new csv-related code.

Here is an example from their readme:

>>> import unicodecsv
>>> from cStringIO import StringIO
>>> f = StringIO()
>>> w = unicodecsv.writer(f, encoding='utf-8')
>>> w.writerow((u'é', u'ñ'))
>>> f.seek(0)
>>> r = unicodecsv.reader(f, encoding='utf-8')
>>> row = r.next()
>>> print row[0], row[1]
é ñ

Python 3.X

In Python 3 this is supported out of the box by the built-in csv module. See this example:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
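
The built-in module also covers the dict-style access that Solution 4 mentions; a minimal sketch, assuming the first row of some.csv holds the column headers:

import csv

# DictReader uses the header row as keys, so each row comes back as a dict
with open('some.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row)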

Solution 3

If you want to read a CSV file with UTF-8 encoding, a minimalistic approach that I recommend is to use something like this:

with open(file_name, encoding="utf8") as csv_file:

With that statement, you can later use a CSV reader to work with the file.
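
For example, a minimal sketch along those lines (some.csv is just a placeholder path, and newline='' is added as the csv docs recommend):

import csv

file_name = 'some.csv'  # placeholder; use your own path
with open(file_name, newline='', encoding="utf8") as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        print(row)  # each row is a list of strings decoded from UTF-8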

Solution 4

Also checkout the answer in this post: https://stackoverflow.com/a/9347871/1338557

It suggests using a library called ucsv.py, a short and simple replacement for the csv module written to address the UTF-8 encoding problem for Python 2.7. It also provides support for csv.DictReader.

Edit: Adding sample code that I used:

import ucsv as csv

# Read CSV file containing the right tags to produce
fileObj = open('awol_title_strings.csv', 'rb')
dictReader = csv.DictReader(fileObj, fieldnames=['titles', 'tags'], delimiter=',', quotechar='"')
# Build a dictionary from the CSV file -> {<title string>: <tags to produce>}
titleStringsDict = dict()
for row in dictReader:
    titleStringsDict.update({unicode(row['titles']): unicode(row['tags'])})

Solution 5

Using codecs.open as Alex Martelli suggested proved to be useful to me.

import codecs

delimiter = ';'
reader = codecs.open("your_filename.csv", 'r', encoding='utf-8')
for line in reader:
    row = line.split(delimiter)
    # do something with your row ...
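# note: line.split(delimiter) ignores quoting, so a quoted field that itself
# contains the delimiter (e.g. "Foo Bar; Baz") will be split incorrectly;
# the csv module's parser is needed for such rows (see the comments below).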
Author: Martin

Updated on December 18, 2021

Comments

  • Martin
    Martin over 2 years

    I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read the CSV file since the csvreader supports only ASCII.

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]
    
    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
    filename = 'output.csv'
    reader = unicode_csv_reader(open(filename))
    try:
        products = []
        for field1, field2, field3 in reader:
            ...
    

    Below is an extract of the CSV file I am trying to read:

    0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
    0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
    0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
    ...
    

    Even though I try to encode/decode to UTF-8, I am still getting the following exception:

    Traceback (most recent call last):
      File ".\Test.py", line 53, in <module>
        for field1, field2, field3 in reader:
      File ".\Test.py", line 40, in unicode_csv_reader
        for row in csv_reader:
      File ".\Test.py", line 46, in utf_8_encoder
        yield line.encode('utf-8', 'ignore')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
    

    How do I fix this?

    • Antti Haapala -- Слава Україні
      Antti Haapala -- Слава Україні about 8 years
      Martin, if you're around, would you consider switching the accepted answer away from Martelli's Python 2-only answer?
  • jb.
    jb. about 11 years
    It wouldn't work with all CSV, following is a valid csv row: "Foo Bar; Baz"; 231; 313; ";;;"; 1;
  • Anentropic
    Anentropic about 10 years
    Does this mean the example in the python docs (where OP copy & pasted from) is wrong? What is the point of the extra encoding step it does if it breaks when you give it a unicode csv?
  • Yaje
    Yaje almost 10 years
    You should put some details of that link in your answer, just in case the link breaks.
  • Atripavan
    Atripavan almost 10 years
    #Downvoter - Not sure why you thought it's of no use. The ucsv library worked just fine for me. It helped resolve the unicode error that I had been struggling with for 2 days. If you were looking for some sample code, here it goes in the edit. @Yaje - I have given some details, plus the sample code, and corrected the link as well, which was earlier pointing to some other post.
  • eis
    eis over 8 years
    I wonder which version of Python this would work in? I get errors with both 2.7 and 3.5: "ValueError: not enough values to unpack (expected 3, got 1)"
  • van
    van over 8 years
    @eis: I can imagine that on your system comma is not a default delimiter. Try to add delimiter=',' instead of dialect=csv.excel.
  • Christophe Roussy
    Christophe Roussy over 7 years
    You import the csv module but do not use it.
  • Codeguy007
    Codeguy007 over 5 years
    Any particular reason you are opening a text file as a binary? 'rb' is for opening binary files.
  • Zvika
    Zvika over 5 years
    Is it possible that this is Python 3 only? It fails for me, in Python 2. It doesn't accept the encoding in open
  • luca76
    luca76 over 4 years
    @Zvika yes, in python 3 this solution works: open('file.csv', 'r', encoding="ISO8859")
  • Jimmy Lee Jones
    Jimmy Lee Jones about 4 years
    I would also add open(file_name, "rt", encoding='utf-8'), that is, open file in "read text" mode
  • Bob Stein
    Bob Stein about 3 years
    encoding='utf-8-sig' helps if your CSV file has a BOM prefix U+FEFF. Opening the file with that encoding will automatically strip the BOM. Otherwise it confuses csv into thinking the first field name starts with the BOM character and it fails to strip the quotes, and so reader.fieldnames[0] can be '\ufeff"Date"' instead of 'Date'.
  • Louis Cottereau
    Louis Cottereau about 3 years
    @JimmyLeeJones 'r' and 'rt' are the same since by default open use "read text" mode