Reading Russian language data from CSV

python csv unicode python-2.x python-unicode

Solution 1

\xea is the windows-1251 / cp5347 encoding for к (Python's codec name for it is cp1251). Therefore, you need to use windows-1251 decoding, not UTF-8.
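As a quick check of that claim, here is a minimal sketch in Python 3 syntax (the answer itself targets Python 2). The bytes are copied from the start of the question's output; the variable names are only illustrative.

```python
# First bytes of the question's output: "2-" followed by nine cp1251 bytes
raw = b"2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff"

# Decoding as windows-1251 (codec name "cp1251") gives readable Russian text
decoded = raw.decode("cp1251")
print(decoded)

# Decoding the same bytes as UTF-8 fails, as in the question's traceback
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("UTF-8 failed:", exc)
```

The first print shows the opening word of the question's first row, which suggests the file really is windows-1251.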

In Python 2.7, the csv library does not support Unicode properly; see "Unicode" in https://docs.python.org/2/library/csv.html

They propose a simple workaround using:

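import codecs
import csv

# UTF8Recoder comes from the same section of the Python 2 csv docs
class UTF8Recoder:
    """Iterator that reads an encoded stream and reencodes the input to UTF-8"""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self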

This would allow you to do:

def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # if you really need a list, uncomment the next line;
    # it lets you access a specific row by index, e.g. `line_12 = lines[12]`
    # return list(lines)

    # this returns an iterator, so rows are read from the file lazily as you loop
    # use this if you only need a `for line in lines` loop
    return lines
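For readers on Python 3, where open() accepts an encoding and the csv module works with text directly, a rough equivalent might look like the sketch below. The name load_csv, its default arguments, and the temporary sample file are assumptions made for illustration; the sample row is an abbreviated version of the question's data.

```python
import csv
import os
import tempfile

def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Return the rows of the file as a list of lists of str (hypothetical helper)."""
    # newline="" lets the csv module handle line endings itself
    with open(filename, encoding=encoding, newline="") as source_file:
        return list(csv.reader(source_file, delimiter=delimiter))

# Demo: write one cp1251-encoded row to a throwaway file, then load it back
sample = "2-комнатная квартира;Алматы\r\n".encode("cp1251")
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

rows = load_csv(tmp.name)
os.remove(tmp.name)
print(rows[0][1])
```

Returning a list gives the two-dimensional structure the question asks for; for large files, iterating over csv.reader inside the with block avoids holding every row in memory.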

If you try to print dataset, you'll get a representation of a list within a list, where the outer list holds the rows and each inner list holds the columns of one row. Any encoded bytes or Unicode characters will be shown as \x or \u escapes. To print the values, do:

for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
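The difference between an escaped representation and the text itself can be reproduced with a small sketch. It uses Python 3, where printing a list no longer escapes non-ASCII text, so the built-in ascii() stands in for what Python 2 shows when a whole list is printed. The row contents are illustrative, with the first value taken from the escaped string quoted in the comments.

```python
# A decoded row: the first value is written with \u escapes, the second as plain text
row = ["2-\u043a\u043e\u043c\u043d\u0430\u0442", "Алматы"]

# ascii() escapes every non-ASCII character, similar to printing a list in Python 2
escaped_view = ascii(row)
print(escaped_view)

# Joining the values yields one ordinary string, which prints as readable text
readable_view = ", ".join(row)
print(readable_view)
```

Both views describe the same data; the escapes are only a display form and do not indicate a decoding problem.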

If you need to write your results to another file (fairly typical), you could do:

import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        # write one output line per CSV row
        my_output.write(u", ".join(csv_line) + u"\n")

This will automatically convert/encode your output to UTF-8.
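The same read-then-write step can be sketched in Python 3 without the UnicodeReader helper. The function name write_rows_as_utf8, the file names, and the sample row are hypothetical; only the cp1251 input encoding, the ";" delimiter, and the ", " join come from the discussion above.

```python
import csv
import os
import shutil
import tempfile

def write_rows_as_utf8(src_path, dst_path, src_encoding="cp1251"):
    """Read a ;-delimited file in src_encoding and write joined rows as UTF-8."""
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src, delimiter=";"):
            dst.write(", ".join(row) + "\n")

# Demo with a throwaway directory and a single cp1251-encoded row
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input_cp1251.csv")
dst_path = os.path.join(workdir, "output_utf8.txt")
with open(src_path, "wb") as f:
    f.write("мкр Тастак-3;Алматы\n".encode("cp1251"))

write_rows_as_utf8(src_path, dst_path)
with open(dst_path, encoding="utf-8") as f:
    output_text = f.read()
shutil.rmtree(workdir)
print(output_text)
```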

Solution 2

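You can try:

import pandas as pd
pd.read_csv(path_file, encoding="cp1251")

or, in Python 3:

import csv
with open(path_file, encoding="cp1251", errors="ignore", newline="") as source_file:
    # use delimiter=";" for the file in the question
    reader = csv.reader(source_file, delimiter=",")
    rows = list(reader)  # consume the reader while the file is still open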

Solution 3

Could your .csv be in an encoding other than UTF-8? (Judging by the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8-R.
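One way to try several candidates is sketched below in Python 3. The helper decode_with_each is hypothetical, and the sample bytes are copied from the question's output. Single-byte code pages such as these rarely raise an error on wrong input, so the decoded text has to be inspected by eye rather than relying on exceptions.

```python
# Sample bytes from the question's output
sample = b"2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff"
candidates = ["utf-8", "cp1251", "cp866", "koi8_r"]

def decode_with_each(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

results = decode_with_each(sample, candidates)
for name, text in results.items():
    # Only one candidate should produce a real Russian word
    print(name, "->", text)
```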

Author by

Erba Aitbayev

IT specialist.

Updated on July 05, 2022

Comments

Erba Aitbayev almost 2 years

I have some data in a CSV file that is in Russian:

2-комнатная квартира РДТ',  мкр Тастак-3,  Аносова — Толе би;Алматы
2-комнатная квартира БГР',  мкр Таугуль,  Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ',  мкр Тастак-2,  Тлендиева — Райымбека;Алматы

The delimiter is the ; symbol.

I want to read the data and put it into an array. I tried to read this data using this code:

def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"),delimiter=";" )
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset

Then I read and print the result:

mydata = loadCsv('krish(csv3).csv')
print mydata

Output:

[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3,  \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc,  \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2,  \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]

I found that in this case codecs are required and tried to do the same with this code:

import codecs
with codecs.open('krish(csv3).csv','r',encoding='utf8') as f:
    text = f.read()
print text

I got this error:

newchars, decodedbytes = self.decode(data, self.errors)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte

What is the problem? When using codecs, how do I specify the delimiter for my data? I just want to read the data from the file and put it in a 2-dimensional array.

jfs over 8 years

return ([f.decode('cp1251') if isinstance(f, bytes) else f for f in row] for row in csv.reader(open(filename, "rb"),delimiter=";"))

Alastair McCormack over 8 years

Sorry, see latest edit. Your code should use UnicodeReader()

Erba Aitbayev over 8 years

I have included UnicodeReader and UTF8Recoder in my code and tried to use loadCsv(). But the data in the dataset variable looks like this: u"2-\u043a\u043e\u043c\u043d\u0430\u0442. Am I doing something wrong?

Alastair McCormack over 8 years

No, that's fine. It's because you're printing the whole line, which is a list, so you get a "representation". What you're seeing is Unicode literals, which means your data has been correctly decoded. This is a good thing! :) Try doing print line[0], which will encode the Unicode values to your console's locale

Alastair McCormack over 8 years

I've added some code to show how to iterate and join your results

Alastair McCormack over 8 years

Ah, yes. I see why that would happen. I've updated the loadCsv method to return something.

data_runner almost 2 years

Spent 3+ hours on this. This was helpful; however, I just read directly from the CSV like this, using the Windows encoding as was suggested: pd.read_csv('data.csv', sep=';', encoding='windows-1251')