Reading Russian language data from CSV
Solution 1
\xea is the windows-1251 / cp5347 encoding for к (the Python codec name is cp1251, with windows-1251 as an alias). Therefore, you need to decode with windows-1251, not UTF-8.
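A quick way to check this, sketched in Python 3 (where bytes and text are separate types). The sample bytes are copied from the start of the output in the question; the variable names are only illustrative.

```python
# First bytes of the question's output: "2-" followed by cp1251-encoded text.
raw = b"2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff"

# Decoding with the matching codec recovers the Russian text.
text = raw.decode("cp1251")
print(text)

# A strict UTF-8 decode fails: 0xEA announces a multi-byte sequence,
# but the byte that follows is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)
```

The failure is reported at position 2, the same position as in the question's traceback.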
In Python 2.7, the csv library does not support Unicode properly; see "Unicode" in https://docs.python.org/2/library/csv.html
They propose a simple workaround using:
import csv

# UTF8Recoder is the helper class defined next to UnicodeReader in the same docs example.
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you really need a list, uncomment the next line;
    # it lets you access a row by index, e.g. `line_12 = lines[12]`.
    # return list(lines)
    # Otherwise return the iterator, which reads rows from the file lazily;
    # use this if you only loop over the rows once, e.g. `for row in lines`.
    return lines
If you print the whole dataset, you'll get a representation of a list within a list, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or Unicode characters will be shown as \x or \u escapes. To print the values themselves, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
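The difference between a representation and the printed value can be reproduced in Python 3 as well. This is only a sketch: ascii() is used here to imitate the escaped form that Python 2 shows when it prints a list, and the sample string is the one quoted in the comments.

```python
row = ["2-\u043a\u043e\u043c\u043d\u0430\u0442"]

# ascii() gives an escaped representation, similar to what Python 2
# displays when a whole list of unicode strings is printed.
escaped = ascii(row)
print(escaped)

# Joining the strings and printing the result shows the characters themselves.
joined = ", ".join(row)
print(joined)
```

Both forms describe the same decoded text; the escapes are just a display choice.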
If you need to write your results to another file (fairly typical), you could do:
import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        # write one output line per CSV row
        my_output.write(u", ".join(csv_line) + u"\n")
This will automatically convert/encode your output to UTF-8.
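On Python 3 the csv module works with text directly, so the same conversion needs no helper classes. A minimal sketch, assuming a semicolon-delimited windows-1251 input; the function name transcode_csv and its parameters are made up for illustration.

```python
import csv

def transcode_csv(src_path, dst_path, src_encoding="cp1251", delimiter=";"):
    """Copy a CSV file, re-encoding it from src_encoding to UTF-8."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, encoding=src_encoding, newline="") as src:
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            writer = csv.writer(dst, delimiter=delimiter)
            for row in csv.reader(src, delimiter=delimiter):
                writer.writerow(row)
```

For example, transcode_csv("myfile.csv", "myfile_utf8.csv") would write a UTF-8 copy with the same rows.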
Solution 2
You can try:
import pandas as pd
# pass the encoding by keyword; the file in the question is semicolon-delimited
pd.read_csv(path_file, sep=";", encoding="cp1251")
or
import csv
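# newline="" is what the csv docs recommend when opening files for csv.reader
with open(path_file, encoding="cp1251", errors='ignore', newline="") as source_file:
    # the file in the question uses ";" as its delimiter
    reader = csv.reader(source_file, delimiter=";")
    for row in reader:
        print(row)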
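If the goal is the 2-dimensional array the question asks for, the rows can be collected into a list. A minimal Python 3 sketch; load_csv and its defaults are illustrative names, not part of any library. It keeps the default strict error handling, so a wrong encoding raises an error rather than silently dropping bytes as errors='ignore' would.

```python
import csv

def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Return the file's rows as a list of lists of str."""
    # newline="" lets the csv module handle line endings itself.
    with open(filename, encoding=encoding, newline="") as source_file:
        return list(csv.reader(source_file, delimiter=delimiter))
```

With the file from the question, load_csv('krish(csv3).csv') should return one inner list per line, each holding the two fields separated by ";".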
Solution 3
Could your .csv be in an encoding other than UTF-8? (Judging by the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866 or KOI8-R.
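One way to try several encodings is to decode the same bytes with each candidate and compare the results by eye. This is a rough Python 3 sketch; decode_candidates is a hypothetical helper, and the sample bytes are taken from the output in the question. Single-byte codecs such as cp866 and koi8_r rarely raise an error, so a clean decode does not prove the guess is right; you still have to check which result reads as Russian.

```python
def decode_candidates(raw, encodings=("utf-8", "cp1251", "cp866", "koi8_r")):
    """Decode raw bytes with each candidate codec; None marks a failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Sample bytes: the second word of the first row in the question's output.
sample = b"\xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0"
for name, text in decode_candidates(sample).items():
    print(name, "->", text)
```

Only the cp1251 result is a readable Russian word here; UTF-8 fails outright and the other two produce unrelated Cyrillic characters.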
Comments
-
Erba Aitbayev almost 2 years
I have some data in a CSV file that is in Russian:
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
The delimiter is the ; symbol.
I want to read the data and put it into an array. I tried to read it using this code:
def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
Then I read and print the result:
mydata = loadCsv('krish(csv3).csv')
print mydata
Output:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I found that in this case codecs are required and tried to do the same with this code:
import codecs
with codecs.open('krish(csv3).csv', 'r', encoding='utf8') as f:
    text = f.read()
print text
I got this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs, how do I specify the delimiter in my data? I just want to read the data from the file and put it in a 2-dimensional array.
-
jfs over 8 years
return ([f.decode('cp1251') if isinstance(f, bytes) else f for f in row] for row in csv.reader(open(filename, "rb"), delimiter=";"))
-
Alastair McCormack over 8 years
Sorry, see latest edit. Your code should use UnicodeReader()
-
Erba Aitbayev over 8 years
I have included UnicodeReader and UTF8Recoder in my code and tried to use loadCsv(). But the data in the dataset variable looks like this: u"2-\u043a\u043e\u043c\u043d\u0430\u0442. Am I doing something wrong?
-
Alastair McCormack over 8 years
No, that's fine. It's because you're printing the whole line, which is a list, so you get a "representation". What you're seeing is Unicode literals, which means your data has been correctly decoded. This is a good thing! :) Try doing print line[0], which will encode the Unicode values to your console's locale.
-
Alastair McCormack over 8 years
I've added some code to show how to iterate and join your results.
-
Alastair McCormack over 8 years
Ah, yes. I see why that would happen. I've updated the loadCsv method to return something.
-
data_runner almost 2 years
I spent 3+ hours on this. This was helpful; however, I just read directly from the CSV like this, using the Windows encoding as suggested: pd.read_csv('data.csv', sep=';', encoding='windows-1251')