Reading Russian-language data from CSV

Solution 1

\xea is the windows-1251 / cp5347 encoding for к (Python calls this codec cp1251). Therefore, you need to decode the file as windows-1251, not UTF-8.
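
A quick way to check this claim, as a minimal Python 3 sketch (the question itself uses Python 2.7, and the variable names here are only illustrative):

```python
# Byte 0xEA decodes to the Cyrillic letter "к" under windows-1251 (cp1251).
raw = b"\xea\xe2"  # first two bytes of "квартира" in the question's output

decoded = raw.decode("cp1251")
print(decoded)

# The same bytes are not valid UTF-8: 0xEA announces a three-byte sequence,
# but the following 0xE2 is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc.reason)
```

This matches the "invalid continuation byte" error reported in the question.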

In Python 2.7, the csv module does not support Unicode input properly; see "Unicode" in https://docs.python.org/2/library/csv.html

The documentation proposes a simple workaround:

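import codecs
import csv

# UTF8Recoder is the companion helper from the same section of the csv docs.
class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes it to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader: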
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

This would allow you to do:

def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you need a real list (for example, to index rows with
    # `line_12 = lines[12]`), uncomment the next line:
    # return list(lines)

    # Otherwise return the iterator, which reads the file lazily as you
    # loop over it with `for row in lines`.
    return lines

If you print the whole dataset, you'll get the representation of a list within a list, where the outer list holds the rows and each inner list holds the columns. Any non-ASCII bytes or Unicode characters are shown as \x or \u escapes. To print the actual values, do:

for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
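
The difference between the escaped representation and the printed value can be reproduced in Python 3 as well. This is only an illustrative sketch: Python 3's own `repr` shows Cyrillic directly, so `ascii()` is used here to mimic the escaped form that Python 2 prints for a list, and `row` is made-up sample data.

```python
row = ["2-комнатная квартира", "Алматы"]

# ascii() escapes every non-ASCII character, much like Python 2's repr of a
# list of unicode strings: the data is intact, only its display is escaped.
escaped = ascii(row)
print(escaped)

# Joining the values into one string and printing it shows readable text.
joined = ", ".join(row)
print(joined)
```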

If you need to write your results to another file (fairly typical), you could do:

import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")

This will automatically convert/encode your output to UTF-8.
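
On Python 3 the same conversion needs no helper classes, because `open()` decodes and encodes for you. A minimal sketch, assuming the input is windows-1251 and semicolon-delimited as in the question; `recode_csv` and the file names are hypothetical:

```python
import csv
import os
import tempfile


def recode_csv(src_path, dst_path, src_encoding="cp1251", delimiter=";"):
    """Copy a CSV file row by row, re-encoding it to UTF-8 (hypothetical helper)."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for row in csv.reader(src, delimiter=delimiter):
            writer.writerow(row)


# Usage with a throwaway file holding one windows-1251 encoded row.
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "input_cp1251.csv")
dst_file = os.path.join(tmp_dir, "output_utf8.csv")
with open(src_file, "wb") as f:
    f.write("2-комнатная квартира;Алматы\r\n".encode("cp1251"))

recode_csv(src_file, dst_file)
with open(dst_file, encoding="utf-8", newline="") as f:
    result = f.read()
print(result)
```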

Solution 2

You can try:

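import pandas as pd
# The second positional argument of read_csv is the separator,
# so the encoding must be passed by keyword.
df = pd.read_csv(path_file, sep=";", encoding="cp1251")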

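or, on Python 3, where open() accepts an encoding:

import csv
# errors='ignore' silently drops any bytes that cannot be decoded
with open(path_file, encoding="cp1251", errors='ignore', newline="") as source_file:
    reader = csv.reader(source_file, delimiter=";")
    rows = list(reader)  # read the rows while the file is still open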
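
To get the 2-dimensional array the question asks for, the Python 3 variant can be wrapped in a function. This is a sketch under the same assumptions (windows-1251, `;` delimiter); `load_csv` and the sample rows are illustrative, not part of the original answer:

```python
import csv
import os
import tempfile


def load_csv(path, encoding="cp1251", delimiter=";"):
    """Return the file as a list of rows, each row a list of str."""
    with open(path, encoding=encoding, newline="") as source_file:
        return list(csv.reader(source_file, delimiter=delimiter))


# Build a small windows-1251 sample so the sketch is self-contained.
sample = "мкр Тастак-3;Алматы\r\nмкр Таугуль;Алматы\r\n"
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("cp1251"))

dataset = load_csv(sample_path)
os.remove(sample_path)
print(dataset[0][0])
```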

Solution 3

Could your .csv be in an encoding other than UTF-8? Judging by the error message, it must be. Try other Cyrillic encodings such as Windows-1251, CP866 or KOI8.
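
One way to try several candidates, sketched in Python 3. The helper name `decode_candidates` and the candidate list are assumptions for illustration. Note that single-byte encodings such as CP866 or KOI8-R rarely raise an error, so the decoded text still has to be checked by eye:

```python
def decode_candidates(raw, encodings=("utf-8", "cp1251", "cp866", "koi8_r")):
    """Decode raw bytes with each candidate; None marks a failed decode."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


raw = b"\xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0"  # bytes taken from the question's output
results = decode_candidates(raw)
for name, text in results.items():
    print(name, "->", text)
```

Only the cp1251 result reads as a Russian word here; the other single-byte codecs decode without error but produce gibberish.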

Author by

Erba Aitbayev

IT specialist.

Updated on July 05, 2022

Comments

  • Erba Aitbayev almost 2 years

    I have some data in a CSV file that is in Russian:

    2-комнатная квартира РДТ',  мкр Тастак-3,  Аносова — Толе би;Алматы
    2-комнатная квартира БГР',  мкр Таугуль,  Дулати (Навои) — Токтабаева;Алматы
    2-комнатная квартира ЦФМ',  мкр Тастак-2,  Тлендиева — Райымбека;Алматы
    

    The delimiter is the ; symbol.

    I want to read the data and put it into an array. I tried to read the data using this code:

    def loadCsv(filename):
        lines = csv.reader(open(filename, "rb"),delimiter=";" )
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [str(x) for x in dataset[i]]
        return dataset
    

    Then I read and print the result:

    mydata = loadCsv('krish(csv3).csv')
    print mydata
    

    Output:

    [['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3,  \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc,  \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2,  \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
    

    I read that the codecs module is required in this case, and tried to do the same with this code:

    import codecs
    with codecs.open('krish(csv3).csv','r',encoding='utf8') as f:
        text = f.read()
    print text
    

    I got this error:

    newchars, decodedbytes = self.decode(data, self.errors)
    
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
    

    What is the problem? When using codecs, how do I specify the delimiter in my data? I just want to read the data from the file and put it in a 2-dimensional array.

    • jfs over 8 years
      return ([f.decode('cp1251') if isinstance(f, bytes) else f for f in row] for row in csv.reader(open(filename, "rb"), delimiter=";"))
  • Alastair McCormack over 8 years
    Sorry, see latest edit. Your code should use UnicodeReader()
  • Erba Aitbayev over 8 years
    I have included UnicodeReader and UTF8Recoder in my code and tried to use loadCsv(). But the data in the dataset variable looks like this: u"2-\u043a\u043e\u043c\u043d\u0430\u0442. Am I doing something wrong?
  • Alastair McCormack over 8 years
    No, that's fine. It's because you're printing the whole line, which is a list, so you get a "representation". What you're seeing is Unicode literals, which means your data has been correctly decoded. This is a good thing! :) Try doing print line[0], which will encode the Unicode values to your console's locale
  • Alastair McCormack over 8 years
    I've added some code to show how to iterate and join your results
  • Alastair McCormack over 8 years
    Ah, yes. I see why that would happen. I've updated the loadCsv method to return something.
  • data_runner almost 2 years
    Spent 3+ hours on this. This was helpful; however, I just read directly from the CSV like this, using the Windows encoding as suggested: pd.read_csv('data.csv', sep=';', encoding='windows-1251')