How to read UTF-8 files with Pandas?

Solution 1

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, this can still leave you looking at 'object' when you print the dtypes. To confirm that the values are unicode strings, try this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode
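On recent pandas versions pd.lib is no longer available, so here is a minimal sketch of the same check using pd.api.types.infer_dtype instead. The sample data and the temporary file name are made up for illustration and are not the file from the question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
csv_text = "text,retweet_count\nnaïve café,1\n日本語のツイート,2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Write the file as UTF-8 so the read below has something real to decode.
    with open(path, "w", encoding="utf8") as f:
        f.write(csv_text)
    df = pd.read_csv(path, encoding="utf8")

# Ask pandas what each column actually contains, independent of its dtype label.
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col.values))
print(inferred)
```

The point is that the dtype label printed by df.dtypes and the inferred content type are two different things; the text column is reported as containing strings even when its dtype is shown as object.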

Solution 2

Use the encoding keyword with the appropriate parameter:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

Solution 3

Pandas stores strings in columns of dtype object. In Python 3, all strings are unicode by default. So if you use Python 3, your data is already unicode (don't be misled by the object dtype).

If you are on Python 2, use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (assuming the first column consists of strings).
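A quick way to convince yourself of the Python 3 claim is to look at the Python type of the individual values rather than at the column dtype. This is only a sketch with made-up data:

```python
import pandas as pd

# Hypothetical column of non-ASCII text.
df = pd.DataFrame({"text": ["naïve café", "日本語"]})

# Collect the Python type names of the individual values in the column.
value_types = {type(v).__name__ for v in df["text"]}
print(value_types)

# Python 3 str values are unicode, so they survive a UTF-8 round trip unchanged.
round_trip = df["text"].iloc[0].encode("utf8").decode("utf8")
print(round_trip == df["text"].iloc[0])
```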

Solution 4

It looks like this function has moved. This worked for me on pandas 1.0.1:

df.apply(lambda x: pd.api.types.infer_dtype(x.values))

Author: Istvan

Updated on February 17, 2021

Comments

  • Istvan
    Istvan about 3 years

    I have a UTF-8 file with Twitter data and I am trying to read it into a pandas DataFrame, but I can only get an 'object' type instead of unicode strings:

    # file 1459966468_324.csv
    #1459966468_324.csv: UTF-8 Unicode English text
    df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
    df.dtypes
    text               object
    Airline            object
    name               object
    retweet_count     float64
    sentiment          object
    tweet_location     object
    dtype: object
    

    What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

    This does not solve the problem:

    df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
    df.apply(lambda x: pd.lib.infer_dtype(x.values))
    

    Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

    • Padraic Cunningham
      Padraic Cunningham about 8 years
      Using df.apply(lambda x: pd.lib.infer_dtype(x.values)) does show the types as unicode and mixed. If you look at the link above, you will see what is happening.
  • ayhan
    ayhan about 8 years
    It's because you have nan values. Try df.dropna(subset=["text"], inplace=True) first then Sam's suggestion will convert the text column to unicode in the file you provided.
  • rasen58
    rasen58 about 2 years
    Can someone explain why you even need to provide the encoding argument for pandas? The docs say the field is optional, so I assumed it would default to the Python 3 default of utf8, but that doesn't seem to be true: pandas.pydata.org/docs/reference/api/pandas.read_csv.html
  • BLT
    BLT about 2 years
    Perhaps it's a recent update---I had to add ._libs. So, df.apply(lambda x: pd._libs.lib.infer_dtype(x.values))
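To illustrate the NaN comment above: a rough sketch, with made-up data, of how a missing value can make a text column come back as "mixed", and how dropping the NaN rows changes that. It uses pd.api.types.infer_dtype and its skipna parameter rather than the older pd.lib path, and forces dtype=object so the column matches what the question describes.

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing value, held as an object column.
df = pd.DataFrame({"text": pd.Series(["hello", np.nan, "wörld"], dtype=object)})

# Counting the NaN as a value: strings plus a float NaN are reported as mixed.
with_nan = pd.api.types.infer_dtype(df["text"].values, skipna=False)

# Ignoring missing values during inference.
skipping_nan = pd.api.types.infer_dtype(df["text"].values, skipna=True)

# Dropping the NaN rows first, as suggested in the comment.
cleaned = df.dropna(subset=["text"])
after_drop = pd.api.types.infer_dtype(cleaned["text"].values, skipna=False)

print(with_nan, skipping_nan, after_drop)
```

Whether you drop the rows or just pass skipna=True depends on whether the missing tweets matter for the rest of your analysis.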