How to read UTF-8 files with Pandas?

python csv pandas utf-8

132,918

Solution 1

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, printing the dtypes may still show 'object' for these columns. To confirm that the values really are Unicode strings, run this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode
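The check above was written for Python 2 and an older pandas. Here is a minimal sketch of the same idea on Python 3, assuming a pandas version that exposes `pd.api.types.infer_dtype`. The in-memory CSV and its column names are made up for illustration and stand in for the real file. On Python 3 the reported label is `string` rather than `unicode`.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the CSV file on disk.
raw = "text,retweet_count\nhéllo wörld,1\nこんにちは,2\n".encode("utf-8")

# Decode the bytes as UTF-8 while parsing.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# dtypes only describes the storage type of each column.
print(df.dtypes)

# infer_dtype inspects the values and reports what they actually are.
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col.values))
print(inferred)
```

Here `inferred` is a Series indexed by column name, so `inferred["text"]` gives the label for a single column.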

Solution 2

Pass the encoding keyword with the appropriate value:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

Solution 3

Pandas stores strings in columns of dtype object. In Python 3, all strings are Unicode by default, so if you use Python 3 your data is already Unicode (don't be misled by the object dtype).
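A small sketch of that point, assuming Python 3. The sample values are made up. The column is declared as object explicitly so the result does not depend on the pandas version's default string dtype.

```python
import pandas as pd

# An object column holding ordinary Python 3 str values (hypothetical data).
s = pd.Series(["naïve", "日本語"], dtype=object)

print(s.dtype)          # the column dtype is object
print(type(s.iloc[0]))  # each element is a plain Python str, i.e. Unicode text

# Encoding is only needed when bytes are required, e.g. for writing to a socket.
encoded = s.str.encode("utf-8")
print(type(encoded.iloc[0]))
```

The object dtype only says that the column holds arbitrary Python objects. The elements themselves are `str` instances.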

If you are on Python 2, use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (assuming the first column consists of strings).

Solution 4

It looks like this function has moved. This worked for me on pandas 1.0.1:

df.apply(lambda x: pd.api.types.infer_dtype(x.values))

Istvan

Hands On Data & Cloud Architect with Leadership Experience. Things I care about: Systems Engineering Data Engineering Machine Learning Functional Programming

Updated on February 17, 2021

Comments

Istvan about 3 years
I have a UTF-8 file with twitter data and I am trying to read it into a Python data frame but I can only get an 'object' type instead of unicode strings:
```
# file 1459966468_324.csv
#1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text               object
Airline            object
name               object
retweet_count     float64
sentiment          object
tweet_location     object
dtype: object
```
What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

This does not solve the problem:
```
df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
```
Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv
- Padraic Cunningham about 8 years
  
  stackoverflow.com/a/20670901/2141635
- Padraic Cunningham about 8 years
  
  Using df.apply(lambda x: pd.lib.infer_dtype(x.values)) does show the types as unicode and mixed. If you look at the link above, you will see what is happening.
ayhan about 8 years

It's because you have NaN values. Try df.dropna(subset=["text"], inplace=True) first; then Sam's suggestion will convert the text column to unicode in the file you provided.
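A minimal sketch of the NaN effect described here, assuming Python 3 and a pandas version where `pd.api.types.infer_dtype` accepts the `skipna` argument. The sample rows are made up. Passing `skipna=False` makes the check count missing values, which is what produces the mixed result.

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing value.
df = pd.DataFrame({"text": ["first tweet", np.nan, "third tweet"]}, dtype=object)

# With the NaN counted, the column is a mix of str and float values.
with_nan = pd.api.types.infer_dtype(df["text"], skipna=False)

# Dropping the rows with a missing text leaves only strings.
cleaned = df.dropna(subset=["text"])
without_nan = pd.api.types.infer_dtype(cleaned["text"], skipna=False)

print(with_nan, without_nan)
```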
rasen58 about 2 years

Can someone explain why you even need to provide the encoding argument for pandas? The docs say the field is optional, so I assumed it would default to the Python 3 default of utf8, but that doesn't seem to be true: pandas.pydata.org/docs/reference/api/pandas.read_csv.html
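One possible explanation, offered as a sketch and not an authoritative answer: as far as I can tell, read_csv decodes input as UTF-8 when encoding is omitted, so the argument mostly matters when the file is in some other encoding. The example below assumes a recent pandas on Python 3 and uses made-up Latin-1 bytes in place of a real file.

```python
import io

import pandas as pd

# Hypothetical file content saved as Latin-1, not UTF-8.
latin1_bytes = "name\ncafé\n".encode("latin-1")

# Without an encoding argument the bytes are decoded as UTF-8 and fail.
try:
    pd.read_csv(io.BytesIO(latin1_bytes))
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the real encoding lets the same bytes parse cleanly.
df_latin1 = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")
print(failed, df_latin1["name"].iloc[0])
```

If a file that is supposed to be UTF-8 still fails without the argument, it may be worth checking whether it was actually saved in a different encoding.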
BLT about 2 years

Perhaps it's a recent update; I had to add ._libs. So: df.apply(lambda x: pd._libs.lib.infer_dtype(x.values))