How to read UTF-8 files with Pandas?
Solution 1
As the other poster mentioned, you might try:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
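However, this could still leave you looking at 'object' when you print the dtypes. To confirm the values are unicode strings, try this line after reading the CSV (on newer pandas versions the function lives at pd.api.types.infer_dtype instead of pd.lib.infer_dtype):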
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Example output:
args unicode
date datetime64
host unicode
kwargs unicode
operation unicode
Solution 2
Use the encoding keyword with the appropriate parameter:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
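A minimal sketch of what the encoding keyword changes, assuming a small made-up CSV (the contents and column names below are illustrative, not the asker's data). The same bytes are decoded once as UTF-8 and once, wrongly, as Latin-1:

```python
import io

import pandas as pd

# Hypothetical sample data containing non-ASCII characters.
csv_text = "name,text\nZoë,café\nJürgen,naïve\n"
raw_bytes = csv_text.encode("utf-8")

# Decoding with the correct encoding recovers the original text.
df_utf8 = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")

# Decoding the same bytes as Latin-1 does not raise, but yields garbled text.
df_latin1 = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(df_utf8["text"].tolist())
print(df_latin1["text"].tolist())
```

A wrong encoding often fails silently like this rather than raising an error, which is one reason to pass the encoding explicitly when the source of the file is known.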
Solution 3
Pandas stores strings as objects. In Python 3, all strings are unicode by default, so if you use Python 3 your data is already in unicode (don't be misled by the type object).
If you have Python 2, then use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I guess the first column consists of strings).
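One way to check this on Python 3 is to look at the Python type of the values themselves rather than at the column dtype. A small sketch, assuming an in-memory CSV with made-up contents:

```python
import io

import pandas as pd

# Hypothetical sample; the column names are illustrative only.
csv_text = "text,retweet_count\nhéllo,1\nwörld,2\n"
df = pd.read_csv(io.StringIO(csv_text))

# The reported column dtype depends on the pandas version (commonly object),
# but each value is an ordinary Python 3 str, which is unicode.
value_types = {type(v) for v in df["text"]}
print(df["text"].dtype, value_types)
```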
Solution 4
Looks like the location of this function has moved. This worked for me on 1.0.1:
df.apply(lambda x: pd.api.types.infer_dtype(x.values))
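A self-contained sketch of the same check, using the public pd.api.types.infer_dtype location. The frame below is a made-up stand-in for the CSV, and the labels returned may differ on older pandas or Python 2 setups (where the example output above showed unicode):

```python
import pandas as pd

# Hypothetical frame standing in for the data read from the CSV.
df = pd.DataFrame(
    {
        "text": ["héllo", "wörld"],
        "retweet_count": [1.0, 2.0],
    }
)

# infer_dtype inspects the values and returns a label such as "string" or "floating".
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col))
print(inferred)
```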
Istvan
Updated on February 17, 2021
Comments
-
Istvan about 3 years
I have a UTF-8 file with twitter data and I am trying to read it into a Python data frame but I can only get an 'object' type instead of unicode strings:
# file 1459966468_324.csv
# 1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text object
Airline object
name object
retweet_count float64
sentiment object
tweet_location object
dtype: object
What is the right way of reading and coercing UTF-8 data into unicode with Pandas?
This does not solve the problem:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv
-
Padraic Cunningham about 8 years
Using df.apply(lambda x: pd.lib.infer_dtype(x.values)) does show types as unicode and mixed; if you look at the link above you will see what is happening.
-
ayhan about 8 years
It's because you have nan values. Try df.dropna(subset=["text"], inplace=True) first; then Sam's suggestion will convert the text column to unicode in the file you provided.
-
rasen58 about 2 years
Can someone explain why you even need to provide the encoding argument for pandas? Looking at the docs, it says the field is optional, so I assume it would default to the Python 3 default of utf8, but it doesn't seem like that's true: pandas.pydata.org/docs/reference/api/pandas.read_csv.html
-
BLT about 2 years
Perhaps it's a recent update, but I had to add ._libs. So: df.apply(lambda x: pd._libs.lib.infer_dtype(x.values))
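The effect of missing values mentioned in the comments can be sketched with a small made-up column. This assumes a pandas version where infer_dtype accepts a skipna argument; the data below is illustrative, not the asker's file:

```python
import numpy as np
import pandas as pd

# Hypothetical text column with one missing value, kept as object dtype.
df = pd.DataFrame({"text": pd.Series(["héllo", np.nan, "wörld"], dtype=object)})

# With skipna=False the NaN (a float) makes the column look mixed.
before = pd.api.types.infer_dtype(df["text"], skipna=False)

# Dropping rows with a missing text value leaves only strings.
cleaned = df.dropna(subset=["text"])
after = pd.api.types.infer_dtype(cleaned["text"], skipna=False)

print(before, after)
```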