Pandas read_csv fills empty values with string 'nan', instead of parsing date
Solution 1
This is currently a buglet in the parser; see https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in. This populates the NaNs with NaT, the Not-a-Time marker, which is the datetime equivalent of NaN. This should work on 0.10.1.
In [22]: df
Out[22]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 NaN d
4 6 2013-3-1 d
In [23]: df.dtypes
Out[23]:
value int64
date object
id object
dtype: object
In [24]: pd.to_datetime(df['date'])
Out[24]:
0 2013-03-01 00:00:00
1 2013-03-01 00:00:00
2 2013-03-01 00:00:00
3 NaT
4 2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
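For reference, the workaround above can be sketched as a self-contained script. This is a minimal sketch, not the exact session from the answer: the in-memory `csv_text` is a hypothetical stand-in for the file, with zero-padded dates and one empty date field, and it assumes a reasonably recent pandas.

```python
import io

import pandas as pd

# Hypothetical CSV contents (an assumption for illustration);
# the empty field on the "5,,d" row is the missing date.
csv_text = """value,date,id
2,2013-03-01,a
3,2013-03-01,b
4,2013-03-01,c
5,,d
6,2013-03-01,d
"""

# Read without parse_dates, so the date column arrives as plain text
# with NaN in place of the empty field.
df = pd.read_csv(io.StringIO(csv_text))

# Force-convert after reading; the missing entry becomes NaT.
df["date"] = pd.to_datetime(df["date"])

print(df["date"].isnull().tolist())
```

After the conversion, `isnull()` picks up the missing row because NaT counts as a null value.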
If the string 'nan' actually appears in your data, you can do this:
In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])
In [32]: s
Out[32]:
0 2013-1-1
1 2013-1-1
2 nan
3 2013-1-1
dtype: object
In [39]: s[s=='nan'] = np.nan
In [40]: s
Out[40]:
0 2013-1-1
1 2013-1-1
2 NaN
3 2013-1-1
dtype: object
In [41]: pd.to_datetime(s)
Out[41]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00
2 NaT
3 2013-01-01 00:00:00
dtype: datetime64[ns]
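The same cleanup can be written without in-place assignment. The sketch below is one possible variant, not the answer's own code: it uses `Series.mask` to blank out the literal string 'nan', and passes `errors="coerce"` to `pd.to_datetime` as an extra safety net (an addition of this sketch) so that any other unparseable entry also becomes NaT.

```python
import pandas as pd

# Sample data mirroring the session above, with zero-padded dates.
s = pd.Series(["2013-01-01", "2013-01-01", "nan", "2013-01-01"])

# Replace the literal string 'nan' with a real missing value.
cleaned = s.mask(s == "nan")

# Convert to datetimes; errors="coerce" turns anything unparseable into NaT.
dates = pd.to_datetime(cleaned, errors="coerce")

print(dates.isnull().tolist())
```

Note that `errors="coerce"` silently hides genuinely malformed dates, so it may be worth dropping if you would rather see an error for bad input.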
Solution 2
You can pass the na_values=["nan"] parameter in your read_csv function call. That will read the string nan values and convert them to the proper np.nan format.
See here for more info.
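A rough sketch of this approach, assuming a recent pandas and a hypothetical in-memory `csv_text` in place of the real file. Because the comments on this page report that `na_values` did not help for date-parsed columns on the version discussed, the sketch also applies `pd.to_datetime` afterwards as a fallback; that extra step is an addition of this sketch, not part of the answer.

```python
import io

import pandas as pd

# Hypothetical CSV contents (an assumption for illustration):
# one literal 'nan' and one empty field in the date column.
csv_text = """value,date,id
2,2013-03-01,a
5,nan,d
6,,d
"""

# Treat both the string 'nan' and the empty string as missing,
# and ask read_csv to parse the date column.
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["nan", ""],
    parse_dates=["date"],
)

# Fallback: if the parser left the column as text, convert it explicitly.
# On an already-parsed datetime column this leaves the values unchanged.
df["date"] = pd.to_datetime(df["date"])

print(df["date"].isnull().tolist())
```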
Admin
Updated on July 01, 2022
Comments
-
Admin almost 2 years
I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. The resulting csv file correctly has nothing between the commas for the missing values if I open the file with a text editor. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:
In [13]: df
Out[13]:
index value date
0 975 25.35 nan
1 976 26.28 nan
2 977 26.24 nan
3 978 25.76 nan
4 979 26.08 nan
In [14]: df.date.isnull()
Out[14]:
0 False
1 False
2 False
3 False
4 False
Am I doing anything wrong? Should I assign some other value instead of np.nan to the missing values so that isnull() would be able to pick them up?
EDIT: Sorry, forgot to mention that I also set parse_dates = [2] to parse that column. That column contains dates with some rows missing. I would like to have the missing rows be NaN.
EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column. Instead, it will read the dates as strings and assign the string 'nan' to the empty values.
In [21]: data = pd.read_csv('test.csv', parse_dates = [1])
In [22]: data
Out[22]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 nan d
4 6 2013-3-1 d
In [23]: data.date[3]
Out[23]: 'nan'
pd.to_datetime does not work either:
In [12]: data
Out[12]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 nan d
4 6 2013-3-1 d
In [13]: data.dtypes
Out[13]:
value int64
date object
id object
In [14]: pd.to_datetime(data['date'])
Out[14]:
0 2013-3-1
1 2013-3-1
2 2013-3-1
3 nan
4 2013-3-1
Name: date
Is there a way to make read_csv's parse_dates work with columns that contain missing values? I.e., assign NaN to missing values and still parse the valid dates?
-
Admin about 11 years
Sorry, maybe I didn't explain clearly. I am not trying to categorize the string 'nan' as NaN. What I am saying is, read_csv reads the empty values in the csv file into the string 'nan' as opposed to NaN. If I open the csv file with a text editor, there is nothing between the two commas.
-
bdiamante about 11 years
Try na_values=['nan', '']. This should read both the string nan and blank string values in as np.nan.
-
Admin about 11 years
This still does not work. I think the na_values option does not apply to the columns that are being parsed as dates. The problem is really that parse_dates does not work for columns with missing values.
-
Admin about 11 years
Does to_datetime work with the string 'nan'? It still does not work for me. It looks like your df.date already contains a valid NaN, while read_csv gives me a string 'nan'. Please see my edit. Thanks.
-
Jeff about 11 years
Try the updated solution (this is somewhat manual), but with na_values=['nan'] passed to read_csv you could achieve this pretty easily.
-
Admin about 11 years
I considered doing this manually. But the fundamental issue is that if you ask read_csv to parse a column as dates and that column contains missing values, read_csv does not parse the dates and puts a string 'nan' in place of the missing values. Therefore, na_values=['nan'] will do nothing because the 'nan' is not present in the original csv file, as your update implies.