Pandas read_csv fills empty values with string 'nan', instead of parsing date


Solution 1

This is currently a buglet in the parser; see: https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in (this will populate the empty values with NaT, the Not-A-Time marker, the datetime equivalent of nan). This should work on 0.10.1:

In [22]: df
Out[22]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]: 
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]: 
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]

If the string 'nan' actually appears in your data, you can do this:

In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]: 
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s=='nan'] = np.nan

In [40]: s
Out[40]: 
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pd.to_datetime(s)
Out[41]: 
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]
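In later pandas versions the manual replacement above is unnecessary: pd.to_datetime accepts errors='coerce', which turns anything unparseable (including a literal 'nan' string) into NaT directly. A minimal sketch of the same example using that option:

```python
import pandas as pd

# The 'nan' entry is a literal string, as produced by the buggy parse_dates path
s = pd.Series(['2013-1-1', '2013-1-1', 'nan', '2013-1-1'])

# errors='coerce' maps any unparseable value to NaT instead of raising
parsed = pd.to_datetime(s, errors='coerce')

print(parsed.isnull().tolist())
print(parsed.dtype)
```

With this, isnull() works on the result without any intermediate string cleanup.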

Solution 2

You can pass the na_values=["nan"] parameter in your read_csv call. That tells the parser to treat the literal string nan as a missing value and convert it to a proper np.nan.
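A minimal sketch of that call, using an in-memory CSV with the same shape as the question's data (the file contents here are illustrative, not the asker's actual test.csv):

```python
import io
import pandas as pd

# Stand-in for the question's csv file; the empty date field is on row 'd'
csv_text = """value,date,id
2,2013-3-1,a
3,2013-3-1,b
4,2013-3-1,c
5,,d
6,2013-3-1,e
"""

# na_values=['nan'] catches a literal 'nan' string; truly empty fields
# are already treated as missing by default
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], na_values=['nan'])

print(df['date'].isnull().tolist())
print(df['date'].dtype)
```

In current pandas versions the date column comes back as datetime64[ns] with NaT in the missing slot, which is exactly the behavior the asker wanted.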


Updated on July 01, 2022

Comments

  • Admin
    Admin almost 2 years

    I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. The resulting csv file correctly has nothing between the commas for the missing values if I open the file with a text editor. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:

    In [13]: df
    Out[13]: 
       index  value date
    0    975  25.35  nan
    1    976  26.28  nan
    2    977  26.24  nan
    3    978  25.76  nan
    4    979  26.08  nan
    
    In [14]: df.date.isnull()
    Out[14]: 
    0    False
    1    False
    2    False
    3    False
    4    False
    

    Am I doing anything wrong? Should I assign some other value instead of np.nan to the missing values so that isnull() would be able to pick them up?

    EDIT: Sorry, forgot to mention that I also set parse_dates = [2] to parse that column. That column contains dates with some rows missing. I would like to have the missing rows be NaN.

    EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column. Instead, it reads the dates as strings and assigns the string 'nan' to the empty values.

    In [21]: data = pd.read_csv('test.csv', parse_dates = [1])
    
    In [22]: data
    Out[22]: 
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d
    
    In [23]: data.date[3]
    Out[23]: 'nan'
    

    pd.to_datetime does not work either:

    In [12]: data
    Out[12]: 
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d
    
    In [13]: data.dtypes
    Out[13]: 
    value     int64
    date     object
    id       object
    
    In [14]: pd.to_datetime(data['date'])
    Out[14]: 
    0    2013-3-1
    1    2013-3-1
    2    2013-3-1
    3         nan
    4    2013-3-1
    Name: date
    

    Is there a way to get read_csv's parse_dates to work with columns that contain missing values? I.e., assign NaN to the missing values and still parse the valid dates?

  • Admin
    Admin about 11 years
    Sorry, maybe I didn't explain clearly. I am not trying to categorize the string 'nan' as NaN. What I am saying is, read_csv reads the empty values in the csv file as the string 'nan', as opposed to NaN. If I open the csv file with a text editor, there is nothing between the two commas.
  • bdiamante
    bdiamante about 11 years
    Try na_values=['nan', '']. This should read both the string nan and blank string values in as np.nan.
  • Admin
    Admin about 11 years
    This still does not work. I think the na_values option does not apply to columns that are being parsed as dates. The problem is really that parse_dates does not work for columns with missing values.
  • Admin
    Admin about 11 years
    Does to_datetime work with the string 'nan'? It still does not work for me. It looks like your df.date already contains a valid NaN, while read_csv gives me a string 'nan'. Please see my edit. Thanks.
  • Jeff
    Jeff about 11 years
    Try the updated solution (it is somewhat manual), but with na_values=['nan'] passed to read_csv you could achieve this pretty easily.
  • Admin
    Admin about 11 years
    I considered doing this manually. But the fundamental issue is that if you ask read_csv to parse a column as dates and that column contains missing values, read_csv will not parse the dates and will put the string 'nan' in place of the missing values. Therefore, na_values=['nan'] will do nothing, because 'nan' is not present in the original csv file, as your update implies.