Pandas read_csv fills empty values with string 'nan', instead of parsing date


Solution 1

This is currently a buglet in the parser; see: https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in (this will populate the empty values with NaT, the Not-A-Time marker, the datetime equivalent of nan). This should work on 0.10.1:

In [22]: df
Out[22]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]: 
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]: 
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]

If the string 'nan' actually appears in your data, you can do this:

In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]: 
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s=='nan'] = np.nan

In [40]: s
Out[40]: 
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pd.to_datetime(s)
Out[41]: 
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]
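In later pandas versions the manual replacement above is unnecessary: pd.to_datetime accepts errors='coerce', which turns anything unparseable (including a literal 'nan' string) into NaT directly. A minimal sketch of the same example using that option:

```python
import pandas as pd

# The 'nan' entry is a literal string, as produced by the buggy parse_dates path
s = pd.Series(['2013-1-1', '2013-1-1', 'nan', '2013-1-1'])

# errors='coerce' maps any unparseable value to NaT instead of raising
parsed = pd.to_datetime(s, errors='coerce')

print(parsed.isnull().tolist())
print(parsed.dtype)
```

With this, isnull() works on the result without any intermediate string cleanup.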

Solution 2

You can pass the na_values=["nan"] parameter in your read_csv call. That tells the parser to treat the literal string nan as a missing value and convert it to a proper np.nan.
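A minimal sketch of that call, using an in-memory CSV with the same shape as the question's data (the file contents here are illustrative, not the asker's actual test.csv):

```python
import io
import pandas as pd

# Stand-in for the question's csv file; the empty date field is on row 'd'
csv_text = """value,date,id
2,2013-3-1,a
3,2013-3-1,b
4,2013-3-1,c
5,,d
6,2013-3-1,e
"""

# na_values=['nan'] catches a literal 'nan' string; truly empty fields
# are already treated as missing by default
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], na_values=['nan'])

print(df['date'].isnull().tolist())
print(df['date'].dtype)
```

In current pandas versions the date column comes back as datetime64[ns] with NaT in the missing slot, which is exactly the behavior the asker wanted.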


Updated on July 01, 2022

Comments

  • Admin
    Admin almost 2 years

    I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. The resulting csv file correctly has nothing between the commas for the missing values if I open the file with a text editor. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:

    In [13]: df
    Out[13]: 
       index  value date
    0    975  25.35  nan
    1    976  26.28  nan
    2    977  26.24  nan
    3    978  25.76  nan
    4    979  26.08  nan
    
    In [14]: df.date.isnull()
    Out[14]: 
    0    False
    1    False
    2    False
    3    False
    4    False
    

    Am I doing anything wrong? Should I assign some other value instead of np.nan to the missing values so that isnull() would be able to pick them up?

    EDIT: Sorry, forgot to mention that I also set parse_dates = [2] to parse that column. That column contains dates with some rows missing. I would like to have the missing rows be NaN.

    EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column. Instead, it reads the dates as strings and assigns the string 'nan' to the empty values.

    In [21]: data = pd.read_csv('test.csv', parse_dates = [1])
    
    In [22]: data
    Out[22]: 
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d
    
    In [23]: data.date[3]
    Out[23]: 'nan'
    

    pd.to_datetime does not work either:

    In [12]: data
    Out[12]: 
       value      date id
    0      2  2013-3-1  a
    1      3  2013-3-1  b
    2      4  2013-3-1  c
    3      5       nan  d
    4      6  2013-3-1  d
    
    In [13]: data.dtypes
    Out[13]: 
    value     int64
    date     object
    id       object
    
    In [14]: pd.to_datetime(data['date'])
    Out[14]: 
    0    2013-3-1
    1    2013-3-1
    2    2013-3-1
    3         nan
    4    2013-3-1
    Name: date
    

    Is there a way to get read_csv's parse_dates to work with columns that contain missing values? I.e., assign NaN to the missing values and still parse the valid dates?

  • Admin
    Admin about 11 years
    Sorry, maybe I didn't explain clearly. I am not trying to categorize the string 'nan' as NaN. What I am saying is, read_csv reads the empty values in the csv file as the string 'nan', as opposed to NaN. If I open the csv file with a text editor, there is nothing between the two commas.
  • bdiamante
    bdiamante about 11 years
    Try na_values=['nan', '']. This should read both the string nan and blank string values in as np.nan.
  • Admin
    Admin about 11 years
    This still does not work. I think the na_values option does not apply to columns that are being parsed as dates. The problem is really that parse_dates does not work for columns with missing values.
  • Admin
    Admin about 11 years
    Does to_datetime work with the string 'nan'? It still does not work for me. It looks like your df.date already contains a valid NaN, while read_csv gives me a string 'nan'. Please see my edit. Thanks.
  • Jeff
    Jeff about 11 years
    Try the updated solution (it is somewhat manual), but with na_values=['nan'] passed to read_csv you could achieve this pretty easily.
  • Admin
    Admin about 11 years
    I considered doing this manually. But the fundamental issue is that if you ask read_csv to parse a column as dates and that column contains missing values, read_csv will not parse the dates and will put the string 'nan' in place of the missing values. Therefore, na_values=['nan'] will do nothing, because 'nan' is not present in the original csv file, as your update implies.