Replace NaN or missing values with rolling mean or other interpolation

25,957

Solution 1

There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

df=pd.DataFrame({ 'month' : [10,11,12,1,2,3],
                  'temp'  : [65,50,45,np.nan,40,43] }).set_index('month')

You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()

Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).

df['update'] = df['rollmean3']
df['update'].update( df['temp'] )  # note: this is an inplace operation

There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.

df['ffill']   = df['temp'].ffill()         # previous month 
df['bfill']   = df['temp'].bfill()         # next month
df['interp']  = df['temp'].interpolate()   # mean of prev/next

In this case, interpolate() defaults to simple linear interpretation, but you have several other intepolation options also. See documentation on pandas interpolate for more info. Or this statck overflow question: Interpolation on DataFrame in pandas

Here is the sample data with all the results:

       temp  rollmean12  rollmean3  update  ffill  bfill  interp
month                                                           
10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0

In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.

Solution 2

The real key is having min_periods=1. Also, as of version 18, the proper calling is with a Rolling object. Therefore, your code should be

data["variable"].rolling(min_periods=1, center=True, window=12).mean().

Share:
25,957
Alexis Eggermont
Author by

Alexis Eggermont

Updated on August 07, 2020

Comments

  • Alexis Eggermont
    Alexis Eggermont almost 4 years

    I have a pandas dataframe with monthly data that I want to compute a 12 months moving average for. Data for for every month of January is missing, however (NaN), so I am using

    pd.rolling_mean(data["variable"]), 12, center=True)
    

    but it just gives me all NaN values.

    Is there a simple way that I can ignore the NaN values? I understand that in practice this would become a 11-month moving average.

    The dataframe has other variables which have January data, so I don't want to just throw out the January columns and do an 11 month moving average.

  • MyCarta
    MyCarta over 7 years
    I had a similar problem and the min_periods=1 is what solved it for me. +1 for suggesting the rolling object
  • JohnE
    JohnE almost 7 years
    @user394430 -- FYI, I substantially improved my answer, including your use of min_periods and also comparing the different ways of doing this. In my anwer the column labelled "rollmean12" should correspond to your answer.