Missing values in Time Series in Python

python pandas nan imputation

35,114

Solution 1

Consider interpolate (available on both Series and DataFrame). This example shows how to fill gaps of any size with a straight line:

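import numpy as np
import pandas as pd
# Float values, so NaN can be stored without changing the column dtype.
# 'h' is the current hourly alias; older pandas versions spell it 'H'.
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='h'), 'value': np.arange(10.0)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df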
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    NaN
3 2013-01-01 03:00:00    NaN
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    NaN
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0

# Assign back instead of inplace=True, which may not update df when called on a selected column
df['value'] = df['value'].interpolate(method='linear')
df
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    2.0
3 2013-01-01 03:00:00    3.0
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    6.0
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0
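
The same call also works on a whole DataFrame, which matters when several columns have gaps. The following is a minimal sketch, assuming a small made-up frame; the values and the `weather` name are hypothetical, and the column names simply echo the ones mentioned in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with gaps in both columns
idx = pd.date_range(start='2013-01-01', periods=6, freq='D')
weather = pd.DataFrame(
    {
        'Humidity': [50.0, np.nan, 52.0, 53.0, np.nan, 55.0],
        'Pressure': [1000.0, 1001.0, np.nan, np.nan, 1004.0, 1005.0],
    },
    index=idx,
)

# DataFrame.interpolate fills each column independently
filled = weather.interpolate(method='linear')
print(filled)
```

Each column is interpolated on its own, so a gap in one column does not affect the other.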

Solution 2

Interpolate & fillna:

Since this is a time series question, I will use output graphs in the answer for explanation purposes.

Suppose we have time series data as follows (x axis = number of days, y axis = quantity):

pdDataFrame.set_index('Dates')['QUANTITY'].plot(figsize = (16,6))

We can see there is some NaN data in the time series: 19.400% of the total data is NaN. Now we want to impute the null/NaN values.
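
A percentage like the one above can be computed directly from the column. This is only a sketch on a small made-up Series (the `quantity` name and its values are hypothetical, not the data from the plot):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the QUANTITY column
quantity = pd.Series([10.0, np.nan, 12.0, 13.0, np.nan])

# isna() gives a boolean mask; its mean is the fraction of missing values
nan_percent = quantity.isna().mean() * 100
print(f"{nan_percent:.3f}% of values are NaN")
```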

I will show the output of the interpolate and fillna methods for filling NaN values in the data.

interpolate():

First we will use interpolate:

pdDataFrame.set_index('Dates')['QUANTITY'].interpolate(method='linear').plot(figsize = (16,6))

NOTE: method='time' is not used here. pandas does offer it; it weights the fill by the actual time gaps and requires a DatetimeIndex.
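
To see when the choice between method='linear' and method='time' matters, here is a small sketch on made-up, irregularly spaced dates (the Series `s` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Irregular spacing: one day between the first two points, three days to the last
idx = pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-05'])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'linear' ignores the index and treats the rows as equally spaced
linear = s.interpolate(method='linear')

# 'time' uses the DatetimeIndex, so the fill is proportional to elapsed time
by_time = s.interpolate(method='time')

print(linear.iloc[1], by_time.iloc[1])
```

With 'linear' the gap is filled with the midpoint of its neighbors, while 'time' accounts for the gap sitting one day into a four-day interval. On a uniformly spaced index the two methods give the same result.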

fillna() with backfill method

Recent pandas versions deprecate fillna(method='backfill') in favor of the equivalent bfill(), which is used here:

pdDataFrame.set_index('Dates')['QUANTITY'].bfill().plot(figsize = (16,6))

fillna() with backfill method & limit = 7

limit: the maximum number of consecutive NaN values to forward/backward fill. In other words, a gap with more than this number of consecutive NaNs will only be partially filled.

pdDataFrame.set_index('Dates')['QUANTITY'].bfill(limit=7).plot(figsize = (16,6))
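
The effect of limit is easier to see on a tiny example. The following sketch uses a made-up Series with a gap of three NaNs and a limit of 2 (both numbers are chosen only for illustration):

```python
import numpy as np
import pandas as pd

# A gap of three consecutive NaNs between two known values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Backfill at most 2 consecutive NaNs, working backwards from the next valid value
partial = s.bfill(limit=2)
print(partial)
```

The two NaNs closest to the next valid value are filled with 5.0, and the remaining one stays NaN because the gap is longer than the limit.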

I find the fillna function more useful, but you can use either method to fill NaN values in both columns.

For more details about these functions, refer to the pandas documentation for interpolate() and fillna().

There is one more library, impyute, that you can check out. For more details, see https://pypi.org/project/impyute/

Solution 3

You could use rolling like this:

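import numpy as np
import pandas as pd

# Float values, so NaN can be stored without changing the column dtype
frame = pd.DataFrame({'Humidity': np.arange(50, 64, dtype=float)})
frame.loc[[3, 7, 10, 11], 'Humidity'] = np.nan

# Fill each NaN with the mean of the non-NaN values in the trailing 4-row window
frame.Humidity.fillna(frame.Humidity.rolling(4, min_periods=1).mean())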

Output:

0     50.0
1     51.0
2     52.0
3     51.0
4     54.0
5     55.0
6     56.0
7     55.0
8     58.0
9     59.0
10    58.5
11    58.5
12    62.0
13    63.0
Name: Humidity, dtype: float64
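
The rolling mean above only looks backwards. If the goal is instead the average of the previous and following valid values, as the question asks, one possible sketch is to combine a forward fill and a backward fill. The Series below is made up for illustration, and this is just one way to do it:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a single NaN and a gap of two NaNs
s = pd.Series([50.0, 51.0, np.nan, 53.0, np.nan, np.nan, 56.0])

# ffill carries the previous valid value forward, bfill the next one backward;
# their average is the mean of the two nearest valid neighbors
neighbor_mean = (s.ffill() + s.bfill()) / 2

# Only the missing positions are replaced; existing values are untouched
filled = s.fillna(neighbor_mean)
print(filled)
```

Every NaN inside a gap gets the same value with this approach, and NaNs at the very start or end of the series stay NaN because they have only one valid neighbor.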


Author by

Marco Miglionico

I am working as an Artificial Intelligence (AI) Architect at Stanley Black & Decker Digital Accelerator in Atlanta. I have a double master's degree in Computer Science, from Politecnico di Milano and The University of Illinois at Chicago. My research interests are Machine Learning, Deep Learning and Computer Vision.

Updated on November 20, 2020

Comments

Marco Miglionico over 3 years

I have a time series dataframe. It is quite big and contains some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180,000 examples and it gives a memory error.

Mohammad ElNesr over 5 years

Time series are not linear. Consider the temperature over the year: it follows a sinusoidal pattern, and the value is affected by many factors: 1. the seasonality, 2. the trend, 3. other random factors. In R there is a package called imputeTS which does this; I don't know of an equivalent Python package.

Mohammad ElNesr over 5 years

I think we have to use method='time' instead of method='linear'.

Behrouz Beheshti over 4 years

Thanks @MohammadElNesr for your input. That helped me a lot.

Peter Leimbigler about 3 years

@MohammadElNesr, thanks for adding information about method='time'. I used the linear progression range(10) as a simple example, but any sequence of values would work to demonstrate that if the date series is uniformly spaced, as it is in the original question, there will be no difference between method='linear' and method='time'.