Smoothing online data on Python with savgol_filter from scipy.signal library


Solution 1

It looks like the Kalman filter family does what I expect, because these filters are optimal in the mean-square-error sense. An implementation can be found here, for example.
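
For completeness, here is a minimal sketch of the idea (not the implementation linked above): a one-dimensional Kalman filter with a random-walk state model, applied to the same sample data as in the question. The function name and the process_var/meas_var tuning values are illustrative assumptions.

import numpy as np

def kalman_smooth_stream(stream, process_var=1e-2, meas_var=1.0):
    # state estimate x and its variance p, initialised from the first sample
    x, p = stream[0], 1.0
    out = []
    for z in stream:
        p = p + process_var        # predict: the state is modelled as a random walk
        k = p / (p + meas_var)     # Kalman gain
        x = x + k * (z - x)        # update with the newest measurement
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)

d = [2.22, 2.22, 5.55, 2.22, 1.11, 0.01, 1.11, 4.44, 9.99, 1.11, 3.33]
print(kalman_smooth_stream(d))

Each output value depends only on the samples seen so far, so there is no window_length // 2 delay; the trade-off is that the smoothing quality depends on how process_var and meas_var are tuned.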

Solution 2

Thanks for updating your question!

The problem is that with your online_res approach you are missing parts of your data. The edge values are taken care of by scipy's savgol_filter, but not by your hand-coded version.

For your example have a look at the two results:

'online res': array([ 3.93, 3.17, 0.73, 0.2 , 1.11, 5.87, 6.37])

'offline res': array([ 1.84, 3.52, 3.93, 3.17, 0.73, 0.2 , 1.11, 5.87, 6.37, 5.3, 1.84])

The values they share are identical, but offline res also took care of the edge values data[0:2] and data[-2:]. In your case, where no specific mode is given, savgol_filter falls back to the default, 'interp':

When the ‘interp’ mode is selected (the default), no extension is used. Instead, a degree polyorder polynomial is fit to the last window_length values of the edges, and this polynomial is used to evaluate the last window_length // 2 output values.

And THIS you did not do for your online res.

I implemented a simple polynomial fit for both sides and then got exactly the same results:

from queue import Queue, Empty
import numpy as np
from scipy.signal import savgol_filter

window_size = 5
data = list()
q = Queue()
d = [2.22, 2.22, 5.55, 2.22, 1.11, 0.01, 1.11, 4.44, 9.99, 1.11, 3.33]
for i in d:
    q.put(i)

res = list()
while not q.empty():
    element = q.get()
    data.append(element)
    length = len(data)
    npd = np.array(data[length - window_size:])
    if length >= window_size:
        res.append(savgol_filter(npd, window_size, 2)[window_size//2])

# calculate the polynomial fit for elements 0,1,2,3,4
poly = np.polyfit(range(window_size), d[0:window_size], deg=2)
p = np.poly1d(poly)
res.insert(0, p(0)) # insert the polynomial fits at index 0 and 1
res.insert(1, p(1))

# calculate the polynomial fit for the 5 last elements (range runs like [4,3,2,1,0])
poly = np.polyfit(range(window_size-1, -1, -1), d[-window_size:], deg=2)
p = np.poly1d(poly)
res.append(p(1))
res.append(p(0))

npd = np.array(data)
res2 = savgol_filter(npd, window_size, 2)


diff = res - res2 # in your example you were calculating the wrong diff btw
np.set_printoptions(precision=2)
print('source data ', npd)
print('online res  ', np.array(res))
print('offline res ', res2)
print('error       ', diff.sum())

results in:

>>> Out: ('error       ', -7.9936057773011271e-15)

Edit: This version is independent of the d-list, meaning that it can digest whatever data it grabs from your source.

# imports as in the first snippet: Queue, numpy as np, savgol_filter
window_size = 5
half_window_size = window_size // 2 # this variable is used often
data = list()
q = Queue()
d = [2.22, 2.22, 5.55, 2.22, 1.11, 0.01, 1.11, 4.44, 9.99, 1.11, 3.33]
for i in d:
    q.put(i)  
res = [None]*window_size # create list of correct size instead of appending

while not q.empty():
    element = q.get()
    data.append(element)
    length = len(data)
    npd = np.array(data[length - window_size:])

    if length == window_size: # this is called only once, when reaching the filter-center
        # calculate the polynomial fit for elements 0,1,2,3,4
        poly = np.polyfit(range(window_size), data, deg=2)
        p = np.poly1d(poly)

        for poly_i in range(half_window_size): # independent from window_size
            res[poly_i] = p(poly_i) 

        # insert the sav_gol-value at index 2
        res[(length-1)-half_window_size] = savgol_filter(npd, window_size, 2)[half_window_size] 

        poly = np.polyfit(range(window_size - 1, -1, -1), data[-window_size:], deg=2)
        p = np.poly1d(poly)
        for poly_i_end in range(half_window_size):
            res[(window_size-1)-poly_i_end] = p(poly_i_end)

    elif length > window_size:
        res.append(None) # add another slot in the res-list
        # overwrite poly-value with savgol
        res[(length-1)-half_window_size] = savgol_filter(npd, window_size, 2)[half_window_size] 

        # extrapolate again into the future
        poly = np.polyfit(range(window_size - 1, -1, -1), data[-window_size:], deg=2)
        p = np.poly1d(poly)
        for poly_i_end in range(half_window_size):
            res[-poly_i_end-1] = p(poly_i_end)
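
To check the final result against the offline filter, you can append the same comparison as in the first snippet (reusing data, res and window_size from the block above; this check is not part of the original edit):

npd = np.array(data)
res2 = savgol_filter(npd, window_size, 2)

np.set_printoptions(precision=2)
print('online res  ', np.array(res))
print('offline res ', res2)
print('error       ', (np.array(res) - res2).sum())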


Author by Prokhozhii

Updated on June 04, 2022

Comments

  • Prokhozhii
    Prokhozhii almost 2 years

    I'd like to filter online data with savgol_filter from the scipy.signal library. But when I tried to use it on online data (where new elements appear one by one), I realized that savgol_filter produces its results with a delay of window_length // 2 compared to offline data (where all elements are available for the calculation at once). I use code similar to the following:

    from queue import Queue, Empty
    import numpy as np
    from scipy.signal import savgol_filter
    
    window_size = 5
    data = list()
    q = Queue()
    d = [2.22, 2.22, 5.55, 2.22, 1.11, 0.01, 1.11, 4.44, 9.99, 1.11, 3.33]
    for i in d:
        q.put(i)
    
    res = list()
    while not q.empty():
        element = q.get()
        data.append(element)
        length = len(data)
        npd = np.array(data[length - window_size:])
        if length >= window_size:
            res.append(savgol_filter(npd , window_size, 2)[window_size // 2])
    
    npd = np.array(data)
    res2 = savgol_filter(npd , window_size, 2)
    
    np.set_printoptions(precision=2)
    print('source data ', npd)
    print('online res  ', np.array(res))
    print('offline res ', res2)
    
    1. Am I right in my assumption? Can it be corrected somehow?
    2. If I am right, could you please recommend a similar filter that does not have this issue?
  • Prokhozhii
    Prokhozhii over 6 years
    One more question. We are talking about online data and online filtering as well. So the script has to know in advance that the first five elements of the whole online data sequence have arrived, and only then can it calculate the polynomial part for them. Correct? I mean, the filter doing the original calculation is informed about all available elements, but an online filter does not have that information. It does not know whether it received the first portion of the data or the last one, and because of this some delay in the calculations is possible.
  • offeltoffel
    offeltoffel over 6 years
    I had to assume that there is a distinct start and end for your dataset. You can smooth your data "on the run" (i.e. while new data appears). The first elements outside your moving window are no problem: when length == window_size you calculate the polynomial fit and insert them into your res list. But for the finish you will have to wait until no more data comes in. For a window size of 5 that means that your filter is always window_size // 2 = 2 data points behind the incoming online data, because it needs to "look into the future".
  • Prokhozhii
    Prokhozhii over 6 years
    Well, that was the reason for my original question: how to avoid this? In the real world I have a much bigger window size, and such delays in the filtering are critical. Maybe there is a similar filter without this issue.
  • offeltoffel
    offeltoffel over 6 years
    There are indeed different kinds of filters. The example from your question has a moving window that is centered (it looks into the past and into the future). You can also use one-sided filters that only look into the past and smooth according to the past 5 data points (see the sketch after these comments). That will, however, cause a bias. Imagine the stock exchange as an example: centered filters will erase peaks and valleys from the data, as they look at what happened before and afterwards. If you only look into the past, then your current data point is being corrected with regard to the past couple of hours/days/...
  • offeltoffel
    offeltoffel over 6 years
    A common practice is to just copy the central value and "predict" it into the future (mode='nearest'). Or you calculate your polynomial fit from the last known data and then extrapolate into the future. In both cases you have to deal with the fact that you are "creating" information which is not there.
  • offeltoffel
    offeltoffel over 6 years
    Please find my updated answer which should (hopefully) give you a working code for your problem.
  • Prokhozhii
    Prokhozhii over 6 years
    I'm sorry. Just checked you second edit of the code again. The value for the 5th element changes several times during calculations. It is unacceptable for online data processing. We can't get back and see what was before.
  • dennis-w
    dennis-w almost 6 years
    For an accepted answer, the text should contain a reason why Kalman filters are the solution and probably some implementation notes.
  • Prokhozhii
    Prokhozhii almost 6 years
    Corrected the answer
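
Below is a minimal sketch of the one-sided approach mentioned in the comments above, using scipy.signal.savgol_coeffs with pos pointing at the newest sample, so the fitted polynomial is evaluated at the current data point (no delay, but biased toward the past). The variable names stream, causal_coeffs and smoothed are illustrative, not taken from the answers above.

import numpy as np
from scipy.signal import savgol_coeffs

window_size = 5
# pos=window_size - 1 evaluates the fitted polynomial at the most recent sample;
# use='dot' orders the coefficients for a direct dot product with the window
causal_coeffs = savgol_coeffs(window_size, 2, pos=window_size - 1, use='dot')

stream = [2.22, 2.22, 5.55, 2.22, 1.11, 0.01, 1.11, 4.44, 9.99, 1.11, 3.33]
smoothed = []
for i in range(window_size - 1, len(stream)):
    window = np.array(stream[i - window_size + 1:i + 1])
    smoothed.append(causal_coeffs.dot(window))  # smoothed value for the newest sample

print(np.array(smoothed))

The first window_size - 1 samples get no output here; they could be handled with a polynomial fit as in Solution 2, or with mode='nearest' as suggested in the comments.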