How can I use numpy.correlate to do autocorrelation?

229,098

Solution 1

To answer your first question, numpy.correlate(a, v, mode) is performing the convolution of a with the reverse of v and giving the results clipped by the specified mode. The definition of convolution, C(t)=∑ -∞ < i < ∞ aivt+i where -∞ < t < ∞, allows for results from -∞ to ∞, but you obviously can't store an infinitely long array. So it has to be clipped, and that is where the mode comes in. There are 3 different modes: full, same, & valid:

  • "full" mode returns results for every t where both a and v have some overlap.
  • "same" mode returns a result with the same length as the shortest vector (a or v).
  • "valid" mode returns results only when a and v completely overlap each other. The documentation for numpy.convolve gives more detail on the modes.

For your second question, I think numpy.correlate is giving you the autocorrelation, it is just giving you a little more as well. The autocorrelation is used to find how similar a signal, or function, is to itself at a certain time difference. At a time difference of 0, the auto-correlation should be the highest because the signal is identical to itself, so you expected that the first element in the autocorrelation result array would be the greatest. However, the correlation is not starting at a time difference of 0. It starts at a negative time difference, closes to 0, and then goes positive. That is, you were expecting:

autocorrelation(a) = ∑ -∞ < i < ∞ aivt+i where 0 <= t < ∞

But what you got was:

autocorrelation(a) = ∑ -∞ < i < ∞ aivt+i where -∞ < t < ∞

What you need to do is take the last half of your correlation result, and that should be the autocorrelation you are looking for. A simple python function to do that would be:

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size/2:]

You will, of course, need error checking to make sure that x is actually a 1-d array. Also, this explanation probably isn't the most mathematically rigorous. I've been throwing around infinities because the definition of convolution uses them, but that doesn't necessarily apply for autocorrelation. So, the theoretical portion of this explanation may be slightly wonky, but hopefully the practical results are helpful. These pages on autocorrelation are pretty helpful, and can give you a much better theoretical background if you don't mind wading through the notation and heavy concepts.

Solution 2

Auto-correlation comes in two versions: statistical and convolution. They both do the same, except for a little detail: The statistical version is normalized to be on the interval [-1,1]. Here is an example of how you do the statistical one:

def acf(x, length=20):
    return numpy.array([1]+[numpy.corrcoef(x[:-i], x[i:])[0,1]  \
        for i in range(1, length)])

Solution 3

I think there are 2 things that add confusion to this topic:

  1. statistical v.s. signal processing definition: as others have pointed out, in statistics we normalize auto-correlation into [-1,1].
  2. partial v.s. non-partial mean/variance: when the timeseries shifts at a lag>0, their overlap size will always < original length. Do we use the mean and std of the original (non-partial), or always compute a new mean and std using the ever changing overlap (partial) makes a difference. (There's probably a formal term for this, but I'm gonna use "partial" for now).

I've created 5 functions that compute auto-correlation of a 1d array, with partial v.s. non-partial distinctions. Some use formula from statistics, some use correlate in the signal processing sense, which can also be done via FFT. But all results are auto-correlations in the statistics definition, so they illustrate how they are linked to each other. Code below:

import numpy
import matplotlib.pyplot as plt

def autocorr1(x,lags):
    '''numpy.corrcoef, partial'''

    corr=[1. if l==0 else numpy.corrcoef(x[l:],x[:-l])[0][1] for l in lags]
    return numpy.array(corr)

def autocorr2(x,lags):
    '''manualy compute, non partial'''

    mean=numpy.mean(x)
    var=numpy.var(x)
    xp=x-mean
    corr=[1. if l==0 else numpy.sum(xp[l:]*xp[:-l])/len(x)/var for l in lags]

    return numpy.array(corr)

def autocorr3(x,lags):
    '''fft, pad 0s, non partial'''

    n=len(x)
    # pad 0s to 2n-1
    ext_size=2*n-1
    # nearest power of 2
    fsize=2**numpy.ceil(numpy.log2(ext_size)).astype('int')

    xp=x-numpy.mean(x)
    var=numpy.var(x)

    # do fft and ifft
    cf=numpy.fft.fft(xp,fsize)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real
    corr=corr/var/n

    return corr[:len(lags)]

def autocorr4(x,lags):
    '''fft, don't pad 0s, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean

    cf=numpy.fft.fft(xp)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real/var/len(x)

    return corr[:len(lags)]

def autocorr5(x,lags):
    '''numpy.correlate, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean
    corr=numpy.correlate(xp,xp,'full')[len(x)-1:]/var/len(x)

    return corr[:len(lags)]


if __name__=='__main__':

    y=[28,28,26,19,16,24,26,24,24,29,29,27,31,26,38,23,13,14,28,19,19,\
            17,22,2,4,5,7,8,14,14,23]
    y=numpy.array(y).astype('float')

    lags=range(15)
    fig,ax=plt.subplots()

    for funcii, labelii in zip([autocorr1, autocorr2, autocorr3, autocorr4,
        autocorr5], ['np.corrcoef, partial', 'manual, non-partial',
            'fft, pad 0s, non-partial', 'fft, no padding, non-partial',
            'np.correlate, non-partial']):

        cii=funcii(y,lags)
        print(labelii)
        print(cii)
        ax.plot(lags,cii,label=labelii)

    ax.set_xlabel('lag')
    ax.set_ylabel('correlation coefficient')
    ax.legend()
    plt.show()

Here is the output figure:

enter image description here

We don't see all 5 lines because 3 of them overlap (at the purple). The overlaps are all non-partial auto-correlations. This is because computations from the signal processing methods (np.correlate, FFT) don't compute a different mean/std for each overlap.

Also note that the fft, no padding, non-partial (red line) result is different, because it didn't pad the timeseries with 0s before doing FFT, so it's circular FFT. I can't explain in detail why, that's what I learned from elsewhere.

Solution 4

Use the numpy.corrcoef function instead of numpy.correlate to calculate the statistical correlation for a lag of t:

def autocorr(x, t=1):
    return numpy.corrcoef(numpy.array([x[:-t], x[t:]]))

Solution 5

As I just ran into the same problem, I would like to share a few lines of code with you. In fact there are several rather similar posts about autocorrelation in stackoverflow by now. If you define the autocorrelation as a(x, L) = sum(k=0,N-L-1)((xk-xbar)*(x(k+L)-xbar))/sum(k=0,N-1)((xk-xbar)**2) [this is the definition given in IDL's a_correlate function and it agrees with what I see in answer 2 of question #12269834], then the following seems to give the correct results:

import numpy as np
import matplotlib.pyplot as plt

# generate some data
x = np.arange(0.,6.12,0.01)
y = np.sin(x)
# y = np.random.uniform(size=300)
yunbiased = y-np.mean(y)
ynorm = np.sum(yunbiased**2)
acor = np.correlate(yunbiased, yunbiased, "same")/ynorm
# use only second half
acor = acor[len(acor)/2:]

plt.plot(acor)
plt.show()

As you see I have tested this with a sin curve and a uniform random distribution, and both results look like I would expect them. Note that I used mode="same" instead of mode="full" as the others did.

Share:
229,098
Ben
Author by

Ben

Updated on April 25, 2020

Comments

  • Ben
    Ben about 4 years

    I need to do auto-correlation of a set of numbers, which as I understand it is just the correlation of the set with itself.

    I've tried it using numpy's correlate function, but I don't believe the result, as it almost always gives a vector where the first number is not the largest, as it ought to be.

    So, this question is really two questions:

    1. What exactly is numpy.correlate doing?
    2. How can I use it (or something else) to do auto-correlation?
  • David Zwicker
    David Zwicker over 12 years
    In current builds of numpy, the mode 'same' can be specified to achieve exactly what the A. Levy proposed. The body of the function could then read return numpy.correlate(x, x, mode='same')
  • Developer
    Developer over 12 years
    @DavidZwicker but the resultings are different! np.correlate(x,x,mode='full')[len(x)//2:] != np.correlate(x,x,mode='same'). For example, x = [1,2,3,1,2]; np.correlate(x,x,mode='full'); {>>> array([ 2, 5, 11, 13, 19, 13, 11, 5, 2])} np.correlate(x,x,mode='same'); {>>> array([11, 13, 19, 13, 11])}. The correct one is: np.correlate(x,x,mode='full')[len(x)-1:]; {>>> array([19, 13, 11, 5, 2])} see the first item is the largest one.
  • luispedro
    luispedro about 11 years
    You want numpy.corrcoef[x:-i], x[i:])[0,1] in the second line as the return value of corrcoef is a 2x2 matrix
  • amcnabb
    amcnabb over 10 years
    Note that this answer gives the unnormalized autocorrelation.
  • Kuhlambo
    Kuhlambo over 6 years
    is it possible that there is something wrong with this? I can not match it's results with other auto correlation functions. The function looks similar but seems somewhat squished.
  • Ruggero
    Ruggero over 6 years
    @pindakaas could you be more specific? please provide information on what are the discrepancies you find, with which functions.
  • Daniel says Reinstate Monica
    Daniel says Reinstate Monica over 6 years
    Doesn't "correlation coefficients" refer to the autocorrelation used in signal processing and not the autocorrelation used to in statistics? en.wikipedia.org/wiki/Autocorrelation#Signal_processing
  • Ramón J Romero y Vigil
    Ramón J Romero y Vigil over 6 years
    @DanielPendergast I'm not as familiar with signal processing. From the numpy docs: "Return Pearson product-moment correlation coefficients.". Is that the signal processing version?
  • Daniel says Reinstate Monica
    Daniel says Reinstate Monica over 6 years
    What's the difference between the statistical and the convolutional autocorrelations?
  • dylnan
    dylnan over 6 years
    Why not use p = np.abs(f)?
  • Ruggero
    Ruggero over 6 years
    @dylnan That would give the modules of the components of f, whereas there we want here a vector containing the square modules of the components of f.
  • Ruggero
    Ruggero over 6 years
    @dylnan yes it would have worked but it would have unnecessarily calculated many square roots and then turned back to the square module.
  • n1k31t4
    n1k31t4 over 6 years
    @DanielPendergast: The second sentence answers that: They both do the same, except for a little detail: The former [statistical] is normalized to be on the interval [-1,1]
  • Jason
    Jason almost 6 years
    I think @Developer gives the correct slicing: [len(x)-1:] starts from the 0-lag. Because full mode gives a result size 2*len(x)-1, A.Levy's [result.size/2:] is the same as [len(x)-1:]. It's better to make it an int though, like [result.size//2:].
  • Jason
    Jason almost 6 years
    Yeah, but did you realize that doing list comprehension is probably even slower.
  • kevinkayaks
    kevinkayaks over 5 years
    I found it must be an int, at least in python 3.7
  • Qaswed
    Qaswed about 5 years
    Why not use autocorrelation_plot() in this case? (cf. stats.stackexchange.com/questions/357300/…)
  • Feodoran
    Feodoran over 4 years
    Your result is (roughly) equivalent to numpy.correlate(x, x, mode='same'), but what about the full function `numpy.correlate(x, x, mode='full')? Can we get this using FFT as well?
  • Feodoran
    Feodoran over 4 years
    BTW: np.abs(f)**2 is way faster then the list comprehension. In principle, avoiding the unnecessary squares by using f.real**2 + f.imag**2 is even faster, but the effect is negligible here because FFT is of course slower than some basic muliplication/addition.
  • Newskooler
    Newskooler about 4 years
    Why is it good to demean and work with unbiased data?
  • Sid
    Sid over 3 years
    @DanielsaysReinstateMonica Statistical Auto correlation means you can establish a statistical relation of the time series at a point of time to previous covariates, for example parametric ones like regression: $E(y_t|y_{t-k}) = fn(y_{t-k}, \beta)$ and $Var(y_t|y_{t-k}) = f(...) $ Since Expectation and Variance are inherently statistical quantities, thence the name. Convolution AC is typically used for Signal Processing for say smoothing/filtering, for example, by using a convolving(sliding) window, say Sin wave, over your signal & literally multiplying and summing it at each point as you go.