Calculating Pearson correlation

28,778

Solution 1

Use scipy:

scipy.stats.pearsonr(x, y)

Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, it is the calculation of the p-value, not of the coefficient itself, that assumes each dataset is normally distributed. Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters :

x : 1D array

y : 1D array the same length as x

Returns :

(Pearson’s correlation coefficient, 2-tailed p-value)
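As a minimal sketch of the call described above, the snippet below passes two equal-length 1D arrays to `scipy.stats.pearsonr` and unpacks the coefficient and the p-value. The data values and the variable names `distances` and `postal_code_counts` are made up for illustration; they are not from the original question.

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumption): the upper bound of each distance range in km,
# and the number of postal codes counted in that range.
distances = np.array([50, 100, 500, 1000], dtype=float)
postal_code_counts = np.array([12, 30, 140, 310], dtype=float)

# pearsonr pairs distances[i] with postal_code_counts[i], so both arrays
# must have the same length. The result unpacks into (coefficient, p-value).
r, p_value = stats.pearsonr(distances, postal_code_counts)

print(f"Pearson r = {r:.3f}, two-tailed p-value = {p_value:.3f}")
```

With only four points the p-value is of limited use, in line with the caveat above about small datasets.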

Solution 2

You can also use numpy:

numpy.corrcoef(x, y)

which would give you a correlation matrix that looks like:

[[1                  correlation(x, y)]
 [correlation(y, x)  1                ]]
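A short sketch of how the coefficient can be read out of that matrix, using the same kind of made-up data (the values of `x` and `y` are assumptions for illustration). It also recomputes the coefficient from the textbook definition to show that the means are subtracted as part of the formula, so the inputs do not need to be centered beforehand.

```python
import numpy as np

# Hypothetical data (assumption): distance bounds and postal code counts.
x = np.array([50, 100, 500, 1000], dtype=float)
y = np.array([12, 30, 140, 310], dtype=float)

# corrcoef returns a 2x2 matrix; the off-diagonal entries hold the
# correlation between x and y, and the diagonal entries are 1.
corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]

# Same value from the definition: covariance of the centered data
# divided by the product of the standard deviations.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Adding a constant to either array leaves the coefficient unchanged.
r_shifted = np.corrcoef(x + 1000.0, y - 5.0)[0, 1]

print(r, r_manual, r_shifted)
```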
user94628
Author by

user94628

Updated on October 14, 2021

Comments

  • user94628
    user94628 over 2 years

    I'm trying to calculate the Pearson correlation coefficient of two variables. The goal is to determine whether there is a relationship between the number of postal codes and a range of distances. So I want to see if the number of postal codes increases or decreases as the distance range changes.

    I'll have one list which will count the number of postal codes within each distance range, and the other list will have the actual ranges.

    Is it ok to have a list that contains ranges of distances? Or would it be better to have a list like this [50, 100, 500, 1000], where each element represents a range up to that amount? For example, the list would represent up to 50km, then from 50km to 100km, and so on.

  • user94628
    user94628 over 11 years
    Ok, so what matters more is that both the x and y arrays are of the same length. Then you are comparing element x[i] with element y[i]?
  • lucasg
    lucasg over 11 years
    Yep. In your case, x should be equal to the distances considered, and y[i] should return the number of postal codes at distances[i]. To see the actual computation for the Pearson coefficient: stackoverflow.com/questions/3949226/…
  • user94628
    user94628 over 11 years
    Cool, so x[i] could mean up to that distance?
  • Antimony
    Antimony over 8 years
    Yes, x[i] could mean up to that distance. If all the distances are computed from a particular starting point, then x[i] is just the area within that distance, and the corresponding y[i] would be how many postal codes are covered in that area.
  • DollarAkshay
    DollarAkshay almost 6 years
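    Make sure that the arrays x and y have a mean of 0 if you compute the coefficient by hand with a formula that does not subtract the means. Otherwise you will get an incorrect value. scipy.stats.pearsonr and numpy.corrcoef subtract the means themselves.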
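The binning discussed in the comments can be sketched as follows, assuming the raw input is one distance per postal code measured from a single starting point. The distances, the bin edges, and all variable names are hypothetical; `numpy.histogram` is used here only as one convenient way to count per range, not as the method the commenters had in mind.

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumption): distance in km from a starting point
# to each postal code.
postal_code_distances = np.array(
    [5, 20, 45, 60, 75, 90, 120, 300, 450, 480, 700, 950], dtype=float
)

# Range boundaries: 0-50km, 50-100km, 100-500km, 500-1000km.
bin_edges = np.array([0, 50, 100, 500, 1000], dtype=float)

# counts[i] is the number of postal codes that fall into range i.
counts, _ = np.histogram(postal_code_distances, bins=bin_edges)

# x[i] is the upper bound of range i, y[i] the count for that range,
# so both arrays have the same length and line up element by element.
upper_bounds = bin_edges[1:]
r, p_value = stats.pearsonr(upper_bounds, counts)

print(dict(zip(upper_bounds.tolist(), counts.tolist())))
print(f"Pearson r = {r:.3f}")
```

Note that ranges of unequal width, as in this list, affect the counts simply because wider ranges cover more ground, which is worth keeping in mind when interpreting the coefficient.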