Remove outliers from correlation coefficient calculation


Solution 1

If you really want to do this (i.e. remove the observations with the largest absolute residuals), you can fit a linear model to get the least-squares solution and its residuals, and then select the middle n% of the data. Here is an example:

Firstly, generate some dummy data:

require(MASS) ## for mvrnorm()
set.seed(1)
## Sigma must be symmetric; this one implies a population correlation of
## 0.8 / sqrt(0.8), roughly 0.894
dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,0.8,0.8), ncol = 2))
dat <- data.frame(dat)
names(dat) <- c("X","Y")
plot(dat)

Next, we fit the linear model and extract the residuals:

res <- resid(mod <- lm(Y ~ X, data = dat))

The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the upper and lower 0.05 quantiles:

res.qt <- quantile(res, probs = c(0.05,0.95))

Select those observations with residuals in the middle 90% of the data:

want <- which(res >= res.qt[1] & res <= res.qt[2])

We can then visualise this, with the red points being those we will retain:

plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

[Plot of the dummy data: the retained points (middle 90% of residuals) in red, the discarded points in black, and the fitted line in blue]

The correlations for the full data and the selected subset are:

> cor(dat)
          X         Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
          X         Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
         X        Y
X 1.000000 0.739972
Y 0.739972 1.000000

Be aware that we might be throwing out perfectly good data here, because we simply discard the 5% of observations with the largest positive residuals and the 5% with the largest negative ones. An alternative is to retain the 90% with the smallest absolute residuals:

ares <- abs(res)
absres.qt <- quantile(ares, prob = c(.9))
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

With this slightly different subset, the correlation is slightly lower:

> cor(dat[abswant,])
          X         Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000

Another point is that even then we are throwing out good data. You might instead look at Cook's distance as a measure of each observation's influence, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has information on Cook's distance and proposed thresholds. The cooks.distance() function retrieves the values from mod:

> head(cooks.distance(mod))
           1            2            3            4            5            6 
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03

You can then compute the threshold(s) suggested on Wikipedia and remove only those observations that exceed them. For these data, none of the Cook's distances exceed the D_i > 1 rule of thumb (not surprising, given the way I generated the data):

> any(cooks.distance(mod) > 1)
[1] FALSE

A common stricter rule of thumb is D_i > 4/n.
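A minimal sketch of that pruning step, assuming you settle on the 4/n cutoff:

## drop observations whose Cook's distance exceeds 4/n,
## then recompute the correlation on the remaining data
cd <- cooks.distance(mod)
keep <- cd <= 4 / nrow(dat)
cor(dat[keep, ])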

Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.

Solution 2

Using method = "spearman" in cor() is robust to this kind of contamination and is easy to implement, since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

Repeating Prasad's analysis but using the Spearman correlation instead, we find that it is indeed robust to the contamination here, recovering the underlying zero correlation:

set.seed(1)

# x and y are uncorrelated
x <- rnorm(1000)
y <- rnorm(1000)
cor(x,y)
## [1] 0.006401211

# add contamination -- now cor says they are highly correlated
x <- c(x, 500)
y <- c(y, 500)
cor(x, y)
## [1] 0.995741

# but with method = "spearman" contamination is removed & they are shown to be uncorrelated
cor(x, y, method = "spearman")
## [1] -0.007270813

Solution 3

This may have been already obvious to the OP, but just to make sure... You have to be careful, because trying to maximize the correlation can actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove the outliers, then calculate the correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).

Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

> set.seed(1)
> x <- rnorm(1000)
> y <- rnorm(1000)
> cor(x, y)
[1] 0.006401211

Now let's add an outlier point (500,500):

x <- c(x, 500)
y <- c(y, 500)

Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

> cor(x,y)
[1] 0.995741
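For comparison, dropping that single appended point recovers the original uncorrelated vectors, and with them the near-zero estimate from above:

## head(v, -1) drops the last element, i.e. the (500, 500) contamination
cor(head(x, -1), head(y, -1))
## [1] 0.006401211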

If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

> require(robust)
> covRob(cbind(x, y), corr = TRUE)
Call:
covRob(data = cbind(x, y), corr = TRUE)

Robust Estimate of Correlation: 
            x           y
x  1.00000000 -0.02594260
y -0.02594260  1.00000000

You can play around with the parameters of covRob() to decide how to trim the data. UPDATE: there is also rlm() (robust linear regression) in the MASS package.
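A quick sketch of the rlm() idea: its final M-estimation weights (the w component of the fit) show which observations the robust fit effectively ignores, and those can be dropped before computing the correlation. The 0.5 cutoff below is an arbitrary illustration, not a recommendation:

library(MASS)
## rlm() down-weights influential observations during IWLS;
## points with very small final weights are the suspected outliers
rfit <- rlm(y ~ x)
keep <- rfit$w >= 0.5   # arbitrary illustrative cutoff on the weights
cor(x[keep], y[keep])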

Solution 4

Here's another possibility, with the outliers captured, using a scheme similar to Prasad's:

library(mvoutlier)
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
xy <- cbind(x, y)
outliers <- aq.plot(xy, alpha = 0.975)  # the documented default is alpha = 0.025; I think the function wants 0.975
corr.plot(x, y)
color.plot(xy)
dd.plot(xy)
uni.plot(xy)

In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem on your machine, so I dropped it down to 4 to avoid that:

x1 <- c(x, 4)
y1 <- c(y, 4)
xy1 <- cbind(x1, y1)
outliers1 <- aq.plot(xy1, alpha = 0.975)  # again alpha = 0.975 rather than the documented default of 0.025
corr.plot(x1, y1)
color.plot(xy1)
dd.plot(xy1)
uni.plot(xy1)
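As a follow-up sketch, assuming aq.plot()'s documented return value (a list with a logical outliers element flagging the suspect rows), you can drop the flagged rows before computing the correlation:

## keep only the rows aq.plot() did not flag as outliers
clean <- xy1[!outliers1$outliers, ]
cor(clean[, 1], clean[, 2])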

[Images: the diagnostic plots produced by the calls above for the x1, y1, xy1 data]

Solution 5

You might try bootstrapping your data to find the highest correlation coefficient, e.g.:

x <- cars$dist
y <- cars$speed
percent <- 0.9         # given in the question above
n <- 1000              # number of resamples
boot.cor <- replicate(n, {
  tmp <- sample(length(x), size = round(length(x) * percent))  # a random 90% subset of the indices
  cor(x[tmp], y[tmp])
})

Then run max(boot.cor). Do not be disappointed if the correlation coefficients all turn out to be much the same :)
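If you also want to know which subset attains that maximum, a small variation that keeps the sampled index sets (a sketch along the same lines as above) makes it recoverable:

## store each sampled index set so the best-scoring subset can be recovered
subsets <- replicate(n, sample(length(x), size = round(length(x) * percent)),
                     simplify = FALSE)
boot.cor <- sapply(subsets, function(idx) cor(x[idx], y[idx]))
best <- subsets[[which.max(boot.cor)]]   # indices of the highest-correlation subset
max(boot.cor)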


Comments

  • Leo, about 4 years ago:

    Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

    cor(x, y)

    How can I automatically consider only a subset of x and y in the calculation (say 90%) as to maximize the correlation coefficient?

  • Leo, over 13 years ago:
    Thanks a lot for such an excellent answer! The reason I want to do this is the following. I'm benchmarking various methods for predicting experimental observations (changes in binding energy upon mutation of a protein complex) based on experimental structures of the complexes. The target values come from various sources with varying quality. And errors in the structures can severely impact the predictions. So I have several outliers, but looking at a "pruned" correlation for various methods will allow me to more easily select the method that works best for the favorable cases.
  • bill_080, over 13 years ago:
    I e-mailed the maintainer of mvoutlier about the problem I was having with alpha in the above aq.plot() statements. He has since fixed the problem and updated mvoutlier to version 1.6 (updated Jan 14, 2011): cran.r-project.org/web/packages/mvoutlier/index.html
  • cashoes, over 9 years ago:
    Spearman will be robust to some types of contamination, namely single high-value points that are perfectly correlated and so inflate the Pearson correlation. It won't be completely robust to contamination by outliers at the lower end of the scale, though.