Mahalanobis distance in R

23,028

Solution 1

How about using the mahalanobis function in the stats package:

 mahalanobis(x, center, cov, inverted = FALSE, ...)

Solution 2

I've been trying this out from the same website that you looked at and then stumbled upon this question. I managed to get the script to work, But I get a different result.

#WORKING EXAMPLE
#MAHALANOBIS DIST OF TWO MATRICES

#define matrix
mat1<-matrix(data=c(2,2,6,7,4,6,5,4,2,1,2,5,5,3,7,4,3,6,5,3),nrow=10)
mat2<-matrix(data=c(6,7,8,5,5,5,4,7,6,4),nrow=5)
#center data
mat1.1<-scale(mat1,center=T,scale=F)
mat2.1<-scale(mat2,center=T,scale=F)
#cov matrix
mat1.2<-cov(mat1.1,method="pearson")
mat2.2<-cov(mat2.1,method="pearson")
n1<-nrow(mat1)
n2<-nrow(mat2)
n3<-n1+n2
#pooled matrix
mat3<-((n1/n3)*mat1.2) + ((n2/n3)*mat2.2)
#inverse pooled matrix
mat4<-solve(mat3)
#mean diff
mat5<-as.matrix((colMeans(mat1)-colMeans(mat2)))
#multiply
mat6<-t(mat5) %*% mat4
#multiply
sqrt(mat6 %*% mat5)

I think the function mahalanobis() is used to compute mahalanobis distances between individuals (rows) in one matrix. The function pairwise.mahalanobis() from package(HDMD) can compare two or more matrices and give mahalanobis distances between the matrices.

Solution 3

You can wrap the function stats::mahalanobis as bellow to output a mahalanobis distance matrix (pairwise mahalanobis distances):

# x - data frame
# cx - covariance matrix; if not provided, 
#      it will be estimated from the data
mah <- function(x, cx = NULL) {
  if(is.null(cx)) cx <- cov(x)
  out <- lapply(1:nrow(x), function(i) {
    mahalanobis(x = x, 
                center = do.call("c", x[i, ]),
                cov = cx)
  })
  return(as.dist(do.call("rbind", out)))
}

Then, you can cluster your data and plot it, for example:

# Dummy data
x <- data.frame(X = c(rnorm(10, 0), rnorm(10, 5)), 
                Y = c(rnorm(10, 0), rnorm(10, 7)), 
                Z = c(rnorm(10, 0), rnorm(10, 12)))
rownames(x) <- LETTERS[1:20]
plot(x, pch = LETTERS[1:20])

enter image description here

# Comute the mahalanobis distance matrix
d <- mah(x)
d

# Cluster and plot
hc <- hclust(d)
plot(hc)

enter image description here

Solution 4

Your output before taking the square root is :

inversepooledcov %*% t(meandiffmatrix) %*% meandiffmatrix
          [,1]        [,2]
x -0.004349227 -0.01304768
y  0.114529639  0.34358892

I think you can'take the square root of negative numbers number, so you have NAN for negative elements:

 sqrt(inversepooledcov %*% t(meandiffmatrix) %*% meandiffmatrix)
       [,1]      [,2]
x       NaN       NaN
y 0.3384223 0.5861646

Warning message:
In sqrt(inversepooledcov %*% t(meandiffmatrix) %*% meandiffmatrix) :
  NaNs produced

Solution 5

Mahalanobis distance is equivalent to (squared) Euclidean distance if the covariance matrix is identity. If you have covariance between your variables, you can make Mahalanobis and sq Euclidean equal by whitening the matrix first to remove the covariance. I.e., do:

#X is your matrix
if (!require("whitening")) install.packages("whitening")

X <- whitening::whiten(X) # default is ZCA (Mahalanobis) whitening
X_dist <- dist(X, diag = T, method = "euclidean")^2

You can confirm that this gives you the same distance matrix as the code that Davit provided in one of the previous answers.

Share:
23,028
Carson
Author by

Carson

Remote Sensing Specialist working as a contractor for the Remote Sensing Applications Center (USFS).

Updated on July 14, 2022

Comments

  • Carson
    Carson almost 2 years

    I have found the mahalanobis.dist function in package StatMatch (http://cran.r-project.org/web/packages/StatMatch/StatMatch.pdf) but it isn't doing exactly what I want. It seems to be calculating the mahalanobis distance from each observation in data.y to each observation in data.x

    I would like to calculate the mahalanobis distance of one observation in data.y to all observations in data.x. Basically calculate a mahalanobis distance of one point to a "cloud" of points if that makes sense. Kind of getting at the idea of the probability of an observation being part of another group of observations

    This person (http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html) seems to be doing this and I've tried to replicate his process in R but it is failing when I get to the bottom part of the equation:

    mahaldist = sqrt((inversepooledcov %*% t(meandiffmatrix)) %*% meandiffmatrix)
    

    All the code I am working with is here:

    a = rbind(c(2,2), c(2,5), c(6,5),c(7,3))
    
    colnames(a) = c('x', 'y')
    
    b = rbind(c(6,5),c(3,4))
    
    colnames(b) = c('x', 'y')
    
    acov = cov(a)
    bcov = cov(b)
    
    meandiff1 = mean(a[,1]) - mean(b[,1])
    
    meandiff2 = mean(a[,2]) - mean(b[,2])
    
    meandiffmatrix = rbind(c(meandiff1,meandiff2))
    
    totaldata = dim(a)[1] + dim(b)[1]
    
    pooledcov = (dim(a)[1]/totaldata * acov) + (dim(b)[1]/totaldata * bcov)
    
    inversepooledcov = solve(pooledcov)
    
    mahaldist = sqrt((inversepooledcov %*% t(meandiffmatrix)) %*% meandiffmatrix)
    
  • Carson
    Carson over 10 years
    I believe this function also calculates the mahalanobis distance from each observation in one matrix to each observation in another matrix. I could be wrong though
  • Carson
    Carson over 10 years
    So am doing something wrong here that I have negative numbers in my product right before I do sqrt()?
  • Mayou
    Mayou over 10 years
    This function computes the following: D^2 = (x - μ)' Σ^{-1} (x - μ). Therefore, it computes the distance of all points in X with respect to a common point μ. You could set μ to be the one observation in your data.y.
  • Carson
    Carson over 10 years
    I think I'm getting there, but I must still be doing something wrong. This is the code that I tried to replicate what you are saying. In the example below, "x" would be my cloud of points and "center" is the one observation I want to measure distance to. You can see that it still give distance to each observation in "x" instead of one distance.
  • Carson
    Carson over 10 years
    > x = rbind(c(2,2), c(2,5), c(6,5),c(7,3)) > center = rbind(c(6,5)) > s = cov(x) > mahal = mahalanobis(x,center,s) > mahal [1] 5.734657 2.339350 0.000000 2.052347
  • Metrics
    Metrics over 10 years
    I was just pointing out the error you got. There are some important caveats that you should know before you use, for example, the data should follow normal distribution
  • Mayou
    Mayou over 10 years
    If you use the mean difference as your center, and your pooled covariance as your cov, I think this should work the way you want it too. Wouldn't it?
  • Carson
    Carson over 10 years
    Does the below code show a correct example of calculating the mahalanobis distance between matrix "ma" and point "point"?
  • Carson
    Carson over 10 years
    > ma = cbind(1:6, 1:3) > point = c(3,3) > S = cov(ma) > center = colMeans(ma) > mahalanobis(point, center, S) [1] 2.083333