Optimal way to compute pairwise mutual information using numpy
I can't suggest a faster calculation for the outer loop over the n*(n-1)/2
vectors, but your implementation of calc_MI(x, y, bins)
can be simplified
if you can use scipy version 0.13 or scikit-learn.
In scipy 0.13, the lambda_
argument was added to scipy.stats.chi2_contingency
This argument controls the statistic that is computed by the function. If
you use lambda_="log-likelihood"
(or lambda_=0
), the log-likelihood ratio
is returned. This is also often called the G or G2 statistic. Other than
a factor of 2*n (where n is the total number of samples in the contingency
table), this is the mutual information. So you could implement calc_MI
as:
from scipy.stats import chi2_contingency
def calc_MI(x, y, bins):
c_xy = np.histogram2d(x, y, bins)[0]
g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
mi = 0.5 * g / c_xy.sum()
return mi
The only difference between this and your implementation is that this
implementation uses the natural logarithm instead of the base-2 logarithm
(so it is expressing the information in "nats" instead of "bits"). If
you really prefer bits, just divide mi
by log(2).
If you have (or can install) sklearn
(i.e. scikit-learn), you can use
sklearn.metrics.mutual_info_score
, and implement calc_MI
as:
from sklearn.metrics import mutual_info_score
def calc_MI(x, y, bins):
c_xy = np.histogram2d(x, y, bins)[0]
mi = mutual_info_score(None, None, contingency=c_xy)
return mi
nahsivar
Updated on September 10, 2021Comments
-
nahsivar over 2 years
For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?
By mutual information, I mean:
I(X, Y) = H(X) + H(Y) - H(X,Y)
where H(X) refers to the Shannon entropy of X.
Currently I'm using
np.histogram2d
andnp.histogram
to calculate the joint (X,Y) and individual (X or Y) counts. For a given matrixA
(e.g. a 250000 X 1000 matrix of floats), I am doing a nestedfor
loop,n = A.shape[1] for ix = arange(n) for jx = arange(ix+1,n): matMI[ix,jx]= calc_MI(A[:,ix],A[:,jx])
Surely there must be better/faster ways to do this?
As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
Here is my full implementation, following the conventions in the Wiki page:
import numpy as np def calc_MI(X,Y,bins): c_XY = np.histogram2d(X,Y,bins)[0] c_X = np.histogram(X,bins)[0] c_Y = np.histogram(Y,bins)[0] H_X = shan_entropy(c_X) H_Y = shan_entropy(c_Y) H_XY = shan_entropy(c_XY) MI = H_X + H_Y - H_XY return MI def shan_entropy(c): c_normalized = c / float(np.sum(c)) c_normalized = c_normalized[np.nonzero(c_normalized)] H = -sum(c_normalized* np.log2(c_normalized)) return H A = np.array([[ 2.0, 140.0, 128.23, -150.5, -5.4 ], [ 2.4, 153.11, 130.34, -130.1, -9.5 ], [ 1.2, 156.9, 120.11, -110.45,-1.12 ]]) bins = 5 # ? n = A.shape[1] matMI = np.zeros((n, n)) for ix in np.arange(n): for jx in np.arange(ix+1,n): matMI[ix,jx] = calc_MI(A[:,ix], A[:,jx], bins)
Although my working version with nested
for
loops does it at reasonable speed, I'd like to know if there is a more optimal way to applycalc_MI
on all the columns ofA
(to calculate their pairwise mutual information)?I'd also like to know:
Whether there are efficient ways to map functions to operate on columns (or rows) of
np.arrays
(maybe likenp.vectorize
, which looks more like a decorator)?Whether there are other optimal implementations for this specific calculation (mutual information)?
-
pir over 8 yearsNice code! What is a reasonable default value for number of bins?
-
Warren Weckesser over 8 years@felbo That's a good question, and not one that can be easily answered. You might get some ideas if you ask it over at stats.stackexchange.com
-
Jon over 6 yearsThis method does not work if some counts equal zero. Why do you suggest this method over density estimation? Also, I upvoted your answer as it does provide an efficient way to calculate MI for some scenarios.
-
Warren Weckesser over 6 years"Why do you suggest this method over density estimation?" I didn't. I only suggested a couple alternative implementations of the code given in the question.
-
shouldsee about 6 yearsThe two suggested methods differs by continuity correction. Change to
chi2_contingency(correction = False)
removes this inconsistency. -
Josh.F over 5 yearsBeware of discretization (ie the bin sizes). This blog recommends the "Jack Knifed Estimate" to overcome this, or I would also expect density estimator techniques to help.