Python pairwise comparison of elements in a array or list
Solution 1
Perhaps you want:
[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
This will not compare any item against itself, but compare each of the others against each other.
Solution 2
You could use NumPy broadcasting
-
# Get the mask of comparisons in a vectorized manner using broadcasting
mask = a[:,None] >= a
# Select the elements other than diagonal ones
out = mask[~np.eye(a.size,dtype=bool)]
If you rather prefer to set the diagonal elements as False
in mask
and then mask
would be the output, like so -
mask[np.eye(a.size,dtype=bool)] = 0
Sample run -
In [56]: a
Out[56]: array([3, 7, 5, 8])
In [57]: mask = a[:,None] >= a
In [58]: mask
Out[58]:
array([[ True, False, False, False],
[ True, True, True, False],
[ True, False, True, False],
[ True, True, True, True]], dtype=bool)
In [59]: mask[~np.eye(a.size,dtype=bool)] # Selecting non-diag elems
Out[59]:
array([False, False, False, True, True, False, True, False, False,
True, True, True], dtype=bool)
In [60]: mask[np.eye(a.size,dtype=bool)] = 0 # Setting diag elems as False
In [61]: mask
Out[61]:
array([[False, False, False, False],
[ True, False, True, False],
[ True, False, False, False],
[ True, True, True, False]], dtype=bool)
Runtime test
Reasons to use NumPy broadcasting
? Performance! Let's see how with a large dataset -
In [34]: def pairwise_comp(A): # Using NumPy broadcasting
...: a = np.asarray(A) # Convert to array if not already so
...: mask = a[:,None] >= a
...: out = mask[~np.eye(a.size,dtype=bool)]
...: return out
...:
In [35]: a = np.random.randint(0,9,(1000)).tolist() # Input list
In [36]: %timeit [x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
1 loop, best of 3: 185 ms per loop # @Sixhobbits's loopy soln
In [37]: %timeit pairwise_comp(a)
100 loops, best of 3: 5.76 ms per loop
Solution 3
I'd like to apply @Divakar's solution to pandas objects. Here are two approaches for calculating pairwise absolute differences.
(IPython 6.1.0 on Python 3.6.2)
In [1]: import pandas as pd
...: import numpy as np
...: import itertools
In [2]: n = 256
...: labels = range(n)
...: ser = pd.Series(np.random.randn(n), index=labels)
...: ser.head()
Out[2]:
0 1.592248
1 -1.168560
2 -1.243902
3 -0.133140
4 -0.714133
dtype: float64
Loops
In [3]: %%time
...: result = dict()
...: for pair in itertools.combinations(labels, 2):
...: a, b = pair
...: a = ser[a] # retrieve values
...: b = ser[b]
...: result[pair] = a - b
...: result = pd.Series(result).abs().reset_index()
...: result.columns = list('ABC')
...: df1 = result.pivot('A', 'B, 'C').reindex(index=labels, columns=labels)
...: df1 = df1.fillna(df1.T).fillna(0.)
CPU times: user 18.2 s, sys: 468 ms, total: 18.7 s
Wall time: 18.7 s
NumPy broadcast
In [4]: %%time
...: arr = ser.values
...: arr = arr[:, None] - arr
...: df2 = pd.DataFrame(arr, labels, labels).abs()
CPU times: user 816 µs, sys: 432 µs, total: 1.25 ms
Wall time: 675 µs
Verify they're equal:
In [5]: df1.equals(df2)
Out[5]: True
Using loops is about 20000 times slower than the clever NumPy approach. NumPy has many optimizations, but sometimes they need a different way of thinking. :-)
Solution 4
You may achieve that by using:
[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
Issue with your code:
You are iterating in over list twice. If you convert your comprehension
to loop
, it will work like:
for x in a:
for y in a:
x>=y # which is your condition
Hence, it the the order of execution is as: (a1, a1), (a1, a2), ... , (a2, a1), (a2, a2), ... , (a4, a4)
Solution 5
Why are you worried about the a1>=a1
comparison. It may be pridictable, but skipping it might not be worth the extra work.
Make a list of 100 numbers
In [17]: a=list(range(100))
Compare them with the simple double loop; producing a 10000 values (100*100)
In [18]: len([x>=y for x in a for y in a])
Out[18]: 10000
In [19]: timeit [x>=y for x in a for y in a]
1000 loops, best of 3: 1.04 ms per loop
Now use @Moinuddin Quadri's
enumerated loop to skip the 100 eye
values:
In [20]: len([x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j])
Out[20]: 9900
In [21]: timeit [x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j]
100 loops, best of 3: 2.12 ms per loop
It takes 2x longer. Half the extra time is the enumerates, and half the if
.
In this case working with numpy arrays is much faster, even when including the time to create the array.
xa = np.array(x); Z = xa[:,None]>=xa
But you can't get rid of the the diagonal values. They will the True
; they can be flipped to False
, but why. In a boolean array there are only 2 values.
The fastest solution is to write an indicator function that isn't bothered by these diagonal values.
Sue
Updated on June 15, 2022Comments
-
Sue almost 2 years
Let me elaborate my question using a simple example.I have a=[a1,a2,a3,a4], with all ai being a numerical value.
What I want to get is pairwise comparisons within 'a', such as I(a1>=a2), I(a1>=a3), I(a1>=a4), ,,,,I(a4>=a1), I(a4>=a2), I(a4>=a3), where I is a indicator function. So I used the following code.
res=[x>=y for x in a for y in a]
But it also gives the comparison results like I(a1>=a1),..,I(a4>=a4), which is always one. To get rid of these nuisance, I convert res into a numpy array and find the off diagonal elements.
res1=numpy.array(res)
This gives the result what I want, but I think there should be more efficient or simpler way to do pairwise comparison and extract the off diagonal element. Do you have any idea about this? Thanks in advance.
-
Jean-François Fabre over 7 yearswhat about
[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i > j]
-
TemporalWolf over 7 years@Jean-FrançoisFabre OP wants the reverse results as well:
I(a1>=a4)
andI(a4>=a1)
-
Sue over 7 yearsYes. I need both.
-
Sue over 7 yearsLooks great! appreciate it.