Python pairwise comparison of elements in a array or list

19,458

Solution 1

Perhaps you want:

 [x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]

This will not compare any item against itself, but compare each of the others against each other.

Solution 2

You could use NumPy broadcasting -

# Get the mask of comparisons in a vectorized manner using broadcasting
mask = a[:,None] >= a

# Select the elements other than diagonal ones
out = mask[~np.eye(a.size,dtype=bool)]

If you rather prefer to set the diagonal elements as False in mask and then mask would be the output, like so -

mask[np.eye(a.size,dtype=bool)] = 0

Sample run -

In [56]: a
Out[56]: array([3, 7, 5, 8])

In [57]: mask = a[:,None] >= a

In [58]: mask
Out[58]: 
array([[ True, False, False, False],
       [ True,  True,  True, False],
       [ True, False,  True, False],
       [ True,  True,  True,  True]], dtype=bool)

In [59]: mask[~np.eye(a.size,dtype=bool)] # Selecting non-diag elems
Out[59]: 
array([False, False, False,  True,  True, False,  True, False, False,
        True,  True,  True], dtype=bool)

In [60]: mask[np.eye(a.size,dtype=bool)] = 0 # Setting diag elems as False

In [61]: mask
Out[61]: 
array([[False, False, False, False],
       [ True, False,  True, False],
       [ True, False, False, False],
       [ True,  True,  True, False]], dtype=bool)

Runtime test

Reasons to use NumPy broadcasting? Performance! Let's see how with a large dataset -

In [34]: def pairwise_comp(A): # Using NumPy broadcasting    
    ...:     a = np.asarray(A) # Convert to array if not already so
    ...:     mask = a[:,None] >= a
    ...:     out = mask[~np.eye(a.size,dtype=bool)]
    ...:     return out
    ...: 

In [35]: a = np.random.randint(0,9,(1000)).tolist() # Input list

In [36]: %timeit [x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]
1 loop, best of 3: 185 ms per loop # @Sixhobbits's loopy soln

In [37]: %timeit pairwise_comp(a)
100 loops, best of 3: 5.76 ms per loop

Solution 3

I'd like to apply @Divakar's solution to pandas objects. Here are two approaches for calculating pairwise absolute differences.

(IPython 6.1.0 on Python 3.6.2)

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import itertools

In [2]: n = 256
   ...: labels = range(n)
   ...: ser = pd.Series(np.random.randn(n), index=labels)
   ...: ser.head()
Out[2]: 
0    1.592248
1   -1.168560
2   -1.243902
3   -0.133140
4   -0.714133
dtype: float64

Loops

In [3]: %%time
   ...: result = dict()
   ...: for pair in itertools.combinations(labels, 2):
   ...:     a, b = pair
   ...:     a = ser[a]  # retrieve values
   ...:     b = ser[b]
   ...:     result[pair] = a - b

   ...: result = pd.Series(result).abs().reset_index()
   ...: result.columns = list('ABC')
   ...: df1 = result.pivot('A', 'B, 'C').reindex(index=labels, columns=labels)
   ...: df1 = df1.fillna(df1.T).fillna(0.)
CPU times: user 18.2 s, sys: 468 ms, total: 18.7 s
Wall time: 18.7 s

NumPy broadcast

In [4]: %%time
   ...: arr = ser.values
   ...: arr = arr[:, None] - arr
   ...: df2 = pd.DataFrame(arr, labels, labels).abs()
CPU times: user 816 µs, sys: 432 µs, total: 1.25 ms
Wall time: 675 µs

Verify they're equal:

In [5]: df1.equals(df2)
Out[5]: True

Using loops is about 20000 times slower than the clever NumPy approach. NumPy has many optimizations, but sometimes they need a different way of thinking. :-)

Solution 4

You may achieve that by using:

[x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i != j]

Issue with your code:

You are iterating in over list twice. If you convert your comprehension to loop, it will work like:

for x in a:
    for y in a:
        x>=y # which is your condition

Hence, it the the order of execution is as: (a1, a1), (a1, a2), ... , (a2, a1), (a2, a2), ... , (a4, a4)

Solution 5

Why are you worried about the a1>=a1 comparison. It may be pridictable, but skipping it might not be worth the extra work.

Make a list of 100 numbers

In [17]: a=list(range(100))

Compare them with the simple double loop; producing a 10000 values (100*100)

In [18]: len([x>=y for x in a for y in a])
Out[18]: 10000
In [19]: timeit [x>=y for x in a for y in a]
1000 loops, best of 3: 1.04 ms per loop

Now use @Moinuddin Quadri's enumerated loop to skip the 100 eye values:

In [20]: len([x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j])
Out[20]: 9900
In [21]: timeit [x>=y for i,x in enumerate(a) for j, y in enumerate(a) if i!=j]
100 loops, best of 3: 2.12 ms per loop

It takes 2x longer. Half the extra time is the enumerates, and half the if.

In this case working with numpy arrays is much faster, even when including the time to create the array.

xa = np.array(x); Z = xa[:,None]>=xa

But you can't get rid of the the diagonal values. They will the True; they can be flipped to False, but why. In a boolean array there are only 2 values.

The fastest solution is to write an indicator function that isn't bothered by these diagonal values.

Share:
19,458
Sue
Author by

Sue

Updated on June 15, 2022

Comments

  • Sue
    Sue almost 2 years

    Let me elaborate my question using a simple example.I have a=[a1,a2,a3,a4], with all ai being a numerical value.

    What I want to get is pairwise comparisons within 'a', such as I(a1>=a2), I(a1>=a3), I(a1>=a4), ,,,,I(a4>=a1), I(a4>=a2), I(a4>=a3), where I is a indicator function. So I used the following code.

    res=[x>=y for x in a for y in a]
    

    But it also gives the comparison results like I(a1>=a1),..,I(a4>=a4), which is always one. To get rid of these nuisance, I convert res into a numpy array and find the off diagonal elements.

    res1=numpy.array(res)
    

    This gives the result what I want, but I think there should be more efficient or simpler way to do pairwise comparison and extract the off diagonal element. Do you have any idea about this? Thanks in advance.

  • Jean-François Fabre
    Jean-François Fabre over 7 years
    what about [x >= y for i,x in enumerate(a) for j,y in enumerate(a) if i > j]
  • TemporalWolf
    TemporalWolf over 7 years
    @Jean-FrançoisFabre OP wants the reverse results as well: I(a1>=a4) and I(a4>=a1)
  • Sue
    Sue over 7 years
    Yes. I need both.
  • Sue
    Sue over 7 years
    Looks great! appreciate it.