Convert dataframe rows to Python set

python pandas dataframe set similarity


Solution 1

A full implementation of what you want, reusing your jaccard function:

series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
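For reference, here is a self-contained sketch of the two lines above. It assumes the `jaccard` function and the sample data from the question, and restricts the frame to its first six rows (an assumption made only to keep the output small).

```python
import itertools
import pandas as pd

def jaccard(a, b):
    # Jaccard similarity: |a & b| / |a | b|
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Only the first six rows are kept here to keep the example small (assumption)
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

# One frozenset per row
series_set = df.apply(frozenset, axis=1)
# Outer apply walks the rows, inner apply compares one row against all rows
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))
print(new_df)
```

The result is a square DataFrame indexed by the original row labels on both axes, so `new_df.loc[0, 5]` is the similarity between rows 0 and 5.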

Solution 2

You could get rid of the nested apply by vectorizing your function. First, build every pairwise combination of rows, then pass the pairs to a vectorized version of your function -

import numpy as np

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# One frozenset per row, as a single-column DataFrame
i = df.apply(frozenset, axis=1).to_frame()
# Cross join on a constant key to get every (row, row) pair
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop(columns='foo')
k.columns = ['A', 'B']

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)

y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
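The constant-key merge above is effectively a cross join. As a possible alternative, assuming pandas 1.2 or later (where `merge` accepts `how='cross'`), the pairs can be built without the helper column. The sample frame below is the question's data cut to six rows, which is an assumption made to match the output shown.

```python
import itertools
import numpy as np
import pandas as pd

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

# One frozenset per row, in a column named 'A'
sets = df.apply(frozenset, axis=1).to_frame(name='A')
# how='cross' pairs every left row with every right row (pandas >= 1.2)
pairs = sets.merge(sets.rename(columns={'A': 'B'}), how='cross')

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(pairs['A'], pairs['B']).reshape(len(df), -1)
print(y)
```

The cross join keeps the left rows in order and cycles through the right rows for each of them, so the reshape yields `y[i, j]` for row `i` against row `j`.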

This is already faster, but let's see if we can get even faster.

Using senderle's fast cartesian_product -

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

# Reuses fnc (the vectorized jaccard_similarity_score) from above
i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
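If the sets themselves are not needed, plain NumPy broadcasting may avoid Python-level set operations altogether. The sketch below is not part of the answers above and rests on two assumptions that happen to hold for the sample data: no value appears in more than one column, and every row has the same number of columns. Under those assumptions the intersection size of two rows equals the number of columns in which they agree.

```python
import itertools
import numpy as np
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

vals = df.to_numpy()          # shape (n_rows, n_cols)
n_cols = vals.shape[1]

# shared[i, j] = number of columns where row i and row j hold the same value.
# ASSUMPTION: column value ranges are disjoint, so this equals |set_i & set_j|.
shared = (vals[:, None, :] == vals[None, :, :]).sum(axis=-1)

# |set_i | set_j| = n_cols + n_cols - shared (each row has n_cols distinct values)
y = shared / (2 * n_cols - shared)
print(y)
```

If a value can occur in several columns, this shortcut no longer matches the set-based definition, and the frozenset approach is the safer choice.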


Author by

user46543

Updated on June 13, 2022

Comments

user46543 almost 2 years

I have this dataset:

import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']

df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)

The output looks like this (first six rows shown):

    A  M       F
0   A  1    plus
1   A  1   minus
2   A  1  square
3   A  2    plus
4   A  2   minus
5   A  2  square

I want a pairwise comparison (Jaccard similarity) of every row in this data frame: for example, comparing A 1 plus with A 2 square and getting the similarity value between those two sets.

I have written a jaccard function:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

It only works on sets, because it uses intersection.
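To make the set requirement concrete, here is a minimal sketch of turning each row into a set and comparing the two rows mentioned above. The name `row_sets` and the use of `itertuples` are illustrative choices, not something taken from the answers.

```python
import itertools
import pandas as pd

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

# row_sets is a hypothetical helper name: one frozenset per DataFrame row
row_sets = [frozenset(row) for row in df.itertuples(index=False)]

# Row 0 is (A, 1, plus) and row 5 is (A, 2, square); they share only 'A',
# so the similarity is 1 / (3 + 3 - 1)
similarity = jaccard(row_sets[0], row_sets[5])
print(similarity)  # 0.2
```

Note that a set drops repeated values, so a row holding the same value in two columns would yield a smaller set than the number of columns.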

I want output like this (the values shown are just random numbers):

       0     1     2     3    45
0   1.00  0.43  0.61  0.55  0.46
1   0.43  1.00  0.52  0.56  0.49
2   0.61  0.52  1.00  0.48  0.53
3   0.55  0.56  0.48  1.00  0.49
45  0.46  0.49  0.53  0.49  1.00

What is the best way to get the result of pairwise metrics?

Thank you,

user46543 over 6 years

Thank you Sebastian, I had been trying to get this result for more than an hour.

Sebastian Mendez over 6 years

Honestly, I didn't think it would be this easy; I've just started using apply for "good-enough" solutions.

user46543 over 6 years

Hi @Sebastian, is it possible to remove half of the result diagonally? Because it duplicates, right?

cs95 over 6 years

Nested apply? Wow, this is going to suffer on large inputs.

Sebastian Mendez over 6 years

For sure, but is there an easier way to take the Cartesian product of two Series and turn it into a DataFrame with a custom function? Or is there a better method to approach this? Like I said, this was my "good-enough" approach, but I'd love to see a more refined answer.

Sebastian Mendez over 6 years

@user46543 see this answer.

cs95 over 6 years

I invite you to do a timings comparison here: stackoverflow.com/a/47545491/4909087

Sebastian Mendez over 6 years

Definitely a far better solution, +1. Still, you can't deny the simplicity of the nested apply calls, and it should only be a constant factor slower. Also, I edited my answer to use frozenset; I completely forgot about that.

cs95 over 6 years

@Sebastian Admittedly yes, but I'm betting that constant factor is pretty big, and you should see the difference for moderately sized inputs (yes, for large inputs, this ends up becoming slow due to the combinatorial nature of the problem).

Sebastian Mendez over 6 years

Also, I'm not sure what's the computationally expensive part of this, but assuming it's applying fnc, you could reduce the time by about a factor of two by only applying it to the upper triangle.

cs95 over 6 years

@Sebastian That, and the function itself being inherently slow. What's more, working with columns of frozensets offers no performance benefit, as they are objects. This is the nature of OP's input. Yes, computing only the upper triangle should offer some more speed gain.
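The upper-triangle idea from the last two comments could be sketched as follows. This is one possible reading of the suggestion, not code from the thread: it assumes the similarity is symmetric and that every row set is non-empty, so the diagonal is 1. The six-row sample frame is likewise an assumption.

```python
import itertools
import numpy as np
import pandas as pd

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

sets = [frozenset(row) for row in df.itertuples(index=False)]
n = len(sets)

# Diagonal is 1 because a non-empty set is identical to itself (assumption)
y = np.eye(n)
# combinations() yields each unordered pair (i < j) exactly once,
# so jaccard runs n*(n-1)/2 times instead of n*n
for i, j in itertools.combinations(range(n), 2):
    y[i, j] = y[j, i] = jaccard(sets[i], sets[j])

print(y)
```

To show only one half, as asked earlier in the comments, `np.triu(y)` zeroes everything below the diagonal of the finished matrix.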