Convert dataframe rows to Python set

Solution 1

Here is a full implementation of what you want, using the jaccard function from your question:

series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
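
For reference, here is a self-contained sketch of the snippet above, run on the first six rows of the question's frame so the result is easy to check by hand. The jaccard function is the one from the question; restricting to df.head(6) and the name small are my own choices for illustration, not part of the original answer.

```python
import itertools
import pandas as pd

def jaccard(a, b):
    # Jaccard similarity: |a ∩ b| / |a ∪ b|
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

# Illustrative assumption: only the first six rows, to keep the output small.
small = df.head(6)

# One frozenset per row, then compare every row against every other row.
series_set = small.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))

print(new_df.round(2))
```

Rows 0 and 1 ('A', '1', 'plus' and 'A', '1', 'minus') share two of four distinct values, so new_df.iloc[0, 1] is 0.5; rows 0 and 4 share only 'A', giving 1 / 5 = 0.2.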

Solution 2

You could get rid of the nested apply by vectorizing your function. First, get all pairwise combinations of rows, then pass them to a vectorized version of your function:

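import numpy as np

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# One frozenset per row, then a cross join on a constant key to pair every row with every row.
i = df.apply(frozenset, axis=1).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop(columns='foo')  # keyword form works across pandas versions
k.columns = ['A', 'B']

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)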
y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])

This is already faster, but let's see if we can get even faster.


Using senderle's fast cartesian_product:

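def cartesian_product(*arrays):
    # np is NumPy (import numpy as np), as in the snippet above.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    # One axis per input array, plus a trailing axis that holds each combination.
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)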


i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
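
To get the labelled table the question asks for, the array can be wrapped in a DataFrame that reuses the row index. The sketch below is self-contained and makes a few assumptions of its own: it builds the pairs with np.repeat and np.tile as a simple stand-in for cartesian_product, passes otypes=[float] to np.vectorize, and only uses the first six rows of the question's frame.

```python
import itertools
import numpy as np
import pandas as pd

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Illustrative assumption: keep only the first six rows.
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

sets = df.apply(frozenset, axis=1).to_numpy()
n = len(sets)

# All ordered pairs (row i, row j) in row-major order:
# left repeats each set n times, right cycles through all sets n times.
left = np.repeat(sets, n)
right = np.tile(sets, n)

fnc = np.vectorize(jaccard_similarity_score, otypes=[float])
y = fnc(left, right).reshape(n, n)

# Label rows and columns with the original index.
result = pd.DataFrame(y, index=df.index, columns=df.index)
print(result)
```

Here result.loc[i, j] holds the similarity between rows i and j, so the matrix is symmetric with ones on the diagonal.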
Author by

user46543

Updated on June 13, 2022

Comments

  • user46543
    user46543 almost 2 years

    I have this dataset:

    import pandas as pd
    import itertools
    
    A = ['A','B','C']
    M = ['1','2','3']
    F = ['plus','minus','square']
    
    df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
    print(df)
    

    The example output is like this:

       A  M       F
    0  A  1    plus
    1  A  1   minus
    2  A  1  square
    3  A  2    plus
    4  A  2   minus
    5  A  2  square
    

    I want a pairwise comparison (Jaccard similarity) of the rows of this data frame, for example comparing

    A 1 plus and A 2 square to get the similarity value between those two sets.

    I have written a jaccard function:

    def jaccard(a, b):
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))
    

    It only works on sets, because it uses intersection.

    I want output like this (the values shown are just random numbers):

        0     1     2     3     45
    0  1.00  0.43  0.61  0.55  0.46
    1  0.43  1.00  0.52  0.56  0.49
    2  0.61  0.52  1.00  0.48  0.53
    3  0.55  0.56  0.48  1.00  0.49
    45  0.46  0.49  0.53  0.49  1.00
    

    What is the best way to get the result of pairwise metrics?

    Thank you,

  • user46543
    user46543 over 6 years
    Thank you Sebastian, I had been trying to get this result for more than an hour.
  • Sebastian Mendez
    Sebastian Mendez over 6 years
    Honestly I didn't think it would be this easy, I've just started using apply for "good-enough" solutions.
  • user46543
    user46543 over 6 years
    Hi @Sebastian, is it possible to remove half of the result along the diagonal? The two halves are duplicates, right?
  • cs95
    cs95 over 6 years
    Nested apply? Wow, this is going to suffer on large inputs.
  • Sebastian Mendez
    Sebastian Mendez over 6 years
    For sure, but is there an easier way to take the Cartesian product of two Series and turn it into a DataFrame with a custom function? Or is there a better method to approach this? Like I said, this was my "good-enough" approach, but I'd love to see a more refined answer.
  • Sebastian Mendez
    Sebastian Mendez over 6 years
    @user46543 see this answer.
  • cs95
    cs95 over 6 years
    I invite you to do a timings comparison here: stackoverflow.com/a/47545491/4909087
  • Sebastian Mendez
    Sebastian Mendez over 6 years
    Definitely a far better solution, +1. Still, you can't deny the simplicity of the nested applys, and it should only be a constant factor slower. Also, I edited my answer to use frozenset, completely forgot about that.
  • cs95
    cs95 over 6 years
    @Sebastian Admittedly yes, but I'm betting that constant factor is pretty big, and you should see the difference for moderately sized inputs (yes, for large inputs, this ends up becoming slow due to the combinatorial nature of the problem).
  • Sebastian Mendez
    Sebastian Mendez over 6 years
    Also, I'm not sure what's the computationally expensive part of this, but assuming it's applying fnc, you could reduce the time by about a factor of two by only applying it to the upper triangle.
  • cs95
    cs95 over 6 years
    @Sebastian That, and the function itself being inherently slow. What's more, working with columns of frozensets offers no performance benefit, as they are objects. This is the nature of OP's input. Yes, computing only the upper triangle should offer some more speed gain.
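
Following up on the upper-triangle suggestion in the comments, here is a minimal sketch rather than a benchmarked solution. It evaluates the function only for pairs with i < j via np.triu_indices, mirrors the values into the lower triangle, and sets the diagonal to 1, which assumes no row yields an empty set. The names rows, cols, keep and upper_only, and the use of only six rows, are illustrative choices of mine.

```python
import itertools
import numpy as np
import pandas as pd

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
# Illustrative assumption: keep only the first six rows.
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

sets = df.apply(frozenset, axis=1).to_numpy()
n = len(sets)

# Index pairs strictly above the diagonal: n * (n - 1) / 2 pairs instead of n * n.
rows, cols = np.triu_indices(n, k=1)
fnc = np.vectorize(jaccard, otypes=[float])
vals = fnc(sets[rows], sets[cols])

# Start from the identity (a non-empty set has similarity 1 with itself),
# then fill both triangles from the same values.
y = np.eye(n)
y[rows, cols] = vals
y[cols, rows] = vals

# To drop the duplicated half instead, blank out everything below the diagonal.
keep = np.triu(np.ones((n, n), dtype=bool))
upper_only = pd.DataFrame(y).where(keep)
print(upper_only)
```

The full matrix y matches what the nested apply produces, while upper_only shows NaN below the diagonal for anyone who only wants each pair once.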