Convert dataframe rows to Python set
Solution 1
Convert each row to a frozenset, then apply the jaccard function from the question to every pair of rows with a nested apply:
series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
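For readers who want to run this end to end, here is a minimal self-contained sketch. It rebuilds the data frame and the jaccard function from the question, and it builds the matrix with a plain list comprehension instead of the nested apply. That is a swapped-in but equivalent formulation, chosen so the result does not depend on how Series.apply handles a function that returns a Series. The names row_sets and sim are illustrative and not from the original answer.

```python
import itertools

import pandas as pd


def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union| of two sets.
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Data frame from the question: every combination of A, M and F (27 rows).
A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

# One frozenset per row, as in the answer above.
row_sets = df.apply(frozenset, axis=1)

# Full pairwise similarity matrix, labelled with the original row index.
sim = pd.DataFrame(
    [[jaccard(a, b) for b in row_sets] for a in row_sets],
    index=df.index,
    columns=df.index,
)
```

Printing sim.iloc[:6, :6] should give the same 6 x 6 block that appears in the outputs further down.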
Solution 2
You could get rid of the nested apply by vectorizing your function. First, get all pairwise combinations of rows, then pass them to a vectorized version of your function:
def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
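import numpy as np

i = df.apply(frozenset, axis=1).to_frame()
# A constant key column makes the self-merge below produce every pair of rows.
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop(columns='foo')
k.columns = ['A', 'B']
# np.vectorize still loops in Python; it only removes the explicit nested apply.
fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)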
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
       [ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
       [ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
       [ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
       [ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
       [ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
This is already faster, but let's see if we can get even faster.
Using senderle's fast cartesian_product:
def cartesian_product(*arrays):
    # Build every combination of the input arrays as the rows of one 2-D array.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
i = df.apply(frozenset, axis=1).values
# Each row of j is one (row set, row set) pair.
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
       [ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
       [ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
       [ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
       [ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
       [ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
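The matrix is symmetric with ones on the diagonal, so roughly half of the calls above are redundant. A minimal sketch of computing only the upper triangle, assuming the question's data frame limited to its first six rows (to match the output shown) and non-empty row sets (so each row's similarity with itself is 1). The names sets, n, sim and upper are illustrative and not from the original answers.

```python
import itertools

import numpy as np
import pandas as pd


def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union| of two sets.
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# First six rows of the question's data frame.
A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F']).head(6)

sets = df.apply(frozenset, axis=1).tolist()
n = len(sets)

# Start from the identity matrix: every non-empty set has similarity 1 with itself.
sim = np.eye(n)

# Visit each unordered pair once and mirror the value across the diagonal.
for p, q in itertools.combinations(range(n), 2):
    sim[p, q] = sim[q, p] = jaccard(sets[p], sets[q])

# To drop the duplicated half, keep the diagonal and above; the rest becomes NaN.
upper = pd.DataFrame(sim).where(np.triu(np.ones((n, n), dtype=bool)))
```

This calls jaccard n * (n - 1) / 2 times instead of n * n times. The work still grows quadratically with the number of rows, so it is a constant-factor saving rather than a change in complexity.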
user46543
Updated on June 13, 2022
Comments
-
user46543 almost 2 years
I have this dataset:
import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
The example output is like this:
   A  M       F
0  A  1    plus
1  A  1   minus
2  A  1  square
3  A  2    plus
4  A  2   minus
5  A  2  square
I want a pairwise comparison (Jaccard similarity) of every row in this data frame, for example comparing A 1 plus and A 2 square, and to get the similarity value between those two sets.
I have written a jaccard function:
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
It only works on sets, because it uses intersection.
I want output like this (the values shown here are just random numbers):
       0     1     2     3    45
0   1.00  0.43  0.61  0.55  0.46
1   0.43  1.00  0.52  0.56  0.49
2   0.61  0.52  1.00  0.48  0.53
3   0.55  0.56  0.48  1.00  0.49
45  0.46  0.49  0.53  0.49  1.00
What is the best way to compute these pairwise metrics?
Thank you,
-
user46543 over 6 years
Thank you Sebastian, I had been trying to get this result for more than an hour.
-
Sebastian Mendez over 6 years
Honestly I didn't think it would be this easy, I've just started using apply for "good-enough" solutions.
-
user46543 over 6 years
Hi @Sebastian, is it possible to remove half of the result diagonally? Because it duplicates, right?
-
cs95 over 6 years
Nested apply? Wow, this is going to suffer on large inputs.
-
Sebastian Mendez over 6 years
For sure, but is there an easier way to take the Cartesian product of two Series and turn it into a DataFrame with a custom function? Or is there a better method to approach this? Like I said, this was my "good-enough" approach, but I'd love to see a more refined answer.
-
Sebastian Mendez over 6 years
@user46543 see this answer.
-
cs95 over 6 years
I invite you to do a timings comparison here: stackoverflow.com/a/47545491/4909087
-
Sebastian Mendez over 6 years
Definitely a far better solution, +1. Still, you can't deny the simplicity of the nested applys, and it should only be a constant factor slower. Also, I edited my answer to use frozenset, completely forgot about that.
-
cs95 over 6 years
@Sebastian Admittedly yes, but I'm betting that constant factor is pretty big, and you should see the difference for moderately sized inputs (yes, for large inputs, this ends up becoming slow due to the combinatorial nature of the problem).
-
Sebastian Mendez over 6 years
Also, I'm not sure what's the computationally expensive part of this, but assuming it's applying fnc, you could reduce the time by about a factor of two by only applying it to the upper triangle.
-
cs95 over 6 years
@Sebastian That, and the function itself being inherently slow. What's more, working with columns of frozensets offers no performance benefit, as they are stored as generic Python objects. This is the nature of OP's input. Yes, computing only the upper triangle should offer some more speed gain.