Check if two rows in pandas DataFrames have the same set of values, regarding and regardless of column order

python pandas dataframe

10,294

Solution 1

Use tuple and set: a tuple keeps the column order, while a set ignores it.

s1=df1.apply(tuple,axis=1)==df2.apply(tuple,axis=1)
s2=df1.apply(set,axis=1)==df2.apply(set,axis=1)
pd.concat([s1,s2],axis=1)
Out[746]: 
         0      1
aaa   True   True
bbb  False   True
ccc  False  False

Since cs95 mentioned that apply has problems here, this is a NumPy version:

s=np.equal(df1.values,df2.values).all(1)
t=np.equal(np.sort(df1.values,1),np.sort(df2.values,1)).all(1)
pd.DataFrame(np.column_stack([s,t]),index=df1.index)
Out[754]: 
         0      1
aaa   True   True
bbb  False   True
ccc  False  False
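
One caveat with the set comparison above: a set drops repeated values, so rows that hold the same distinct values with different counts still compare equal. The sketch below illustrates this with two hypothetical frames (`left` and `right` are not from the question) and shows that the sorted comparison keeps the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so that row r1 contains repeated values.
left = pd.DataFrame({'a': ['A', 'A'], 'b': ['A', 'B'], 'c': ['B', 'C']},
                    index=['r1', 'r2'])
right = pd.DataFrame({'x': ['A', 'C'], 'y': ['B', 'B'], 'z': ['B', 'A']},
                     index=['r1', 'r2'])

# Sets collapse duplicates: r1 becomes {'A', 'B'} on both sides.
set_equal = left.apply(set, axis=1) == right.apply(set, axis=1)

# Sorting each row keeps duplicates: r1 is [A, A, B] versus [A, B, B].
sorted_equal = pd.Series(
    (np.sort(left.to_numpy(), axis=1) == np.sort(right.to_numpy(), axis=1)).all(axis=1),
    index=left.index,
)
```

If repeated values can occur in a row, the sorted comparison is probably the safer choice for the "regardless of column order" check.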

Solution 2

Here's a solution that is performant and should scale. First, give df2 the same column labels as df1 and align the two DataFrames so you can compare them easily.

df3 = df2.set_axis(df1.columns, axis=1)
df4, df5 = df1.align(df3)

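For req 1 (column order matters), compare element-wise with the == op and reduce each row with all(axis=1). DataFrame.equals is not suitable here, because it returns a single boolean for the whole frame rather than one per row: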

u = (df4 == df5).all(axis=1)
u

aaa     True
bbb    False
ccc    False
dtype: bool

Req 2 (column order ignored) is slightly more complex: sort the values within each row (np.sort sorts along the last axis by default), then compare.

v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v

aaa     True
bbb     True
ccc    False
dtype: bool

Concatenate the results:

pd.concat([u, v], axis=1, keys=['X', 'Y'])

         X      Y
aaa   True   True
bbb  False   True
ccc  False  False
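
The steps of this solution can be collected into one function. This is a minimal sketch that assumes both frames share the same index and number of columns, as in the question; `compare_rows` is a hypothetical helper name, and the `OpX`/`OpY` labels are taken from the question's expected output.

```python
import numpy as np
import pandas as pd

ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A', 'A', 'A'], 'old2': ['B', 'B', 'B'],
                    'old3': ['C', 'C', 'C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A', 'A', 'A'], 'new2': ['B', 'C', 'B'],
                    'new3': ['C', 'B', 'D']}, index=ind)

def compare_rows(left, right):
    """Hypothetical helper: row-wise equality with and without column order."""
    # Give `right` the column labels of `left`, then align both frames.
    right = right.set_axis(left.columns, axis=1)
    left, right = left.align(right)
    # Order-sensitive check: every cell in the row must match.
    ordered = (left == right).all(axis=1)
    # Order-insensitive check: compare the rows after sorting their values.
    unordered = pd.Series(
        (np.sort(left.to_numpy(), axis=1) == np.sort(right.to_numpy(), axis=1)).all(axis=1),
        index=ordered.index,
    )
    return pd.concat([ordered, unordered], axis=1, keys=['OpX', 'OpY'])

result = compare_rows(df1, df2)
```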

Solution 3

For item 2) (same values in the same column order):

(df1.values == df2.values).all(axis=1)

This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.

For item 1) (same values regardless of column order), sort the values along each row first:

import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
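
The reason for going through the underlying arrays is that the two frames have different column labels, and pandas refuses to compare differently labeled DataFrames with `==`. The sketch below uses the frames from the question to show this; the variable names are illustrative, and `to_numpy()` is used in place of `.values`.

```python
import numpy as np
import pandas as pd

ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A', 'A', 'A'], 'old2': ['B', 'B', 'B'],
                    'old3': ['C', 'C', 'C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A', 'A', 'A'], 'new2': ['B', 'C', 'B'],
                    'new3': ['C', 'B', 'D']}, index=ind)

# Comparing the frames directly raises, because the column labels differ.
try:
    df1 == df2
    direct_comparison_failed = False
except ValueError:
    direct_comparison_failed = True

# The raw arrays carry no labels, so they compare purely by position.
a, b = df1.to_numpy(), df2.to_numpy()
same_in_order = (a == b).all(axis=1)
same_any_order = (np.sort(a, axis=1) == np.sort(b, axis=1)).all(axis=1)
```

Note that the array comparison also ignores the index, so it relies on both frames listing their rows in the same order.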


Younghak Jang

Updated on June 04, 2022

Comments

Younghak Jang almost 2 years
I have two dataframes with the same index but different column names. The number of columns is the same. I want to check, index by index, 1) whether they have the same set of values regardless of column order, and 2) whether they have the same set of values regarding column order.
```
ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
```
This is the output I need.
```
     OpX   OpY
-------------
aaa  True  True
bbb  False True
ccc  False False
```
Could anyone help me with OpX and OpY?