Check whether two rows in pandas DataFrames have the same set of values, regarding and regardless of column order

Solution 1

Use tuple and set: comparing rows as tuples respects column order, while comparing them as sets ignores it.

s1 = df1.apply(tuple, axis=1) == df2.apply(tuple, axis=1)
s2 = df1.apply(set, axis=1) == df2.apply(set, axis=1)
pd.concat([s1, s2], axis=1)
Out[746]: 
         0      1
aaa   True   True
bbb  False   True
ccc  False  False
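
One caveat with the set comparison is that a set drops duplicates, so rows such as A, A, B and A, B, B compare as equal. The sketch below uses made-up frames (`left` and `right` are not from the question) and a hypothetical `sorted_row` helper to contrast the set check with a sorted-tuple check that keeps duplicates. The question's data has no repeated values within a row, so both approaches agree there.

```python
import pandas as pd

# Illustrative data (not from the question): the "dup" row repeats a value.
left = pd.DataFrame(
    {"a": ["A", "A"], "b": ["A", "B"], "c": ["B", "C"]}, index=["dup", "perm"]
)
right = pd.DataFrame(
    {"x": ["A", "C"], "y": ["B", "A"], "z": ["B", "B"]}, index=["dup", "perm"]
)

# set() discards duplicates: {A, A, B} and {A, B, B} both become {A, B}.
as_sets = left.apply(set, axis=1) == right.apply(set, axis=1)


def sorted_row(row):
    """Hypothetical helper: order-insensitive key that keeps duplicates."""
    return tuple(sorted(row))


as_sorted = left.apply(sorted_row, axis=1) == right.apply(sorted_row, axis=1)

print(pd.concat([as_sets, as_sorted], axis=1, keys=["set", "sorted"]))
```

If repeated values within a row matter for your data, the sorted comparison is probably the safer of the two.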

Since cs95 mentioned that apply is problematic here, the same two checks can be done with NumPy:

s=np.equal(df1.values,df2.values).all(1)
t=np.equal(np.sort(df1.values,1),np.sort(df2.values,1)).all(1)
pd.DataFrame(np.column_stack([s,t]),index=df1.index)
Out[754]: 
         0      1
aaa   True   True
bbb  False   True
ccc  False  False

Solution 2

Here's a solution that is performant and should scale. First, align the DataFrames on the index so you can compare them easily.

df3 = df2.set_axis(df1.columns, axis=1)  # returns a new DataFrame; the inplace argument was removed in pandas 2.0
df4, df5 = df1.align(df3)
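
To see what the alignment step buys you, here is a small sketch in which the second frame's rows are deliberately shuffled. `df2_shuffled` is an assumption made for illustration; it holds the same values as the question's df2 in a different row order. After `set_axis` and `align`, both frames share the same index order, so a row-wise comparison pairs the right rows.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"old1": ["A", "A", "A"], "old2": ["B", "B", "B"], "old3": ["C", "C", "C"]},
    index=["aaa", "bbb", "ccc"],
)
# Same values as the question's df2, but with the rows in a different order.
df2_shuffled = pd.DataFrame(
    {"new1": ["A", "A", "A"], "new2": ["B", "B", "C"], "new3": ["D", "C", "B"]},
    index=["ccc", "aaa", "bbb"],
)

# Give both frames the same column labels, then align them on the index.
df3 = df2_shuffled.set_axis(df1.columns, axis=1)
df4, df5 = df1.align(df3)

# Rows now line up by label, so element-wise comparison is meaningful.
row_equal = (df4 == df5).all(axis=1)
print(row_equal)
```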

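For the order-sensitive check (column X below), compare element-wise with the == operator and reduce across each row. Note that DataFrame.equals returns a single boolean for the whole frame, so it does not give a per-row result: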

u = (df4 == df5).all(axis=1)
u

aaa     True
bbb    False
ccc    False
dtype: bool

The order-insensitive check (column Y below) is slightly more complex: sort the values within each row (np.sort sorts along the last axis by default), then compare.

v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v

aaa     True
bbb     True
ccc    False
dtype: bool

Concatenate the results:

pd.concat([u, v], axis=1, keys=['X', 'Y'])

         X      Y
aaa   True   True
bbb  False   True
ccc  False  False

Solution 3

For item 2):

(df1.values == df2.values).all(axis=1)

This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.

For item 1), sort the values along each row first:

import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
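
Putting both checks together, a minimal end-to-end sketch with the question's data might look like this. The `compare_rows` helper and its shape check are assumptions added for illustration; the `OpX`/`OpY` column names follow the output requested in the question.

```python
import numpy as np
import pandas as pd


def compare_rows(left, right):
    """Hypothetical helper: row-wise equality with and without column order."""
    if left.shape != right.shape:
        raise ValueError("both frames must have the same shape")
    a = left.to_numpy()
    b = right.reindex(left.index).to_numpy()  # line rows up by index label
    ordered = (a == b).all(axis=1)  # column order matters
    unordered = (np.sort(a, axis=1) == np.sort(b, axis=1)).all(axis=1)
    return pd.DataFrame({"OpX": ordered, "OpY": unordered}, index=left.index)


ind = ["aaa", "bbb", "ccc"]
df1 = pd.DataFrame(
    {"old1": ["A", "A", "A"], "old2": ["B", "B", "B"], "old3": ["C", "C", "C"]},
    index=ind,
)
df2 = pd.DataFrame(
    {"new1": ["A", "A", "A"], "new2": ["B", "C", "B"], "new3": ["C", "B", "D"]},
    index=ind,
)

result = compare_rows(df1, df2)
print(result)
```

Sorting each row requires the values in a row to be mutually comparable, which holds for the all-string data here but may not for mixed types.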
Author: Younghak Jang

Updated on June 04, 2022

Comments

  • Younghak Jang
    Younghak Jang almost 2 years

    I have two DataFrames with the same index but different column names. The number of columns is the same. I want to check, index by index, 1) whether they have the same set of values regardless of column order, and 2) whether they have the same values with column order taken into account.

    ind = ['aaa', 'bbb', 'ccc']
    df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
    df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
    

    This is the output I need.

         OpX   OpY
    -------------
    aaa  True  True
    bbb  False True
    ccc  False False
    

    Could anyone help me with OpX and OpY?