Pandas: Find rows which don't exist in another DataFrame by multiple columns
36,036
Solution 1
Since 0.17.0
there is a new indicator
param you can pass to merge
which will tell you whether the rows are only present in left, right or both:
In [5]:
merged = df.merge(other, how='left', indicator=True)
merged
Out[5]:
col1 col2 extra_col _merge
0 0 a this left_only
1 1 b is both
2 1 c just left_only
3 2 b something left_only
In [6]:
merged[merged['_merge']=='left_only']
Out[6]:
col1 col2 extra_col _merge
0 0 a this left_only
2 1 c just left_only
3 2 b something left_only
So you can now filter the merged df by selecting only 'left_only'
rows
Solution 2
Interesting
cols = ['col1','col2']
#get copies where the indeces are the columns of interest
df2 = df.set_index(cols)
other2 = other.set_index(cols)
#Look for index overlap, ~
df[~df2.index.isin(other2.index)]
Returns:
col1 col2 extra_col
0 0 a this
2 1 c just
3 2 b something
Seems a little bit more elegant...
Author by
Pekka
Updated on February 13, 2020Comments
-
Pekka about 4 years
same as this python pandas: how to find rows in one dataframe but not in another? but with multiple columns
This is the setup:
import pandas as pd df = pd.DataFrame(dict( col1=[0,1,1,2], col2=['a','b','c','b'], extra_col=['this','is','just','something'] )) other = pd.DataFrame(dict( col1=[1,2], col2=['b','c'] ))
Now, I want to select the rows from
df
which don't exist in other. I want to do the selection bycol1
andcol2
In SQL I would do:
select * from df where not exists ( select * from other o where df.col1 = o.col1 and df.col2 = o.col2 )
And in Pandas I can do something like this but it feels very ugly. Part of the ugliness could be avoided if df had id-column but it's not always available.
key_col = ['col1','col2'] df_with_idx = df.reset_index() common = pd.merge(df_with_idx,other,on=key_col)['index'] mask = df_with_idx['index'].isin(common) desired_result = df_with_idx[~mask].drop('index',axis=1)
So maybe there is some more elegant way?