Find all duplicate rows in a pandas dataframe
First filter all duplicated rows, then groupby with apply, or convert the index to a Series with to_series:
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
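A self-contained sketch of both approaches, assuming the small sample frame from the question (one column named col with values 1, 2, 1, 1, 2 at index 1..5):

```python
import pandas as pd

# Sample frame from the question (an assumption): values 1, 2, 1, 1, 2 at index 1..5
df = pd.DataFrame({'col': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])

# Keep only rows whose value appears more than once
dupes = df[df.col.duplicated(keep=False)]

# Approach 1: group the duplicated rows and collect each group's indices
a = dupes.groupby('col').apply(lambda x: list(x.index))
print(a)

# Approach 2: turn the index into a Series and group it by the column values
b = dupes.index.to_series().groupby(dupes.col).apply(list)
print(b)
```

Both produce a Series mapping each duplicated value to the list of row labels where it occurs.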
And if need nested lists:
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]
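The nested-list variant, again assuming the sample data from the question; .tolist() drops the group labels and keeps only the index lists:

```python
import pandas as pd

# Same sample data as in the question (an assumption)
df = pd.DataFrame({'col': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])
df = df[df.col.duplicated(keep=False)]

# .tolist() discards the group keys, leaving a plain list of index lists
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print(L)
```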
If you need to use only the first column, you can select it by position with iloc:
a = df[df.iloc[:,0].duplicated(keep=False)].groupby(df.iloc[:,0]).apply(lambda x: list(x.index))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
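A runnable sketch of the positional selection, assuming a frame whose first column has some arbitrary, unknown name (the name "whatever" is purely illustrative):

```python
import pandas as pd

# Sample frame whose first column has an arbitrary name (name and data are assumptions)
df = pd.DataFrame({'whatever': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])

# Filter the duplicated rows, selecting the column purely by position
dupes = df[df.iloc[:, 0].duplicated(keep=False)]

# Group by that positional column and collect the indices of each group
a = dupes.groupby(dupes.iloc[:, 0]).apply(lambda x: list(x.index))
print(a)
```

Because everything goes through iloc[:, 0], the column name is never referenced, which is what the question asked for.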
Author: Nico
Updated on June 07, 2022

Comments
-
Nico almost 2 years
I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
    col
1 |  1
2 |  2
3 |  1
4 |  1
5 |  2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].
-
Nico about 7 years: Okay, that's good, except that since I don't know the columns I need to groupby df.columns, but that's fine. I don't know how I didn't think of groupby myself.
-
jezrael about 7 years: I added a solution for selecting by position.
-
Nabin about 6 years: Can this find duplicate rows across multiple columns too? I mean, I see only col in the example, not col1, col2, col3, and so on.
-
jezrael about 6 years: @Nabin To check dupes in multiple columns use df = df[df.duplicated(subset=['col','col1','col2'], keep=False)]; if you want to check dupes by all columns, use df = df[df.duplicated(keep=False)]
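The multi-column check from this comment can be sketched as follows; the frame, its column names, and its values are illustrative assumptions chosen so the two checks give different results:

```python
import pandas as pd

# Hypothetical frame with several columns (names and values are assumptions)
df = pd.DataFrame({'col':  [1, 1, 1, 2],
                   'col1': ['a', 'a', 'a', 'c'],
                   'col2': [10, 10, 99, 20]})

# Duplicates judged on a subset of columns: rows 0, 1, 2 share (col, col1)
sub = df[df.duplicated(subset=['col', 'col1'], keep=False)]
print(sub.index.tolist())

# Duplicates judged on every column: only rows 0 and 1 are fully identical
full = df[df.duplicated(keep=False)]
print(full.index.tolist())
```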
-
Sarah Lissachell over 3 years: This is just what I needed.