Pandas, how to filter a df to get unique entries?

12,234

Solution 1

One way is to sort the dataframe and then take the first row of each group after a groupby.

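# first way: sort so the highest value comes first within each type
# (named sorted_df to avoid shadowing the built-in sorted)
sorted_df = df.sort_values(['type', 'value'], ascending=[True, False])

first = sorted_df.groupby('type').first().reset_index()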

Another way does not necessarily keep only one row per type: if several IDs share the same maximum value, it keeps all of them rather than just one.

# second way: find the maximum value per type, then join back to the original rows
grouped = df.groupby('type').agg({'value': 'max'}).reset_index()
grouped = grouped.set_index(['type', 'value'])

second = grouped.join(df.set_index(['type', 'value']))

example:

data

ID  type    value
1   A   8
2   A   5
3   B   11
4   C   12
5   D   1
6   D   22
7   D   13
8   D   22

first method results in

type  ID  value
A   1      8
B   3     11
C   4     12
D   6     22

second method keeps both ID=6 and ID=8, because they tie for the maximum value of type D

            ID
type value    
A    8       1
B    11      3
C    12      4
D    22      6
     22      8

(You can call reset_index() again here if you don't want the MultiIndex.)
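
For anyone who wants to reproduce the tables above, here is a minimal runnable sketch of both ways on the example data. The DataFrame construction and the final reset_index() on `second` are additions for illustration; the rest follows the snippets above.

```python
import pandas as pd

# Example data from above, including the tie between ID 6 and ID 8.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "type": ["A", "A", "B", "C", "D", "D", "D", "D"],
    "value": [8, 5, 11, 12, 1, 22, 13, 22],
})

# First way: sort so the largest value comes first within each type,
# then keep the first row of every group.
sorted_df = df.sort_values(["type", "value"], ascending=[True, False])
first = sorted_df.groupby("type").first().reset_index()

# Second way: compute the per-type maximum, then join back to recover
# every ID that reaches that maximum.
grouped = df.groupby("type").agg({"value": "max"}).reset_index()
grouped = grouped.set_index(["type", "value"])
second = grouped.join(df.set_index(["type", "value"])).reset_index()

print(first)   # one row per type
print(second)  # one row per (type, maximum) match, so ties are kept
```

The first way returns exactly one row per type, while the second returns five rows here because type D has two rows at its maximum.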

Solution 2

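df.sort_values('value', ascending=False)[['type', 'value']].drop_duplicates(subset=['type'])

This works in general. If you have more columns, select only the ones you are interested in; here we chose 'type' and 'value'. Note that drop_duplicates keeps the first row it encounters for each type, so the dataframe has to be sorted by value in descending order first for the kept row to be the one with the highest value.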
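
As a rough sketch of why the row order matters, the example below runs drop_duplicates on the question's data with and without sorting first. It keeps the ID column rather than selecting only 'type' and 'value', and the final sort_values("ID") is only there to make the output easier to compare; both are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "type": ["A", "A", "B", "C", "D", "D", "D"],
    "value": [8, 5, 11, 12, 1, 22, 13],
})

# Without sorting, the first row seen for each type is kept (ID 5 for D).
unsorted_result = df.drop_duplicates(subset=["type"])

# Sorting by value in descending order first makes "first seen" mean "largest".
result = (
    df.sort_values("value", ascending=False)
      .drop_duplicates(subset=["type"], keep="first")
      .sort_values("ID")
)

print(unsorted_result)
print(result)
```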

Solution 3

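Use groupby on "type" and take only the first row of each group: df.groupby("type").first(). This returns the first row in the existing order, so sort by value in descending order beforehand if you want the highest value per type.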
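
A small sketch of this on the question's data, assuming you want "type" kept as a regular column (hence as_index=False, which is not part of the one-liner above):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "type": ["A", "A", "B", "C", "D", "D", "D"],
    "value": [8, 5, 11, 12, 1, 22, 13],
})

# On the data as given, first() takes the first row of each type
# in the existing order, so D ends up with ID 5 and value 1.
plain = df.groupby("type").first()

# Sorting by value first makes the first row of each group the maximum;
# as_index=False keeps "type" as a column instead of moving it to the index.
by_max = (
    df.sort_values("value", ascending=False)
      .groupby("type", as_index=False)
      .first()
)

print(plain)
print(by_max)
```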

Solution 4

I prefer this approach because groupby builds a new df. You do get the unique values, but technically that does not filter your df, it creates a new one. This approach leaves your index untouched: you get the same rows of df, just without the duplicates.

df = df.sort_values('value', ascending=False)
# index labels of the first row seen for each 'type' (the highest value after sorting)
idx = df['type'].drop_duplicates().index
# select those rows from df; the original index labels are preserved
df.loc[idx, :]
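
To show what keeping the index untouched means in practice, here is a minimal sketch that assumes the IDs are unique strings used as the index. The labels 'M_001' to 'M_007' are made up for illustration, and the final sort_index() is only there to restore the original row order.

```python
import pandas as pd

# Hypothetical string IDs used as the index (assumed to be unique).
df = pd.DataFrame(
    {
        "type": ["A", "A", "B", "C", "D", "D", "D"],
        "value": [8, 5, 11, 12, 1, 22, 13],
    },
    index=["M_001", "M_002", "M_003", "M_004", "M_005", "M_006", "M_007"],
)
df.index.name = "ID"

# Sort a copy so the highest value of each type is seen first.
sorted_df = df.sort_values("value", ascending=False)

# Index labels of the first (largest) row for each type.
idx = sorted_df["type"].drop_duplicates().index

# Select those labels from the original frame; the IDs stay as the index.
filtered = df.loc[idx].sort_index()
print(filtered)
```

Because the selection goes through .loc with the original labels, no extra column is needed to save and restore the IDs.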

Author: Gioelelm

PhD Student at Karolinska Institutet. I am interested in python programming as a tool for scientific computing and analysis.

Updated on September 14, 2022

Comments

  • Gioelelm
    Gioelelm over 1 year

    I have a dataframe like this:

    ID  type value
    1   A    8
    2   A    5
    3   B    11
    4   C    12
    5   D    1
    6   D    22
    7   D    13
    

    I want to filter the dataframe so that each "type" attribute occurs only once (e.g. A appears only once), and if several rows have the same "type" I want to keep the one with the highest value. I want to get something like:

    ID  type value
    1   A    8
    3   B    11
    4   C    12
    6   D    22
    

    How do I do this with pandas?

  • Gioelelm
    Gioelelm about 10 years
    Nice! But this way I lose the ID. How can I restore the previous layout?
  • Gioelelm
    Gioelelm about 10 years
    Fantastic answer. Still one problem: in reality my indexes are the IDs, and they are unique identifier strings like 'M_001'. How can I restore those indexes? By saving them as an extra column and assigning them as the index afterwards?
  • mkln
    mkln about 10 years
    from above, I would do second.reset_index().set_index('ID')
  • shan.B
    shan.B almost 4 years
    If you could give some verbal information your answer will be much richer.
  • Jeremy Caney
    Jeremy Caney almost 4 years
    In addition to what @scriptmonster said, when answering old questions with an accepted answer, it's especially useful to explain why your approach is preferred over that answer.
  • legale
    legale almost 4 years
    Thank you for your advice.