Pandas DENSE RANK


Solution 1

Use pd.Series.rank with method='dense'

df['Rank'] = df.Year.rank(method='dense').astype(int)

df
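As a self-contained check, here is a minimal sketch that rebuilds the sample frame from the question and applies the dense rank. The frame construction is an assumption based on the table shown in the question.

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({"Year": [2012, 2013, 2013, 2014],
                   "Value": [10, 20, 25, 30]})

# method='dense' gives tied values the same rank and leaves no gaps
df["Rank"] = df["Year"].rank(method="dense").astype(int)
print(df)
```

rank returns floats, which is why the result is cast with astype(int).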


Solution 2

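The fastest solution in the timings below is factorize. It numbers values in order of first appearance, so it matches a dense rank only when Year is already sorted, as in the sample:

df['Rank'] = pd.factorize(df.Year)[0] + 1

For unsorted data, pass sort=True (this will also affect the timings):

df['Rank'] = pd.factorize(df.Year, sort=True)[0] + 1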
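The effect of row order on factorize can be seen in a small sketch. The unsorted years Series below is made up for illustration and is not part of the original question.

```python
import pandas as pd

# Hypothetical unsorted data (not from the question)
years = pd.Series([2014, 2012, 2013, 2013])

# Default: codes follow the order of first appearance
by_appearance = pd.factorize(years)[0] + 1

# sort=True: codes follow the sorted order of the unique values
by_value = pd.factorize(years, sort=True)[0] + 1

# Reference result from rank(method='dense')
dense = years.rank(method="dense").astype(int)

print(by_appearance)   # numbered by first appearance
print(by_value)        # numbered by sorted value
print(dense.tolist())
```

Only the sort=True variant agrees with rank(method='dense') here, because 2014 appears first but is the largest value.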

Timings:

#len(df)=40k
df = pd.concat([df]*10000).reset_index(drop=True)

In [13]: %timeit df['Rank'] = df.Year.rank(method='dense').astype(int)
1000 loops, best of 3: 1.55 ms per loop

In [14]: %timeit df['Rank1'] = df.Year.astype('category').cat.codes + 1
1000 loops, best of 3: 1.22 ms per loop

In [15]: %timeit df['Rank2'] = pd.factorize(df.Year)[0] + 1
1000 loops, best of 3: 737 µs per loop

Solution 3

You can convert Year to a categorical and take its codes, adding one because the codes are zero-based and your example starts at one.

df['Rank'] = df.Year.astype('category').cat.codes + 1

>>> df
   Year  Value  Rank
0  2012     10     1
1  2013     20     2
2  2013     25     2
3  2014     30     3
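A short sketch of why the category codes behave like a dense rank: when pandas infers the categories from sortable values, it stores them in sorted order, so the codes do not depend on row order. The unsorted years Series is made up for illustration.

```python
import pandas as pd

# Hypothetical unsorted data (not from the question)
years = pd.Series([2014, 2012, 2013, 2013])

cat = years.astype("category")

# Inferred categories are stored in sorted order
print(cat.cat.categories.tolist())

# Codes are zero-based positions in that sorted list, so add one
rank_from_codes = cat.cat.codes + 1
print(rank_from_codes.tolist())
```

On this data the result matches years.rank(method='dense').astype(int).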
Author: Keithx

Updated on July 15, 2022

Comments

  • Keithx
    Keithx almost 2 years

    I'm working with a pandas DataFrame like this:

    Year Value  
    2012  10
    2013  20
    2013  25
    2014  30
    

    I want an equivalent of the SQL DENSE_RANK() OVER (ORDER BY year) function, to add a column like this:

        Year Value Rank
        2012  10    1
        2013  20    2
        2013  25    2
        2014  30    3
    

    How can it be done in pandas?

    Thanks!

  • Oliver W.
    Oliver W. about 7 years
    Note that you will want to use sort=True in the call to factorize, which will affect your timings as well (on my randomly generated numerical df with 3M rows, method 1, i.e. the rank method, turns out to be the fastest). It appeared to work only because the array's distinct elements were already sorted.
  • jezrael
    jezrael about 7 years
    Yes, but it depends on whether the data are sorted. The sample is sorted, so it is not necessary there.
  • Oliver W.
    Oliver W. about 7 years
    Indeed, and that's what I said. Because it's sorted, factorize will be faster. In general, data is not sorted and so factorize and rank will return different answers. I added the comment as a warning to future readers, who would blindly take over solutions without checking the conditions under which they're assumed to work.
  • jezrael
    jezrael about 7 years
    @OliverW. - Thank you.
  • jezrael
    jezrael about 6 years
    @piRSquared - Thanks, it happens. Your solution was upvoted by me ;)