Pandas - Alternative to rank() function that gives unique ordinal ranks for a column

python pandas ranking rank ordinal

13,686

I think the way you were trying to use the method=first to rank them after sorting were causing problems.

You could simply use the rank method with first arg on the grouped object itself giving you the desired unique ranks per group.

df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])

0     2
1     3
2     1
3     4
4     3
5     1
6     2
7     4
8     2
9     3
10    1
11    4
12    2
13    3
14    1
15    4
Name: new_rank, dtype: int32

Perform pivot operation:

df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
                    values=['adtext'], aggfunc=lambda x: ' '.join(x))

Choose the second level of the multiindex columns which pertain to the rank numbers:

df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df

Data:(to replicate)

df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
                             'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
                  'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
                             'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
                  'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
                             'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
                  'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9, 
                              2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
                  'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
                             'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})

Note: method=first assigns ranks in the order they appear in the array/series.

13,686

Author by

Fluous

Updated on June 11, 2022

Comments

Fluous almost 2 years

At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinal ranks within a group.

My Python and Pandas knowledge is limited as I am just a beginner.

The Goal

I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.

Table 1: Data structure I am trying to achieve

    device  , weeks   , rank_1   , rank_2   , rank_3   , rank_4   , rank_5
    mobile  , wk 1    , string   , string   , string   , string   , string 
    mobile  , wk 2    , string   , string   , string   , string   , string 
    computer, wk 1    , string   , string   , string   , string   , string
    computer, wk 2    , string   , string   , string   , string   , string

The Problem

The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.

Table 2: Data structure I have

    weeks    device   , website  , ranking  , adtext
    wk 1     mobile   , url1     , *2.1     , string
    wk 1     mobile   , url2     , *2.1     , string
    wk 1     mobile   , url3     , 1.0      , string
    wk 1     mobile   , url4     , 2.9      , string
    wk 1     desktop  , *url5    , 2.1      , string
    wk 1     desktop  , url2     , *1.5     , string
    wk 1     desktop  , url3     , *1.5     , string
    wk 1     desktop  , url4     , 2.9      , string
    wk 2     mobile   , url1     , 2.0      , string
    wk 2     mobile   , *url6    , 2.1      , string
    wk 2     mobile   , url3     , 1.0      , string
    wk 2     mobile   , url4     , 2.9      , string
    wk 2     desktop  , *url5    , 2.1      , string
    wk 2     desktop  , url2     , *2.9     , string
    wk 2     desktop  , url3     , 1.0      , string
    wk 2     desktop  , url4     , *2.9     , string

Table 3: The table I cannot seem to create

    weeks    device   , website  , ranking  , adtext  , ranking
    wk 1     mobile   , url1     , *2.1     , string  , 2
    wk 1     mobile   , url2     , *2.1     , string  , 3
    wk 1     mobile   , url3     , 1.0      , string  , 1
    wk 1     mobile   , url4     , 2.9      , string  , 4
    wk 1     desktop  , *url5    , 2.1      , string  , 3
    wk 1     desktop  , url2     , *1.5     , string  , 1
    wk 1     desktop  , url3     , *1.5     , string  , 2
    wk 1     desktop  , url4     , 2.9      , string  , 4
    wk 2     mobile   , url1     , 2.0      , string  , 2
    wk 2     mobile   , *url6    , 2.1      , string  , 3
    wk 2     mobile   , url3     , 1.0      , string  , 1
    wk 2     mobile   , url4     , 2.9      , string  , 4
    wk 2     desktop  , *url5    , 2.1      , string  , 2
    wk 2     desktop  , url2     , *2.9     , string  , 3
    wk 2     desktop  , url3     , 1.0      , string  , 1
    wk 2     desktop  , url4     , *2.9     , string  , 4

The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)

The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)

The .rank(method="first", ascending=True) throws a ValueError

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)

ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.

df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))

My question to you

I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.

Thank you!

Sebastiaan

EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.

So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.

Also Steven G asked me for an example to play around with. I have provided that here.

Example data can be pasted directly into python

! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.

data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1',
u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1',
u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4',
u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2',
u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3',
u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4',
u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2',
u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'],
u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at',
u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk',
u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com',
u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com',
u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com',
u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com',
u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com',
u'alsoasite.com'],
'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%',
u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %',
u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %',
u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %',
u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %'
, u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %'
, u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %',
u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%',
u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%',
u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %',
u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%',
u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %',
u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %',
u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %',
u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%',
u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%',
u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%',
u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%',
u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%',
u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %',
u'alsoasite.com 1,1 | < 10\xa0%'],
u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %',
u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %',
u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %',
u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %',
u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %',
u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%',
u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'],
u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8',
u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2',
u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4',
u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2',
u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9',
u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'],
u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet',
u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile',
u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet',
u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet',
u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet',
u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Mobile']}

placeholder = pd.DataFrame(data)

Error I receive when I use the rank() function with method='first'

C:\Users\username\code\report-creator>python recomp-report-04.py
Traceback (most recent call last):
  File "recomp-report-04.py", line 71, in <module>
    placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl
e'].rank(method='first').astype(int)
  File "<string>", line 35, in rank
  File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand
as\core\groupby.py", line 561, in wrapper
    raise ValueError
ValueError

My solution

Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.

Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.

for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']):
    df['new_rank'] = group['ranking'].rank(method='first').astype(int)

This gave me the following error:

ValueError: first not supported for non-numeric data

So this helped me realize that I should convert the column to floats. This is how I did it.

# Converting the ranking column to a float
df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.'))))

# Creating a new column with a rank
df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int)

# Dropping all ranks after the 10
df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10)

# Pivotting the column
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x))

# Naming the columns with 'top' + number
df.columns = ['top ' + str(i) for i in list(df.columns.values)]

So this worked for me. Thank you guys!

Fluous over 7 years

Unfortunately I still get a valueError. In order to make things more clear I will add an actual example in a comment under my original post.
Nickil Maveli over 7 years

Do you mean to say the values for the new_rank are not in the order shown? But this seems to follow the same order as posted in your expected DF. Regarding the value error, you must post the complete traceback included in your question.
Fluous over 7 years

Hey @Nickil, I have added the error in my original post at the end. I am not sure whether the values of new_rank are not in the order shown, because I cannot replicate it as of yet. However, I was wondering how <method='first'> ranks, are these ordinal ranks on the basis of the value of another column or ranks does it rank the rows based on something else? Also I wanted to thank you for all your previous effort! It is much appreciated!
Nickil Maveli over 7 years

I've edited the post to show you the DF used to replicate the values. There's nothing wrong with way groupby.rank is functioning. The problem might be somewhere else as you've only provided a sample of your data. And why are there extra commas in Ranking and also sometimes in ImpressionShare? I couldn't simulate it because of those incorrect parsing.
Fluous over 7 years

Hey @Nickil, that is the German notation. I think this is the reason I get the error. After trying it on another way, I got a more elaborate error that said the following. ValueError: first not supported for non-numeric data So when I figured that out, I decided to look into how to transform the column into floats! I will keep you guys posted.
Nickil Maveli over 7 years

Yeah, method=first can be applied only to numeric data. But why is it that in rank_google column are you getting the numbers separated by commas. Shouldn't it be just a number? If you can somehow figure that out, this would work. Just so you know, there's also a method=dense which could be applied to non-numeric data(strings).