Pandas - Alternative to rank() function that gives unique ordinal ranks for a column
I think the way you were trying to use the method=first
to rank them after sorting were causing problems.
You could simply use the rank method with first
arg on the grouped object itself giving you the desired unique ranks per group.
df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])
0 2
1 3
2 1
3 4
4 3
5 1
6 2
7 4
8 2
9 3
10 1
11 4
12 2
13 3
14 1
15 4
Name: new_rank, dtype: int32
Perform pivot operation:
df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
values=['adtext'], aggfunc=lambda x: ' '.join(x))
Choose the second level of the multiindex columns which pertain to the rank numbers:
df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df
Data:(to replicate)
df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9,
2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})
Note: method=first
assigns ranks in the order they appear in the array/series.
Fluous
Updated on June 11, 2022Comments
-
Fluous almost 2 years
At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinal ranks within a group.
My Python and Pandas knowledge is limited as I am just a beginner.
The Goal
I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.
Table 1: Data structure I am trying to achieve
device , weeks , rank_1 , rank_2 , rank_3 , rank_4 , rank_5 mobile , wk 1 , string , string , string , string , string mobile , wk 2 , string , string , string , string , string computer, wk 1 , string , string , string , string , string computer, wk 2 , string , string , string , string , string
The Problem
The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.
Table 2: Data structure I have
weeks device , website , ranking , adtext wk 1 mobile , url1 , *2.1 , string wk 1 mobile , url2 , *2.1 , string wk 1 mobile , url3 , 1.0 , string wk 1 mobile , url4 , 2.9 , string wk 1 desktop , *url5 , 2.1 , string wk 1 desktop , url2 , *1.5 , string wk 1 desktop , url3 , *1.5 , string wk 1 desktop , url4 , 2.9 , string wk 2 mobile , url1 , 2.0 , string wk 2 mobile , *url6 , 2.1 , string wk 2 mobile , url3 , 1.0 , string wk 2 mobile , url4 , 2.9 , string wk 2 desktop , *url5 , 2.1 , string wk 2 desktop , url2 , *2.9 , string wk 2 desktop , url3 , 1.0 , string wk 2 desktop , url4 , *2.9 , string
Table 3: The table I cannot seem to create
weeks device , website , ranking , adtext , ranking wk 1 mobile , url1 , *2.1 , string , 2 wk 1 mobile , url2 , *2.1 , string , 3 wk 1 mobile , url3 , 1.0 , string , 1 wk 1 mobile , url4 , 2.9 , string , 4 wk 1 desktop , *url5 , 2.1 , string , 3 wk 1 desktop , url2 , *1.5 , string , 1 wk 1 desktop , url3 , *1.5 , string , 2 wk 1 desktop , url4 , 2.9 , string , 4 wk 2 mobile , url1 , 2.0 , string , 2 wk 2 mobile , *url6 , 2.1 , string , 3 wk 2 mobile , url3 , 1.0 , string , 1 wk 2 mobile , url4 , 2.9 , string , 4 wk 2 desktop , *url5 , 2.1 , string , 2 wk 2 desktop , url2 , *2.9 , string , 3 wk 2 desktop , url3 , 1.0 , string , 1 wk 2 desktop , url4 , *2.9 , string , 4
The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True]) df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)
The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True]) df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)
The .rank(method="first", ascending=True) throws a ValueError
df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True]) df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)
ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))
My question to you
I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.
Thank you!
Sebastiaan
EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.
So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.
Also Steven G asked me for an example to play around with. I have provided that here.
Example data can be pasted directly into python
! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.
data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1', u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2', u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3', u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4', u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'], u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com', u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at', u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se', u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com', u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com', u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com', u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com', u'alsoasite.com'], 'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%', u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %', u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %', u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %', u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %' , u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %' , u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %', u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%', u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %', u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%', u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %', u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %', u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %', u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%', u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%', u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%', u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%', u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%', u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %', u'alsoasite.com 1,1 | < 10\xa0%'], u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %', u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %', u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %', u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %', u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %', u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%', u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'], u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8', u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2', u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4', u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2', u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9', u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'], u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer', u'Mobile', u'Tablet', u'Mobile', u'Mobile']} placeholder = pd.DataFrame(data)
Error I receive when I use the rank() function with method='first'
C:\Users\username\code\report-creator>python recomp-report-04.py Traceback (most recent call last): File "recomp-report-04.py", line 71, in <module> placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl e'].rank(method='first').astype(int) File "<string>", line 35, in rank File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand as\core\groupby.py", line 561, in wrapper raise ValueError ValueError
My solution
Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.
Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.
for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']): df['new_rank'] = group['ranking'].rank(method='first').astype(int)
This gave me the following error:
ValueError: first not supported for non-numeric data
So this helped me realize that I should convert the column to floats. This is how I did it.
# Converting the ranking column to a float df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.')))) # Creating a new column with a rank df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int) # Dropping all ranks after the 10 df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10) # Pivotting the column df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x)) # Naming the columns with 'top' + number df.columns = ['top ' + str(i) for i in list(df.columns.values)]
So this worked for me. Thank you guys!
-
Fluous over 7 yearsUnfortunately I still get a valueError. In order to make things more clear I will add an actual example in a comment under my original post.
-
Nickil Maveli over 7 yearsDo you mean to say the values for the
new_rank
are not in the order shown? But this seems to follow the same order as posted in your expectedDF
. Regarding the value error, you must post the complete traceback included in your question. -
Fluous over 7 yearsHey @Nickil, I have added the error in my original post at the end. I am not sure whether the values of new_rank are not in the order shown, because I cannot replicate it as of yet. However, I was wondering how <method='first'> ranks, are these ordinal ranks on the basis of the value of another column or ranks does it rank the rows based on something else? Also I wanted to thank you for all your previous effort! It is much appreciated!
-
Nickil Maveli over 7 yearsI've edited the post to show you the
DF
used to replicate the values. There's nothing wrong with waygroupby.rank
is functioning. The problem might be somewhere else as you've only provided a sample of your data. And why are there extra commas inRanking
and also sometimes inImpressionShare
? I couldn't simulate it because of those incorrect parsing. -
Fluous over 7 yearsHey @Nickil, that is the German notation. I think this is the reason I get the error. After trying it on another way, I got a more elaborate error that said the following.
ValueError: first not supported for non-numeric data
So when I figured that out, I decided to look into how to transform the column into floats! I will keep you guys posted. -
Nickil Maveli over 7 yearsYeah,
method=first
can be applied only to numeric data. But why is it that inrank_google
column are you getting the numbers separated by commas. Shouldn't it be just a number? If you can somehow figure that out, this would work. Just so you know, there's also amethod=dense
which could be applied to non-numeric data(strings).