Pandas - Alternative to rank() function that gives unique ordinal ranks for a column

13,686

I think the way you were trying to use the method=first to rank them after sorting were causing problems.

You could simply use the rank method with first arg on the grouped object itself giving you the desired unique ranks per group.

df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])

0     2
1     3
2     1
3     4
4     3
5     1
6     2
7     4
8     2
9     3
10    1
11    4
12    2
13    3
14    1
15    4
Name: new_rank, dtype: int32

Perform pivot operation:

df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
                    values=['adtext'], aggfunc=lambda x: ' '.join(x))

Choose the second level of the multiindex columns which pertain to the rank numbers:

df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df

Image_2


Data:(to replicate)

df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
                             'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
                  'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
                             'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
                  'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
                             'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
                  'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9, 
                              2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
                  'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
                             'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})

Note: method=first assigns ranks in the order they appear in the array/series.

Share:
13,686
Fluous
Author by

Fluous

Updated on June 11, 2022

Comments

  • Fluous
    Fluous almost 2 years

    At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinal ranks within a group.

    My Python and Pandas knowledge is limited as I am just a beginner.

    The Goal

    I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.

    Table 1: Data structure I am trying to achieve

        device  , weeks   , rank_1   , rank_2   , rank_3   , rank_4   , rank_5
        mobile  , wk 1    , string   , string   , string   , string   , string 
        mobile  , wk 2    , string   , string   , string   , string   , string 
        computer, wk 1    , string   , string   , string   , string   , string
        computer, wk 2    , string   , string   , string   , string   , string
    

    The Problem

    The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.

    Table 2: Data structure I have

        weeks    device   , website  , ranking  , adtext
        wk 1     mobile   , url1     , *2.1     , string
        wk 1     mobile   , url2     , *2.1     , string
        wk 1     mobile   , url3     , 1.0      , string
        wk 1     mobile   , url4     , 2.9      , string
        wk 1     desktop  , *url5    , 2.1      , string
        wk 1     desktop  , url2     , *1.5     , string
        wk 1     desktop  , url3     , *1.5     , string
        wk 1     desktop  , url4     , 2.9      , string
        wk 2     mobile   , url1     , 2.0      , string
        wk 2     mobile   , *url6    , 2.1      , string
        wk 2     mobile   , url3     , 1.0      , string
        wk 2     mobile   , url4     , 2.9      , string
        wk 2     desktop  , *url5    , 2.1      , string
        wk 2     desktop  , url2     , *2.9     , string
        wk 2     desktop  , url3     , 1.0      , string
        wk 2     desktop  , url4     , *2.9     , string
    

    Table 3: The table I cannot seem to create

        weeks    device   , website  , ranking  , adtext  , ranking
        wk 1     mobile   , url1     , *2.1     , string  , 2
        wk 1     mobile   , url2     , *2.1     , string  , 3
        wk 1     mobile   , url3     , 1.0      , string  , 1
        wk 1     mobile   , url4     , 2.9      , string  , 4
        wk 1     desktop  , *url5    , 2.1      , string  , 3
        wk 1     desktop  , url2     , *1.5     , string  , 1
        wk 1     desktop  , url3     , *1.5     , string  , 2
        wk 1     desktop  , url4     , 2.9      , string  , 4
        wk 2     mobile   , url1     , 2.0      , string  , 2
        wk 2     mobile   , *url6    , 2.1      , string  , 3
        wk 2     mobile   , url3     , 1.0      , string  , 1
        wk 2     mobile   , url4     , 2.9      , string  , 4
        wk 2     desktop  , *url5    , 2.1      , string  , 2
        wk 2     desktop  , url2     , *2.9     , string  , 3
        wk 2     desktop  , url3     , 1.0      , string  , 1
        wk 2     desktop  , url4     , *2.9     , string  , 4
    

    The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.

    df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
    
    df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)
    

    The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem

    df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
    
    df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)
    

    The .rank(method="first", ascending=True) throws a ValueError

    df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])
    
    df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)
    

    ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.

    df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))
    

    My question to you

    I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.

    Thank you!

    Sebastiaan


    EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.

    So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.

    Also Steven G asked me for an example to play around with. I have provided that here.

    Example data can be pasted directly into python

    ! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.

    data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1',
    u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1',
    u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4',
    u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2',
    u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3',
    u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4',
    u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2',
    u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'],
    u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com',
    u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
    u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
    u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at',
    u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk',
    u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk',
    u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se',
    u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com',
    u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com',
    u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com',
    u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com',
    u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com',
    u'alsoasite.com'],
    'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%',
    u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %',
    u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %',
    u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %',
    u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %'
    , u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %'
    , u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %',
    u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%',
    u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%',
    u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %',
    u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%',
    u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %',
    u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %',
    u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %',
    u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%',
    u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%',
    u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%',
    u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%',
    u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%',
    u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %',
    u'alsoasite.com 1,1 | < 10\xa0%'],
    u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %',
    u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %',
    u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %',
    u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%',
    u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %',
    u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %',
    u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%',
    u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%',
    u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%',
    u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'],
    u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8',
    u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2',
    u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4',
    u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2',
    u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9',
    u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'],
    u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer',
    u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet',
    u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile',
    u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet',
    u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet',
    u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet',
    u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer',
    u'Mobile', u'Tablet', u'Mobile', u'Mobile']}
    
    placeholder = pd.DataFrame(data)
    

    Error I receive when I use the rank() function with method='first'

    C:\Users\username\code\report-creator>python recomp-report-04.py
    Traceback (most recent call last):
      File "recomp-report-04.py", line 71, in <module>
        placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl
    e'].rank(method='first').astype(int)
      File "<string>", line 35, in rank
      File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand
    as\core\groupby.py", line 561, in wrapper
        raise ValueError
    ValueError
    

    My solution

    Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.

    Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.

    for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']):
        df['new_rank'] = group['ranking'].rank(method='first').astype(int)
    

    This gave me the following error:

    ValueError: first not supported for non-numeric data
    

    So this helped me realize that I should convert the column to floats. This is how I did it.

    # Converting the ranking column to a float
    df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.'))))
    
    # Creating a new column with a rank
    df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int)
    
    # Dropping all ranks after the 10
    df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10)
    
    # Pivotting the column
    df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x))
    
    # Naming the columns with 'top' + number
    df.columns = ['top ' + str(i) for i in list(df.columns.values)]
    

    So this worked for me. Thank you guys!

  • Fluous
    Fluous over 7 years
    Unfortunately I still get a valueError. In order to make things more clear I will add an actual example in a comment under my original post.
  • Nickil Maveli
    Nickil Maveli over 7 years
    Do you mean to say the values for the new_rank are not in the order shown? But this seems to follow the same order as posted in your expected DF. Regarding the value error, you must post the complete traceback included in your question.
  • Fluous
    Fluous over 7 years
    Hey @Nickil, I have added the error in my original post at the end. I am not sure whether the values of new_rank are not in the order shown, because I cannot replicate it as of yet. However, I was wondering how <method='first'> ranks, are these ordinal ranks on the basis of the value of another column or ranks does it rank the rows based on something else? Also I wanted to thank you for all your previous effort! It is much appreciated!
  • Nickil Maveli
    Nickil Maveli over 7 years
    I've edited the post to show you the DF used to replicate the values. There's nothing wrong with way groupby.rank is functioning. The problem might be somewhere else as you've only provided a sample of your data. And why are there extra commas in Ranking and also sometimes in ImpressionShare? I couldn't simulate it because of those incorrect parsing.
  • Fluous
    Fluous over 7 years
    Hey @Nickil, that is the German notation. I think this is the reason I get the error. After trying it on another way, I got a more elaborate error that said the following. ValueError: first not supported for non-numeric data So when I figured that out, I decided to look into how to transform the column into floats! I will keep you guys posted.
  • Nickil Maveli
    Nickil Maveli over 7 years
    Yeah, method=first can be applied only to numeric data. But why is it that in rank_google column are you getting the numbers separated by commas. Shouldn't it be just a number? If you can somehow figure that out, this would work. Just so you know, there's also a method=dense which could be applied to non-numeric data(strings).