Sort a dataframe column by the frequency of occurrence

11,592

Solution 1

One possible approach is to use data.table to add freq column, then sort your data accordingly:

library(data.table)
setDT(df)[,freq := .N, by = c("Region","Salary")]

# Sort
df[order(freq, decreasing = T),]

# As a oneliner (thx @Jaap)
setDT(df)[, freq := .N, by = .(Region,Salary)][order(-freq)]

Solution 2

Base R

df <- transform(df, freq= ave(seq(nrow(df)), Salary, FUN=length))
df[order(-df$freq), ]

Output:

   Region  ID Salary freq
6       2  A6   1002    3
7       2  A7   1002    3
8       2  A8   1002    3
2       1  A2   1001    2
5       1  A5   1001    2
9       3  A9   3001    2
10      3 A10   3001    2
1       1  A1    100    1
3       1  A3   2000    1
4       1  A4   2431    1
11      3 A11   4001    1

dplyr

library(dplyr)
df %>%
  add_count(Salary) %>% 
  arrange(desc(n))

Output:

  Region  ID Salary n
1       2  A6   1002 3
2       2  A7   1002 3
3       2  A8   1002 3
4       1  A2   1001 2
5       1  A5   1001 2
6       3  A9   3001 2
7       3 A10   3001 2
8       1  A1    100 1
9       1  A3   2000 1
10      1  A4   2431 1
11      3 A11   4001 1

Solution 3

library(dplyr)
data %>% group_by(Region, Salary) %>% summarise(n=n()) %>% ungroup() %>% arrange(-n)

(edited: incloude commentary on the ungroupfrom Jaap, good point)

Share:
11,592
Anubhav Dikshit
Author by

Anubhav Dikshit

Updated on June 05, 2022

Comments

  • Anubhav Dikshit
    Anubhav Dikshit almost 2 years

    I have a dataframe in called df, there are three column lets say,

    Region ID  Salary
    1      A1  100
    1      A2  1001
    1      A3  2000
    1      A4  2431
    1      A5  1001
    ..............
    ..............
    2      A6  1002
    2      A7  1002
    2      A8  1002
    3      A9  3001
    3      A10 3001
    3      A11 4001
    

    Now I want to sort column Salary by the occurrence of them by Region, that is using frequency table or something, get the probability of occurrence per region and sort them. Please assume that the dataset is large enough (1000 rows)

    P.S: Can anyone suggest a good method to do some. Please use column name in your answers since the real table has some column in the middle

    Thanks in advance

                      **EDIT 1**
    

    I think I was not clear enough, thanks for all those who replied, I sincerely apologise for not being clear:

    With the current dataset we need to create a frequency table say:

    Region  Salary(bin)     Count
    1       1K              6                   
    1       5K              3                   
    1       2K              2                   
    1       15K             2                   
    1       0.5K            2                   
    1       24K             1                   
    1       0K              0                   
    

    using this we can classify add a new columns in our data frame df called bin(bucket from histogram)

    Region     ID  Salary  (bin)   Count
        1      A1  100     1K      6
        1      A2  1001    2K      2
        1      A3  2000    2K      2
        1      A4  2431    5K      3
    

    ..........................So on...............

    We can do the above using:

    df$bin <- cut(df$salary, breaks=hist(df$salary)$breaks)
    

    After sorting by Region and Count and Salary we get:

    Region     ID  Salary  (bin)   Count
        1      A1  100     1K      6
        1      A4  2431    5K      3
        1      A3  2000    2K      2
        1      A2  1001    2K      2
    

    As you can see, we need to create frequency table for each region and do sort. I did the above using Tableau but I want to automate this in R

    Hope I was clear

    • Anubhav Dikshit
      Anubhav Dikshit over 8 years
      Does not this just sort by the numerical value instead of the frequency of occurance?
    • mtoto
      mtoto over 8 years
      you can still use the below approaches as far as I can see.
    • Anubhav Dikshit
      Anubhav Dikshit over 8 years
      Yes, but I agree ( hence the upvotes), was hoping for a more elegant solution than mine
  • Jaap
    Jaap over 8 years
    even shorter: setDT(df)[, .N, by = .(Region,Salary)][order(-N)]
  • Jaap
    Jaap over 8 years
    or: df %>% group_by(Region, Salary) %>% summarise(n=n()) %>% ungroup() %>% arrange(-n)
  • Jaap
    Jaap over 8 years
    if you want to retain the whole dataframe, you will need to add in th freq := again
  • Jaap
    Jaap over 8 years
    if you want to retain the whole dataframe, then you need to use mutate instead of summarise