Sort a dataframe column by the frequency of occurrence

r sorting dataframe

11,592

Solution 1

One possible approach is to use data.table to add freq column, then sort your data accordingly:

library(data.table)
setDT(df)[,freq := .N, by = c("Region","Salary")]

# Sort
df[order(freq, decreasing = T),]

# As a oneliner (thx @Jaap)
setDT(df)[, freq := .N, by = .(Region,Salary)][order(-freq)]

Solution 2

Base R

df <- transform(df, freq= ave(seq(nrow(df)), Salary, FUN=length))
df[order(-df$freq), ]

Output:

   Region  ID Salary freq
6       2  A6   1002    3
7       2  A7   1002    3
8       2  A8   1002    3
2       1  A2   1001    2
5       1  A5   1001    2
9       3  A9   3001    2
10      3 A10   3001    2
1       1  A1    100    1
3       1  A3   2000    1
4       1  A4   2431    1
11      3 A11   4001    1

dplyr

library(dplyr)
df %>%
  add_count(Salary) %>% 
  arrange(desc(n))

Output:

  Region  ID Salary n
1       2  A6   1002 3
2       2  A7   1002 3
3       2  A8   1002 3
4       1  A2   1001 2
5       1  A5   1001 2
6       3  A9   3001 2
7       3 A10   3001 2
8       1  A1    100 1
9       1  A3   2000 1
10      1  A4   2431 1
11      3 A11   4001 1

Solution 3

library(dplyr)
data %>% group_by(Region, Salary) %>% summarise(n=n()) %>% ungroup() %>% arrange(-n)

(edited: incloude commentary on the ungroupfrom Jaap, good point)

11,592

Author by

Anubhav Dikshit

Updated on June 05, 2022

Comments

Anubhav Dikshit almost 2 years
I have a dataframe in called df, there are three column lets say,
```
Region ID  Salary
1      A1  100
1      A2  1001
1      A3  2000
1      A4  2431
1      A5  1001
..............
..............
2      A6  1002
2      A7  1002
2      A8  1002
3      A9  3001
3      A10 3001
3      A11 4001
```
Now I want to sort column Salary by the occurrence of them by Region, that is using frequency table or something, get the probability of occurrence per region and sort them. Please assume that the dataset is large enough (1000 rows)

P.S: Can anyone suggest a good method to do some. Please use column name in your answers since the real table has some column in the middle

Thanks in advance
```
                  **EDIT 1**
```
I think I was not clear enough, thanks for all those who replied, I sincerely apologise for not being clear:

With the current dataset we need to create a frequency table say:
```
Region  Salary(bin)     Count
1       1K              6                   
1       5K              3                   
1       2K              2                   
1       15K             2                   
1       0.5K            2                   
1       24K             1                   
1       0K              0                   
```
using this we can classify add a new columns in our data frame df called bin(bucket from histogram)
```
Region     ID  Salary  (bin)   Count
    1      A1  100     1K      6
    1      A2  1001    2K      2
    1      A3  2000    2K      2
    1      A4  2431    5K      3
```
..........................So on...............

We can do the above using:
```
df$bin <- cut(df$salary, breaks=hist(df$salary)$breaks)
```
After sorting by Region and Count and Salary we get:
```
Region     ID  Salary  (bin)   Count
    1      A1  100     1K      6
    1      A4  2431    5K      3
    1      A3  2000    2K      2
    1      A2  1001    2K      2
```
As you can see, we need to create frequency table for each region and do sort. I did the above using Tableau but I want to automate this in R

Hope I was clear
- Anubhav Dikshit over 8 years
  
  Does not this just sort by the numerical value instead of the frequency of occurance?
- mtoto over 8 years
  
  you can still use the below approaches as far as I can see.
- Anubhav Dikshit over 8 years
  
  Yes, but I agree ( hence the upvotes), was hoping for a more elegant solution than mine
Jaap over 8 years

even shorter: setDT(df)[, .N, by = .(Region,Salary)][order(-N)]
Jaap over 8 years

or: df %>% group_by(Region, Salary) %>% summarise(n=n()) %>% ungroup() %>% arrange(-n)
Jaap over 8 years

if you want to retain the whole dataframe, you will need to add in th freq := again
Jaap over 8 years

if you want to retain the whole dataframe, then you need to use mutate instead of summarise