Find how many times duplicated rows repeat in R data frame
Solution 1
Here is a solution using the ddply() function from the plyr package:
library(plyr)
ddply(df,.(a,b),nrow)
a b V1
1 1 2.5 1
2 1 3.5 2
3 2 2.0 2
4 3 1.0 1
5 4 2.2 1
6 4 7.0 1
Solution 2
You could always kill two birds with one stone:
aggregate(list(numdup=rep(1,nrow(df))), df, length)
# or even:
aggregate(numdup ~., data=transform(df,numdup=1), length)
# or even:
aggregate(cbind(df[0],numdup=1), df, length)
a b numdup
1 3 1.0 1
2 2 2.0 2
3 4 2.2 1
4 1 2.5 1
5 1 3.5 2
6 4 7.0 1
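To see why the column of ones is needed, here is a minimal base R sketch using the example data from the question. The intermediate names `ones` and `counts` are introduced here only for illustration. `aggregate()` groups by every column of `df`, and `length()` then counts how many ones (that is, how many rows) fall into each group.

```r
# Example data from the question
a <- c(1, 1, 1, 2, 2, 3, 4, 4)
b <- c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <- data.frame(a, b)

# One 1 per row, so the aggregated column has the same length
# as the grouping columns
ones <- rep(1, nrow(df))

# Group by all columns of df and count the rows in each group
counts <- aggregate(list(numdup = ones), by = df, FUN = length)

# The counts as a plain vector, in aggregate()'s output order
counts$numdup
```

Note that the vector follows the row order of the `aggregate()` result shown above, not the order of rows in `df`.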
Solution 3
Here are two approaches.
# an example data set that is not sorted
DF <- data.frame(replicate(sequence(1:3), n = 2))
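# example using a similar idea to duplicated.data.frame
count.duplicates <- function(DF){
  # build one key string per row by pasting all columns together
  x <- do.call('paste', c(DF, sep = '\r'))
  ox <- order(x)
  # run lengths of the sorted keys are the duplicate counts
  rl <- rle(x[ox])
  # keep the last row of each run and attach its count
  cbind(DF[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}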
count.duplicates(DF)
# X1 X2 count
# 4 1 1 3
# 5 2 2 2
# 6 3 3 1
# a far simpler `data.table` approach
library(data.table)
count.dups <- function(DF){
  DT <- data.table(DF)
  DT[, .N, by = names(DT)]
}
count.dups(DF)
# X1 X2 N
# 1: 1 1 3
# 2: 2 2 2
# 3: 3 3 1
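The question asks for the counts as a vector. A minimal base R sketch of one way to get that, reusing the paste-key idea from count.duplicates above, might look like the following. The helper name `count_vector` is hypothetical, and the sketch assumes that no value in the data contains the `"\r"` separator. It returns one count per distinct row, in the same order as `unique(df)`.

```r
# Example data from the question
a <- c(1, 1, 1, 2, 2, 3, 4, 4)
b <- c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <- data.frame(a, b)

# Hypothetical helper: one count per distinct row, in order of
# first appearance (the same order that unique(DF) uses)
count_vector <- function(DF) {
  # one key string per row, built from all columns
  key <- do.call(paste, c(DF, sep = "\r"))
  # index of the distinct row that each row belongs to
  first <- match(key, unique(key))
  # number of rows mapped to each distinct row
  tabulate(first)
}

dup_counts <- count_vector(df)

# Pair each distinct row with its count
cbind(unique(df), count = dup_counts)
```

Because the helper pastes every column, it does not depend on the column names a and b.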
Solution 4
Using dplyr:
library(dplyr)
summarise(group_by(df,a,b),length(b))
or
group_size(group_by(df,a,b))
#[1] 1 2 2 1 1 1
Author: rose
Updated on July 26, 2020
Comments
-
rose over 3 years
I have a data frame like the following example:
a = c(1, 1, 1, 2, 2, 3, 4, 4)
b = c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <- data.frame(a, b)
I can remove duplicated rows from an R data frame with the following code, but how can I find how many times each duplicated row is repeated? I need the result as a vector.
unique(df)
or
df[!duplicated(df), ]
-
Eric Krantz about 2 years: This is not a duplicate of "Count number of rows within each group". This one is counting duplicates, the other question is counting how many in each group (and the rows in a group do not have to be duplicates of each other).
-
-
orizon over 10 years: You could save a few characters by replacing function(x) nrow(x) with just nrow.
-
Didzis Elferts over 10 years: @orizon thanks, updated my answer.
-
maj almost 10 years: Is it at all possible to recreate this with dplyr?
-
Didzis Elferts almost 10 years: @maj I haven't used dplyr, so I can't answer.
-
Daniel Chen almost 9 years: Don't forget about the pipe! df %>% group_by(a, b) %>% group_size()
-
DukeLover almost 7 years: Could you please explain the reason behind the replication in aggregate(list(numdup=rep(1,nrow(df))), df, length)?
-
thelatemail almost 7 years: @dukelover - aggregate needs the column(s) being summed to be the same length as the grouping variables, so I just repeat 1 to get this.
-
DukeLover almost 7 years: Thanks a lot for your reply. Can you please explain this code: aggregate(numdup ~., data=transform(df,numdup=1), length)? What is the significance of numdup ~ here?
-
3pitt over 6 years: Is there a solution that's agnostic to the columns a, b (i.e., one that uses all columns)?
-
PesKchan over 2 years: Your first solution is terrific and terrifying at the same time; every time I think of the function, it's a nightmare.
-
jtr13 over 2 years: Or df %>% group_by_all() %>% count