count unique combinations of values

18,944

Solution 1

The count function in the plyr package will do that task.

> df
  ID value.1 value.2 value.3 value.4
1  1       M       D       F       A
2  2       F       M       G       B
3  3       M       D       F       A
4  4       L       D       E       B
> library(plyr)
> count(df[, -1])
  value.1 value.2 value.3 value.4 freq
1       F       M       G       B    1
2       L       D       E       B    1
3       M       D       F       A    2

Solution 2

N <- 10000

d <- data.frame(
  ID=seq(1, N), 
  v1=sample(c("M","F", "M", "L"), N, replace = TRUE), 
  v2=sample(c("D","M","D","D"), N, replace = TRUE), 
  v3=sample(c("F","G","F","E"), N, replace = TRUE),
  v4=sample(c("A","B","A","B"), N, replace = TRUE)
)

With data.table (fastest)

dt <- data.table::as.data.table(d)
dt[, .N, by = c('v1','v2','v3','v4')]
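If you want a named count column and sorted output, `.N` can be renamed and `keyby` used instead of `by`. A minimal self-contained sketch (same columns as Solution 2's d, shrunk to 4 rows):

```r
library(data.table)

# Same columns as Solution 2's d, shrunk to 4 rows
dt <- data.table(
  v1 = c("M", "F", "M", "L"),
  v2 = c("D", "M", "D", "D"),
  v3 = c("F", "G", "F", "E"),
  v4 = c("A", "B", "A", "B")
)

# .N is the number of rows per group; keyby also sorts by the grouping columns
res <- dt[, .(freq = .N), keyby = .(v1, v2, v3, v4)]
```

Here `res` has one row per unique combination, with the count in `freq`.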

With dplyr

dplyr::count_(d, vars = c('v1','v2','v3','v4'))
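Note that count_ has since been deprecated in dplyr in favour of count with unquoted column names. A minimal self-contained sketch of the modern form (same columns as Solution 2's d, shrunk to 4 rows):

```r
library(dplyr)

# Same columns as Solution 2's d, shrunk to 4 rows
d <- data.frame(
  ID = 1:4,
  v1 = c("M", "F", "M", "L"),
  v2 = c("D", "M", "D", "D"),
  v3 = c("F", "G", "F", "E"),
  v4 = c("A", "B", "A", "B")
)

# Modern dplyr: grouping columns are passed unquoted; the count lands in column n
res <- count(d, v1, v2, v3, v4)
```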

With plyr

plyr::count(d, vars = c('v1','v2','v3','v4'))
plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow)

With aggregate (slowest)

aggregate(ID ~ ., d, FUN = length)
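A base-R alternative for this particular case is table, which cross-tabulates every combination of factor levels; dropping the zero-frequency rows leaves only the combinations that occur. A minimal self-contained sketch (same columns as Solution 2's d, shrunk to 4 rows):

```r
# Same columns as Solution 2's d, shrunk to 4 rows
d <- data.frame(
  ID = 1:4,
  v1 = c("M", "F", "M", "L"),
  v2 = c("D", "M", "D", "D"),
  v3 = c("F", "G", "F", "E"),
  v4 = c("A", "B", "A", "B")
)

# table() counts every combination of levels, including unobserved ones
res <- as.data.frame(table(d[, -1]))

# Keep only combinations that actually occur in the data
res <- res[res$Freq > 0, ]
```

Note that table enumerates all level combinations before filtering, so it can use a lot of memory when the columns have many distinct values.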

Benchmark

microbenchmark::microbenchmark(dt[, .N, by = c('v1','v2','v3','v4')],
                               plyr::count(d, vars = c('v1','v2','v3','v4')),
                               plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow),
                               dplyr::count_(d, vars = c('v1','v2','v3','v4')),
                               aggregate(ID ~ ., d, FUN = length), 
                               times = 1000)

Unit: microseconds
                                                         expr      min       lq      mean   median        uq        max neval  cld
                     dt[, .N, by = c("v1", "v2", "v3", "v4")]  887.807 1107.543  1263.777 1174.258  1289.724   4263.156  1000 a   
             plyr::count(d, vars = c("v1", "v2", "v3", "v4")) 3912.791 4270.387  5379.080 4498.053  5791.743 157146.103  1000   c 
 plyr::ddply(d, .variables = c("v1", "v2", "v3", "v4"), nrow) 7737.874 8553.370 10630.849 9018.266 11126.517 187301.696  1000    d
           dplyr::count_(d, vars = c("v1", "v2", "v3", "v4")) 2126.913 2432.957  2763.499 2568.251  2789.386  12549.669  1000  b  
                           aggregate(ID ~ ., d, FUN = length) 7395.440 8121.828 10546.659 8776.371 10858.263 210139.759  1000    d

It seems best to simply use data.table instead of a data.frame, as it is the fastest and doesn't need another function or library to count. Note also that the aggregate function performs much more slowly on large data sets.

Final note: feel free to update with new methods.

Solution 3

Without plyr:

aggregate(ID ~ ., d, FUN = length)  # "." means all variables in d except ID
Author: Elisa

Updated on June 04, 2022

Comments

  • Elisa (almost 2 years ago)

    My dataframe looks like this:

    ID | value 1 | value 2 | value 3 | value 4
    1  |    M    |    D    |    F    |   A
    2  |    F    |    M    |    G    |   B
    3  |    M    |    D    |    F    |   A
    4  |    L    |    D    |    E    |   B
    

    I want to get something like this:

    value 1 | value 2 | value 3 | value 4 | Number of combinations
      M     |    D    |    F    |   A    |     2
      F     |    M    |    G    |   B    |     1
      L     |    D    |    E    |   B    |     1
    

    That is, I want to count the number of unique combinations of the columns value 1 - value 4.

  • Davor Josipovic (almost 7 years ago)
    Nice, but very (!!) slow compared to plyr::count. From a microbenchmark test with a 4000x500 data frame, summarizing frequencies for 3 columns, it seems that count is 20x (!) faster.
  • Frank (almost 7 years ago)
    I think you should show the results for a somewhat larger data set. Also, I guess plyr::count(d, c('v1','v2','v3','v4')) might be correct, after looking at args(plyr::count). There is probably also a dplyr::count that could be considered.
  • Davor Josipovic (almost 7 years ago)
    @Frank, it is now based on 10k rows.
  • Frank (almost 7 years ago)
    Ok, thanks. One other point: length(ID) doesn't seem like it would give the correct result now, since ID repeats values. It's generally a good idea to test for equality of results across the methods. Oh never mind, I guess it doesn't matter what values it has. Fwiw, the reason data.table is fast is documented in ?GForce.
  • Frank (almost 7 years ago)
    They give a different format, but tapply(d$ID, d[, -1], length) and table(d[, -1]) are also fast for this particular case.