idata.frame: Why error "is.data.frame(df) is not TRUE"?


Given that you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.

Specifically, use lapply(.SD, FUN) together with the .SDcols and by arguments.
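As a minimal sketch of how these pieces fit together, the toy table below computes a per-group median for only the columns named in .SDcols. The table `toy` and its columns are made up for illustration and are not taken from exp.

```r
library(data.table)

# Hypothetical toy data: one grouping column, two numeric columns,
# and one character column that should be left out of the summary
toy <- data.table(
  g     = c("a", "a", "b", "b"),
  x     = c(1, 3, 5, 9),
  y     = c(2, 4, 6, 8),
  label = c("p", "q", "r", "s")
)

# .SD holds the rows of the current group, restricted to the columns
# listed in .SDcols; lapply() applies median() to each of those columns
toy_median <- toy[, lapply(.SD, median), by = g, .SDcols = c("x", "y")]
print(toy_median)
```

The result has one row per value of g and one median column for each entry in .SDcols; the label column is dropped because it is not listed.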

Set up the data.table

library(data.table)
library(plyr)  # provides idata.frame() and ddply()
DT <- as.data.table(exp)
iexp <- idata.frame(exp)

Find which columns are numeric

numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))

Compute the median of each numeric column within each group

dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase, 
    fCycle), .SDcols = numeric_columns]
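The ddply call in the question passes na.rm = TRUE, whereas the data.table call above uses plain median. If the columns can contain NA, extra arguments can be handed to the function through lapply(). A small sketch on made-up data (`toy_na` is hypothetical, not part of exp):

```r
library(data.table)

# Hypothetical data with one missing value in group "a"
toy_na <- data.table(g = c("a", "a", "a", "b", "b"),
                     x = c(1, NA, 3, 4, 6))

# Without na.rm, a group that contains an NA gets an NA median
keep_na <- toy_na[, lapply(.SD, median), by = g, .SDcols = "x"]

# Arguments after the function name are passed on to median() by lapply()
drop_na <- toy_na[, lapply(.SD, median, na.rm = TRUE), by = g, .SDcols = "x"]

print(keep_na)
print(drop_na)
```

Whether dropping NAs is appropriate depends on the data; the point is only that the two calls are not equivalent when missing values are present.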

Some benchmarking

library(rbenchmark)
benchmark(
  data.table = DT[, lapply(.SD, median),
                  by = list(groupname, starttime, fPhase, fCycle),
                  .SDcols = numeric_columns],
  plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle),
               numcolwise(median), na.rm = TRUE),
  idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle),
                     function(x) data.frame(
                       inadist = median(x$inadist), smldist = median(x$smldist),
                       lardist = median(x$lardist), inadur = median(x$inadur),
                       smldur = median(x$smldur), lardur = median(x$lardur),
                       emptyct = median(x$emptyct), entct = median(x$entct),
                       inact = median(x$inact), smlct = median(x$smlct),
                       larct = median(x$larct), na.rm = TRUE)),
  aggregate = aggregate(exp[, numeric_columns],
                        exp[, c("groupname", "starttime", "fPhase", "fCycle")],
                        median),
  replications = 5)

##         test replications elapsed relative user.self 
## 4  aggregate            5    5.42    1.789      5.30   
## 1 data.table            5    3.03    1.000      3.03    
## 3 idataframe            5   11.81    3.898     11.77       
## 2       plyr            5    9.47    3.125      9.45       
Author by

dnagirl

Updated on June 09, 2022

Comments

  • dnagirl almost 2 years

    I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.

    My original call, slow but it works:

    df.median<-ddply(exp, 
                     .(groupname,starttime,fPhase,fCycle), 
                     numcolwise(median), 
                     na.rm=TRUE)
    

    With idata.frame, Error: is.data.frame(df) is not TRUE

    library(plyr)
    df.median<-ddply(idata.frame(exp), 
                     .(groupname,starttime,fPhase,fCycle), 
                     numcolwise(median), 
                     na.rm=TRUE)
    

    So I thought perhaps it was my data, and I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try something similar to my desired call using baseball, it doesn't work:

    bb.median<-ddply(idata.frame(baseball), 
                     .(id,year,team), 
                     numcolwise(median), 
                     na.rm=TRUE)
    # Error: is.data.frame(df) is not TRUE
    

    Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?

    ETA:

    I also tried:

    groupVars <- c("groupname","starttime","fPhase","fCycle")
    voi<-c('inadist','smldist','lardist')
    
    i<-idata.frame(exp)
    ag.median <- aggregate(i[,voi], i[,groupVars], median)
    # Error in i[, voi] : object of type 'environment' is not subsettable
    

    which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
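For what it's worth, a quick check hints at where the first error comes from. The object returned by idata.frame() appears to be an environment with class "idf" rather than a data.frame, so any code that insists on a real data frame (for example a stopifnot(is.data.frame(df)) check, which would produce exactly that message) rejects it. A minimal sketch on a made-up data frame (`small` and `ismall` are hypothetical names):

```r
library(plyr)

# Hypothetical small data frame and its immutable counterpart
small  <- data.frame(id = c("a", "a", "b"), x = c(1, 2, 3))
ismall <- idata.frame(small)

# Compare how the two objects answer the same type checks
print(is.data.frame(small))    # ordinary data frame
print(is.data.frame(ismall))   # immutable version
print(is.environment(ismall))  # what the immutable version is built on
print(class(ismall))
```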