idata.frame: Why error "is.data.frame(df) is not TRUE"?
Given that you are working with 'big' data and looking for performance, this seems a perfect fit for data.table. Specifically, use the lapply(.SD, FUN) idiom together with the .SDcols and by arguments.
Set up the data.table:
library(data.table)
library(plyr)  # provides idata.frame() and ddply(), used below
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Find which columns are numeric:
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
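Since the exp data is not reproduced here, the same idiom can be tried on a small made-up table. This is only a sketch: the toy data frame and the names toy, toy_dt, num_cols and toy_median are hypothetical, and only two of the question's columns are imitated.

```r
library(data.table)

# Toy data (hypothetical; stands in for the 'exp' data frame from the question)
toy <- data.frame(
  groupname = c("a", "a", "b", "b", "b"),
  fPhase    = c("light", "light", "dark", "dark", "dark"),
  inadist   = c(1, 3, 10, 20, 60),
  smldist   = c(2, 4, 5, 7, 9),
  stringsAsFactors = FALSE
)
toy_dt <- as.data.table(toy)

# Numeric columns to summarise; the grouping columns here are character,
# so they are not picked up
num_cols <- names(which(sapply(toy_dt, is.numeric)))

# Within each group, .SD holds that group's rows restricted to .SDcols,
# and lapply() applies median to each of those columns
toy_median <- toy_dt[, lapply(.SD, median),
                     by = list(groupname, fPhase),
                     .SDcols = num_cols]
print(toy_median)
```

The result has one row per (groupname, fPhase) combination and one median column per entry in num_cols.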
Some benchmarking:
library(rbenchmark)
benchmark(
  data.table = DT[, lapply(.SD, median),
                  by = list(groupname, starttime, fPhase, fCycle),
                  .SDcols = numeric_columns],
  plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
  idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle),
                     function(x) data.frame(
                       inadist = median(x$inadist), smldist = median(x$smldist),
                       lardist = median(x$lardist), inadur = median(x$inadur),
                       smldur = median(x$smldur), lardur = median(x$lardur),
                       emptyct = median(x$emptyct), entct = median(x$entct),
                       inact = median(x$inact), smlct = median(x$smlct),
                       larct = median(x$larct), na.rm = TRUE)),
  aggregate = aggregate(exp[, numeric_columns],
                        exp[, c("groupname", "starttime", "fPhase", "fCycle")],
                        median),
  replications = 5)
##         test replications elapsed relative user.self
## 4  aggregate            5    5.42    1.789      5.30
## 1 data.table            5    3.03    1.000      3.03
## 3 idataframe            5   11.81    3.898     11.77
## 2       plyr            5    9.47    3.125      9.45
dnagirl
Updated on June 09, 2022
Comments
dnagirl almost 2 years
I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE)
With idata.frame, I get an error:
library(plyr)
df.median<-ddply(idata.frame(exp), .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE)
Error: is.data.frame(df) is not TRUE
So, I thought, perhaps it is my data. So I tried the baseball dataset. The idata.frame example works fine:
dlply(idata.frame(baseball), "id", nrow)
But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball), .(id,year,team), numcolwise(median), na.rm=TRUE)
Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
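The second error message indicates that the idata.frame object is an environment, so the usual [ , cols] subsetting is not available on it. A minimal sketch of the same aggregate call on a plain data frame, using only base R, might look like this. The toy data below is made up and only imitates two grouping columns and two value columns from the question.

```r
# Toy data (hypothetical stand-in for 'exp')
toy <- data.frame(
  groupname = c("a", "a", "b", "b", "b"),
  fPhase    = c("light", "light", "dark", "dark", "dark"),
  inadist   = c(1, 3, 10, 20, 60),
  smldist   = c(2, 4, 5, 7, 9)
)

groupVars <- c("groupname", "fPhase")
voi <- c("inadist", "smldist")

# A plain data frame supports [ , cols] subsetting, so both arguments
# to aggregate() are ordinary data frames here
ag.median <- aggregate(toy[, voi], toy[, groupVars], median)
print(ag.median)
```

This mirrors the aggregate entry in the benchmark, where exp itself rather than an idata.frame is subset.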