How do I run apply on a data.table?
Solution 1
Completely rewritten on 2015-11-24, to fix an error in previous versions.
Also added more modern options on 2019-09-27
You have a few options.
Process all of the target columns with an embedded call to
lapply()
, using:=
to assign the modified values in place. This relies on:=
's very handy support for simultaneous assignment to several column named on its LHS.Use a
for
loop to run through the target columns one at a time, usingset()
to modify the value of each one in turn.Use a
for
loop to iterate over multiple "naive" calls to[.data.table()
, each one of which modifies a single column.
These methods all seem about equally fast, so which one you use will be mostly a matter of taste. (1) is nicely compact and expressive. It's what I most often use, though you may find (2) easier to read. Because they process and modify the columns one at a time, (2) or (3) will have an advantage in the rare situation in which your data.table is so large that you are in danger of running up against limits imposed by your R session's available memory.
library(data.table)
## Create three identical 1000000-by-20 data.tables
DT1 <- data.table(1:1e6,
as.data.table(replicate(1e6, paste(sample(letters, nr, TRUE),
sample(letters, nr, TRUE)))))
cnames <- c("ID", paste0("X", 1:19))
setnames(DT1, cnames)
DT2 <- copy(DT1); DT3 <- copy(DT1)
## Method 1
system.time({
DT1[, .SDcols=cnames[-1L], cnames[-1L] :=
lapply(.SD, function(x) gsub(" ", "_", x, fixed=TRUE)), ]
})
## user system elapsed
## 10.90 0.11 11.06
## Method 2
system.time({
for(cname in cnames[-1]) {
set(DT2, j=cname, value=gsub(" ", "_", DT2[[cname]], fixed=TRUE))
}
})
## user system elapsed
## 10.65 0.05 10.70
## Method 3
system.time({
for(cname in cnames[-1]) {
DT3[ , (cname) := gsub(" ", "_", get(cname), fixed=TRUE)]
}
})
## user system elapsed
## 10.33 0.03 10.37
For more details on set()
and :=
, read their help page, gotten by typing ?set
or ?":="
.
Solution 2
You can do this:
library("stringr")
dt[, -1] <- lapply(dt[, -1], function(x) str_replace(x," ","_"))
Maiasaura
I'm research assistant professor at UC Berkeley. I use R extensively in my work and as part of a side project, develop R based tools to facilitate open, reproducible science (see http://ropensci.org).
Updated on September 27, 2020Comments
-
Maiasaura over 3 years
I have a
data.table
with columns 2 through 20 as strings with spaces (e.g., "Species Name"). I want to runstr_replace()
on all those columns simultaneously so all the "Species Name" become "Species_Name". I can either do:data.table(apply(as.data.frame(dt[,2:dim(dt)[2], with=F]), 2, function(x){ str_replace(x," ","_") }))
or if I keep it as a
data.table
object, then I can do this one column at a time:dt[,SpeciesName := str_replace(SpeciesName, " ", "_")
How do I do this for all columns 2 through the end similar to the one of the above?
-
Matt Dowle over 12 yearsThis is an interesting case. Here, 19 columns out of 20 are being replaced; the RHS of
:=
is almost all the table. The advantage of:=
is greater when, say, one or two columns are added to 20, or one or two columns of 20 are modified. In those cases most of the columns are left in place and:=
is much faster than copying the whole table. -
Matt Dowle over 12 yearsAlso,
set()
is a new function in v1.8.0 (not yet on CRAN) which provides:=
functionality directly.set()
can be much faster than:=
when assigning within a loop (for easier more natural programming). There's an example in latest NEWS on homepage. -
Josh O'Brien over 12 years@MatthewDowle -- Thanks for your comments. They reminded me that, over the weekend, I'd had a nagging feeling about the answer I'd given here, and spurred me to revisit this. Obviously I had good reason to feel nagged. Please have a look at my revised answer. Also please add any of the suggestions you've got in your comments to the text of my answer, where you think they might help. I'll have a look at
set()
, but don't yet feel qualified to discuss it. And, once again, thanks for all of the work you put into continuing development of the data.table package! -
Matt Dowle over 12 yearsAaah, that's better :-) If I could +2 I would. On the RHS of
:=
perhaps tryDT1[[cname]]
instead ofDT1[,cname,with=FALSE]
.[[
doesn't copy (until write) so should be faster than[.data.table
at whole column extract (untested guess). -
Josh O'Brien over 12 years@MatthewDowle -- I tried your suggestion of using
DT1[[cname]]
, and it was much slower than usingDT1[,cname,with=FALSE]
, basically as slow as my earlier'lapply'
solution. Very interesting. I'll think about this, but if you have an idea of why that is, I'd be very curious to hear your thoughts. Cheers. -
Matt Dowle over 12 yearsOk just ran it. The
gsub
call is being passed a 1-columndata.table
in the example so it's fast but doesn't seem to be doing the right thing. That's an example where thedrop=TRUE
would come in handy, but in the meantime have to add[[1]]
afterwards. Also theDT1 <- DT2 <- DT3
wouldn't actually copy on write, may need to beDT1 <- copy(DT2) <- copy(DT3)
, although the example doesn't seem to use those. -
Josh O'Brien over 12 years@MatthewDowle -- OK, I incorporated both of those changes, and redid the timings. It all looks good now, and is hopefully a useful reference for others. Thanks.
-
Matt Dowle over 12 yearsDoesn't seem right. The
DT1[[cname]][[1]]
is getting the first item only which may be why it's so fast, then:=
is recycling that down the column. It appears to work in this case because the example is the same string. Should be justDT1[[cname]]
then it'll be the correct (slower) speed. -
Josh O'Brien over 12 years@MatthewDowle -- I agree that this isn't right, but I'm all turned around re: what the best method is. I think I should revert this to my original answer, where it was at revision 6, after you made the minor edit correcting the the data.table version I referenced from 1.7.10 to 1.7.8. Please advise me on this, and I'll perform the rollback or otherwise edit the post. (Seems like the takehome here is that being able to put multiple columns on the LHS of
:=
can be useful, but not so much if you are replacing nearly all of the columns). -
Matt Dowle over 12 yearsYes that's the takehome (my first comment above as it turns out ;)) However, there are still two big advantages of multiple
LHS
: i) deleting many columns (instantly) using:=NULL
and ii) if the 20 column table is 50GB in RAM then looping:=
by column will only need one column's worth of working memory at a time to get the job done, whereas anlapply
'dRHS
needs a full copy (maybe even two full copies) and may well return the dreadedout of memory
error. -
Konstantinos over 8 yearsDefining DT1 returns an error:
DT1 <- data.table(1:1e6, + as.data.table(replicate(1e6, paste(sample(letters, nr, TRUE), + sample(letters, nr, TRUE))))) Error in sample.int(length(x), size, replace, prob) : object 'nr' not found
-
Konstantinos over 8 yearsProposed solution:
DT1 <- data.table(1:1e6, as.data.table(replicate(19, paste(sample(letters, 2, TRUE), sample(letters, 2, TRUE)))))
-
Hack-R about 7 yearsSorry, what's
nr
? Code breaks on that when I try to run it -
amonk almost 7 years@Hack-R nr=20 deciphering the comments