How do I run apply on a data.table?

31,037

Solution 1

Completely rewritten on 2015-11-24, to fix an error in previous versions.

Also added more modern options on 2019-09-27

You have a few options.

  1. Process all of the target columns with an embedded call to lapply(), using := to assign the modified values in place. This relies on :='s very handy support for simultaneous assignment to several column named on its LHS.

  2. Use a for loop to run through the target columns one at a time, using set() to modify the value of each one in turn.

  3. Use a for loop to iterate over multiple "naive" calls to [.data.table(), each one of which modifies a single column.

These methods all seem about equally fast, so which one you use will be mostly a matter of taste. (1) is nicely compact and expressive. It's what I most often use, though you may find (2) easier to read. Because they process and modify the columns one at a time, (2) or (3) will have an advantage in the rare situation in which your data.table is so large that you are in danger of running up against limits imposed by your R session's available memory.

library(data.table)

## Create three identical 1000000-by-20 data.tables
DT1 <- data.table(1:1e6,
           as.data.table(replicate(1e6, paste(sample(letters, nr, TRUE),
                                             sample(letters, nr, TRUE)))))
cnames <- c("ID", paste0("X", 1:19))
setnames(DT1, cnames)
DT2 <- copy(DT1); DT3 <- copy(DT1)

## Method 1
system.time({
DT1[, .SDcols=cnames[-1L], cnames[-1L] := 
  lapply(.SD, function(x) gsub(" ", "_", x, fixed=TRUE)), ]
})
##   user  system elapsed 
##  10.90    0.11   11.06 

## Method 2
system.time({
    for(cname in cnames[-1]) {
        set(DT2, j=cname, value=gsub(" ", "_", DT2[[cname]], fixed=TRUE))
    }
})
##   user  system elapsed 
##  10.65    0.05   10.70 

## Method 3
system.time({
    for(cname in cnames[-1]) {
        DT3[ , (cname) := gsub(" ", "_", get(cname), fixed=TRUE)]
    }
})
##   user  system elapsed 
##  10.33    0.03   10.37 

For more details on set() and :=, read their help page, gotten by typing ?set or ?":=".

Solution 2

You can do this:

library("stringr")
dt[, -1] <- lapply(dt[, -1], function(x) str_replace(x," ","_"))
Share:
31,037
Maiasaura
Author by

Maiasaura

I'm research assistant professor at UC Berkeley. I use R extensively in my work and as part of a side project, develop R based tools to facilitate open, reproducible science (see http://ropensci.org).

Updated on September 27, 2020

Comments

  • Maiasaura
    Maiasaura over 3 years

    I have a data.table with columns 2 through 20 as strings with spaces (e.g., "Species Name"). I want to run str_replace() on all those columns simultaneously so all the "Species Name" become "Species_Name". I can either do:

    data.table(apply(as.data.frame(dt[,2:dim(dt)[2], with=F]), 2, 
                                   function(x){ str_replace(x," ","_") }))
    

    or if I keep it as a data.table object, then I can do this one column at a time:

    dt[,SpeciesName := str_replace(SpeciesName, " ", "_")
    

    How do I do this for all columns 2 through the end similar to the one of the above?

  • Matt Dowle
    Matt Dowle over 12 years
    This is an interesting case. Here, 19 columns out of 20 are being replaced; the RHS of := is almost all the table. The advantage of := is greater when, say, one or two columns are added to 20, or one or two columns of 20 are modified. In those cases most of the columns are left in place and := is much faster than copying the whole table.
  • Matt Dowle
    Matt Dowle over 12 years
    Also, set() is a new function in v1.8.0 (not yet on CRAN) which provides := functionality directly. set() can be much faster than := when assigning within a loop (for easier more natural programming). There's an example in latest NEWS on homepage.
  • Josh O'Brien
    Josh O'Brien over 12 years
    @MatthewDowle -- Thanks for your comments. They reminded me that, over the weekend, I'd had a nagging feeling about the answer I'd given here, and spurred me to revisit this. Obviously I had good reason to feel nagged. Please have a look at my revised answer. Also please add any of the suggestions you've got in your comments to the text of my answer, where you think they might help. I'll have a look at set(), but don't yet feel qualified to discuss it. And, once again, thanks for all of the work you put into continuing development of the data.table package!
  • Matt Dowle
    Matt Dowle over 12 years
    Aaah, that's better :-) If I could +2 I would. On the RHS of := perhaps try DT1[[cname]] instead of DT1[,cname,with=FALSE]. [[ doesn't copy (until write) so should be faster than [.data.table at whole column extract (untested guess).
  • Josh O'Brien
    Josh O'Brien over 12 years
    @MatthewDowle -- I tried your suggestion of using DT1[[cname]], and it was much slower than using DT1[,cname,with=FALSE], basically as slow as my earlier 'lapply' solution. Very interesting. I'll think about this, but if you have an idea of why that is, I'd be very curious to hear your thoughts. Cheers.
  • Matt Dowle
    Matt Dowle over 12 years
    Ok just ran it. The gsub call is being passed a 1-column data.table in the example so it's fast but doesn't seem to be doing the right thing. That's an example where the drop=TRUE would come in handy, but in the meantime have to add [[1]] afterwards. Also the DT1 <- DT2 <- DT3 wouldn't actually copy on write, may need to be DT1 <- copy(DT2) <- copy(DT3), although the example doesn't seem to use those.
  • Josh O'Brien
    Josh O'Brien over 12 years
    @MatthewDowle -- OK, I incorporated both of those changes, and redid the timings. It all looks good now, and is hopefully a useful reference for others. Thanks.
  • Matt Dowle
    Matt Dowle over 12 years
    Doesn't seem right. The DT1[[cname]][[1]] is getting the first item only which may be why it's so fast, then := is recycling that down the column. It appears to work in this case because the example is the same string. Should be just DT1[[cname]] then it'll be the correct (slower) speed.
  • Josh O'Brien
    Josh O'Brien over 12 years
    @MatthewDowle -- I agree that this isn't right, but I'm all turned around re: what the best method is. I think I should revert this to my original answer, where it was at revision 6, after you made the minor edit correcting the the data.table version I referenced from 1.7.10 to 1.7.8. Please advise me on this, and I'll perform the rollback or otherwise edit the post. (Seems like the takehome here is that being able to put multiple columns on the LHS of := can be useful, but not so much if you are replacing nearly all of the columns).
  • Matt Dowle
    Matt Dowle over 12 years
    Yes that's the takehome (my first comment above as it turns out ;)) However, there are still two big advantages of multiple LHS: i) deleting many columns (instantly) using :=NULL and ii) if the 20 column table is 50GB in RAM then looping := by column will only need one column's worth of working memory at a time to get the job done, whereas an lapply'd RHS needs a full copy (maybe even two full copies) and may well return the dreaded out of memory error.
  • Konstantinos
    Konstantinos over 8 years
    Defining DT1 returns an error: DT1 <- data.table(1:1e6, + as.data.table(replicate(1e6, paste(sample(letters, nr, TRUE), + sample(letters, nr, TRUE))))) Error in sample.int(length(x), size, replace, prob) : object 'nr' not found
  • Konstantinos
    Konstantinos over 8 years
    Proposed solution: DT1 <- data.table(1:1e6, as.data.table(replicate(19, paste(sample(letters, 2, TRUE), sample(letters, 2, TRUE)))))
  • Hack-R
    Hack-R about 7 years
    Sorry, what's nr? Code breaks on that when I try to run it
  • amonk
    amonk almost 7 years
    @Hack-R nr=20 deciphering the comments