Fast way to replace all blanks with NA in R data.table

10,176

Solution 1

Here's probably the generic data.table way of doing this. I'm also going to use your regex which handles several types of blanks (I havn't seen other answers doing this). You probably shouldn't run this over all your columns rather only over the factor or character ones, because other classes won't accept blank values.

For factors

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For characters

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)

Solution 2

Use this approach:

system.time(data[data==''|data==' ']<-NA)
  user  system elapsed 
  1.47    0.19    1.66 

system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
  user  system elapsed 
  3.41    0.20    3.64

Solution 3

Assuming you had mistake while populating your data, below is the solution using data.table which you used in tag.

library(data.table)
data = data.table(cats=rep(c('', ' ', 'meow'),1000000),dogs=rep(c("woof", " ", NA),1000000))
system.time(data[cats=='', cats := NA][dogs=='', dogs := NA])
#  user  system elapsed 
# 0.056   0.000   0.059 

If you have a lot of column see David's comment.

Solution 4

After trying a few different ways to do this, I found the quickest and simplest option to be:

data[data==""] <- NA
Share:
10,176
Tim_Utrecht
Author by

Tim_Utrecht

Updated on June 12, 2022

Comments

  • Tim_Utrecht
    Tim_Utrecht almost 2 years

    I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this Post, but it's extremely slow for my data table (takes already over 15mins) Example from the other post:

     data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                       dogs=rep(c("woof", " ", NA),1e6))
     system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
    

    Is there a more data.table fast way to achieve this?

    Indeed the provided data does not look much like the original data, it was just to give an example. The following subset of my real data gives the CharToDate(x) error:

    DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
    system.time(DT[DT=='']<-NA)
    
  • Tim_Utrecht
    Tim_Utrecht almost 9 years
    Thanks, I get the Error in charToDate(x) : character string is not in a standard unambiguous format error. Will try to solve it and come back if it works as expected!
  • David Arenburg
    David Arenburg almost 9 years
    You can't solve this with just '' as you have different types of blanks in the data.
  • Colonel Beauvel
    Colonel Beauvel almost 9 years
    ?? which data are you using? not the set of data of your example I guess ;)
  • Colonel Beauvel
    Colonel Beauvel almost 9 years
    well maybe the OP wants to keep ' '. @Tim, can you give more details on this?
  • David Arenburg
    David Arenburg almost 9 years
    OP used a regex in order to remove them gsub("^$|^ $", NA, x) (which is pretty much self explanatory)
  • Tim_Utrecht
    Tim_Utrecht almost 9 years
    Thanks a lot (again) @David. Worked in 4.79 seconds!
  • Tim_Utrecht
    Tim_Utrecht almost 9 years
    now I get the NA values as <NA> which are not recognized by the is.na() function. If I change it to system.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA)) it fails, what you probably also encountered? Is there a way of setting the values to the 'standard' NA?
  • Tim_Utrecht
    Tim_Utrecht almost 9 years
    Isn't it strange that the NA_character is recognized in the is.na(NA_character_) but not if I use this function for the data.table where I just replaced blanks with that NA_character_: data[is.na(DEFAULT_DATE)]->dataNA. Or am I missing something? PS. see edit in Question for the new data.
  • David Arenburg
    David Arenburg almost 9 years
    Ok, try this indx <- which(sapply(data, is.factor)); system.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)); indx2 <- which(sapply(data, is.character)); system.time(for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_))
  • Oriol Prat
    Oriol Prat almost 6 years
    @DavidArenburg I don't know why but I'm testing data[data=='']<-NA versus for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_) with microbenchmark package and tells me first option is much faster.
  • David Arenburg
    David Arenburg almost 6 years
    @OriolPrat you are comparing apples with oranges. Running regex vs doing an exact match has nothing to do with data.table. The point of this answer was to show how to use OPs regex using correct data.table syntax. In this case regex wasn't probably needed in the first place. Although data[data=='']<-NA doesn't replace the " " values.