Fast way to replace all blanks with NA in R data.table
Solution 1
Here's probably the generic data.table
way of doing this. I'm also going to use your regex which handles several types of blanks (I havn't seen other answers doing this). You probably shouldn't run this over all your columns rather only over the factor
or character
ones, because other classes won't accept blank values.
For factor
s
indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)
For character
s
indx2 <- which(sapply(data, is.character))
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
Solution 2
Use this approach:
system.time(data[data==''|data==' ']<-NA)
user system elapsed
1.47 0.19 1.66
system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
user system elapsed
3.41 0.20 3.64
Solution 3
Assuming you had mistake while populating your data, below is the solution using data.table which you used in tag.
library(data.table)
data = data.table(cats=rep(c('', ' ', 'meow'),1000000),dogs=rep(c("woof", " ", NA),1000000))
system.time(data[cats=='', cats := NA][dogs=='', dogs := NA])
# user system elapsed
# 0.056 0.000 0.059
If you have a lot of column see David's comment.
Solution 4
After trying a few different ways to do this, I found the quickest and simplest option to be:
data[data==""] <- NA
Tim_Utrecht
Updated on June 12, 2022Comments
-
Tim_Utrecht almost 2 years
I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this Post, but it's extremely slow for my data table (takes already over 15mins) Example from the other post:
data = data.frame(cats=rep(c('', ' ', 'meow'),1e6), dogs=rep(c("woof", " ", NA),1e6)) system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
Is there a more data.table fast way to achieve this?
Indeed the provided data does not look much like the original data, it was just to give an example. The following subset of my real data gives the CharToDate(x) error:
DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='') system.time(DT[DT=='']<-NA)
-
Tim_Utrecht almost 9 yearsThanks, I get the
Error in charToDate(x) : character string is not in a standard unambiguous format
error. Will try to solve it and come back if it works as expected! -
David Arenburg almost 9 yearsYou can't solve this with just
''
as you have different types of blanks in the data. -
Colonel Beauvel almost 9 years?? which data are you using? not the set of data of your example I guess ;)
-
Colonel Beauvel almost 9 yearswell maybe the OP wants to keep ' '. @Tim, can you give more details on this?
-
David Arenburg almost 9 yearsOP used a regex in order to remove them
gsub("^$|^ $", NA, x)
(which is pretty much self explanatory) -
Tim_Utrecht almost 9 yearsThanks a lot (again) @David. Worked in 4.79 seconds!
-
Tim_Utrecht almost 9 yearsnow I get the NA values as <NA> which are not recognized by the
is.na()
function. If I change it tosystem.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA))
it fails, what you probably also encountered? Is there a way of setting the values to the 'standard' NA? -
Tim_Utrecht almost 9 yearsIsn't it strange that the NA_character is recognized in the
is.na(NA_character_)
but not if I use this function for the data.table where I just replaced blanks with that NA_character_:data[is.na(DEFAULT_DATE)]->dataNA
. Or am I missing something? PS. see edit in Question for the new data. -
David Arenburg almost 9 yearsOk, try this
indx <- which(sapply(data, is.factor)); system.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)); indx2 <- which(sapply(data, is.character)); system.time(for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_))
-
Oriol Prat almost 6 years@DavidArenburg I don't know why but I'm testing
data[data=='']<-NA
versusfor (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
withmicrobenchmark
package and tells me first option is much faster. -
David Arenburg almost 6 years@OriolPrat you are comparing apples with oranges. Running regex vs doing an exact match has nothing to do with data.table. The point of this answer was to show how to use OPs regex using correct data.table syntax. In this case regex wasn't probably needed in the first place. Although
data[data=='']<-NA
doesn't replace the" "
values.