Why is `[` better than `subset`?
Solution 1
This question was answered well in the comments by @James, who pointed to an excellent explanation by Hadley Wickham of the dangers of `subset` (and functions like it) [here]. Go read it!
It's a somewhat long read, so it may be helpful to record here the example Hadley uses that most directly addresses the question of "what can go wrong?". Suppose we want to subset and then reorder a data frame using the following functions:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) {
  scramble(subset(x, condition))
}
subscramble(mtcars, cyl == 4)
This returns the error:
Error in eval(expr, envir, enclos) : object 'cyl' not found
because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:
cyl <- 4
subscramble(mtcars, cyl == 4)
cyl <- sample(10, 100, replace = TRUE)
subscramble(mtcars, cyl == 4)
(Run them and see for yourself; it's pretty crazy.)
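For contrast, here is a minimal sketch of the same task written with `[`. The name `subscramble_idx` and its `rows` argument are made up for illustration and do not come from Hadley's text. The caller builds the logical vector explicitly, so nothing depends on where an unquoted expression happens to be evaluated.

```r
# Same helper as in the example above.
scramble <- function(x) x[sample(nrow(x)), ]

# Hypothetical helper: `rows` is an ordinary logical vector that has
# already been evaluated by the caller, so no non-standard evaluation occurs.
subscramble_idx <- function(x, rows) {
  scramble(x[rows, , drop = FALSE])
}

# The condition refers to mtcars$cyl explicitly.
res <- subscramble_idx(mtcars, mtcars$cyl == 4)
unique(res$cyl)
```

The cost is the extra typing the question complains about; the benefit is that the function behaves the same no matter what objects exist in the calling environment.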
Solution 2
Also, `[` is faster:
library(microbenchmark)
microbenchmark(
  subset(airquality, Month == 8 & Temp > 90),
  airquality[airquality$Month == 8 & airquality$Temp > 90, ]
)
Unit: microseconds
                                                       expr     min       lq   median       uq     max neval
                 subset(airquality, Month == 8 & Temp > 90) 301.994 312.1565 317.3600 349.4170 500.903   100
 airquality[airquality$Month == 8 & airquality$Temp > 90, ] 234.807 239.3125 244.2715 271.7885 340.058   100
flodel
Updated on November 15, 2020
Comments
-
flodel over 3 years
When I need to filter a data.frame, i.e., extract rows that meet certain conditions, I prefer to use the `subset` function:
subset(airquality, Month == 8 & Temp > 90)
Rather than the `[` function:
airquality[airquality$Month == 8 & airquality$Temp > 90, ]
There are two main reasons for my preference:
1. I find the code reads better, from left to right. Even people who know nothing about R could tell what the `subset` statement above is doing.
2. Because columns can be referred to as variables in the `select` expression, I can save a few keystrokes. In my example above, I only had to type `airquality` once with `subset`, but three times with `[`.
So I was living happy, using `subset` everywhere because it is shorter and reads better, even advocating its beauty to my fellow R coders. But yesterday my world broke apart. While reading the `subset` documentation, I noticed this section:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
Could someone help clarify what the authors mean?
First, what do they mean by "for use interactively"? I know what an interactive session is, as opposed to a script run in BATCH mode but I don't see what difference it should make.
Then, could you please explain "the non-standard evaluation of argument subset" and why it is dangerous, maybe provide an example?
-
Heisenberg over 10 years
May I ask some newbie questions for clarification? When we write `subset(mtcars, cyl == 4)` (at top level), where does R look for `cyl`? If it looks into the `mtcars` object that is passed to `subset()`, then shouldn't it be able to find `cyl` even if `scramble` is within another function, since `mtcars` is still being passed to it? If my question doesn't make sense, you could just elaborate more on why R can no longer find `cyl`. Thanks!
-
joran over 10 years
@Anh Inside `subset.data.frame`, the thing we're trying to evaluate at that point is just `condition`. That doesn't exist in `mtcars`. So `subset.data.frame` uses `enclos = parent.frame()` to ensure that `condition` is correctly evaluated as `cyl == 4`. But then we've popped back up to the enclosing frame, and now when R looks for `cyl` it is no longer looking inside of `mtcars`. If we didn't use `enclos`, something like `subset(mtcars, cyl == a)` wouldn't work at all.
-
flodel about 10 years
Yes and no. I think the time difference you are seeing is due to two things: 1) a small (< 100 microseconds) overhead and 2) `subset`, unlike `[`, removes rows where the filter evaluates to `NA`. Do this and you'll see that they are both as fast when compared "fairly": `x <- do.call(rbind, rep(list(airquality), 100)); microbenchmark(subset(x, Month == 8 & Temp > 90), { i <- x$Month == 8 & x$Temp > 90; x[!is.na(i) & i, ] })`
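The `NA` difference is easier to see on a column that actually contains missing values. The sketch below uses `Ozone` from `airquality` rather than the `Month` and `Temp` columns of the benchmark, and the threshold of 100 is an arbitrary choice for illustration.

```r
# Ozone contains missing values, so the filter is NA for those rows.
i <- airquality$Ozone > 100

by_subset  <- subset(airquality, Ozone > 100)   # rows with an NA filter are dropped
by_bracket <- airquality[i, ]                   # each NA in i yields an all-NA row
by_fair    <- airquality[!is.na(i) & i, ]       # NA handled explicitly, as in the comment

# The plain `[` result has one extra row per NA in the filter.
nrow(by_bracket) - nrow(by_subset) == sum(is.na(i))

# With the NA check added, `[` selects the same rows as subset().
identical(rownames(by_subset), rownames(by_fair))
```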
-
3pitt over 6 years
Does anyone know why subset() wouldn't just implement the faster and safer [,] method behind the scenes?
-
joran over 6 years
@MikePalmice It does. The last line of `subset.data.frame` is `x[r, vars, drop = drop]`. The problem is how to get from the unquoted `subset` and `select` arguments to something that you can validly pass to `[.data.frame`.
-
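One way to sidestep that problem when writing your own functions is to have the caller pass the condition as an already quoted expression. This is a sketch of that idea, not something taken from the thread; `subscramble_quoted` is a hypothetical name.

```r
scramble <- function(x) x[sample(nrow(x)), ]

# Hypothetical helper: `condition` must be a quoted expression, e.g. quote(cyl == 4).
subscramble_quoted <- function(x, condition) {
  r <- eval(condition, x, parent.frame())    # evaluate the expression inside x
  scramble(x[!is.na(r) & r, , drop = FALSE]) # hand a plain logical vector to `[`
}

res <- subscramble_quoted(mtcars, quote(cyl == 4))
table(res$cyl)
```

Because the expression arrives as an ordinary value, it can be passed through any number of intermediate functions before it is evaluated against the data frame.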
3pitt over 6 years
@joran got it, thanks. how do you think about whether to use dplyr's filter instead of `[]`?
-
tjebo almost 6 years
This is such an old question/answer with so many upvotes, so clearly I am overlooking something? For me, your example code doesn't work on its own. Hadley's example contains the pre-creation of another function called 'subset2'... The important difference between `[` and `subset()` lies then within this function...
-
joran almost 6 years
@Tjebo The example code in my answer works exactly as I describe for me in a clean R (3.4.3) session, as of 5 minutes ago.
-
tjebo almost 6 years
Thanks for checking. I might have misunderstood the intention of your code. But when I replace `subset` with subsetting via `[`, I get the same 'weird' result as your code using `subset`, at least here :/ Also clean R 3.4.3