How do I filter a data.frame in R by categorical variable?

Solution 1

You can get a subset of your data by indexing or using subset:

ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))

subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")

ex0331[ex0331$supplement=="Fe3",]

Or do it all at once with split, which returns a list with one data.frame per level:

split(ex0331,ex0331$supplement)
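
Each element of that list is an ordinary data.frame, so per-group summaries such as length, mean, and sd can be computed with sapply. This is a minimal sketch, assuming the lowercase column names iron and supplement from the example above; the set.seed call and the names groups and stats are additions made here for illustration only:

```r
set.seed(1)  # assumption: fixed seed, used only to make the run reproducible
ex0331 <- data.frame(iron = rnorm(36), supplement = c("Fe3", "Fe4"))

# split() returns a named list with one data.frame per supplement level
groups <- split(ex0331, ex0331$supplement)

# compute several summaries of the iron column for each group;
# sapply() binds the named vectors into a matrix with one column per group
stats <- sapply(groups, function(g) {
  c(n = length(g$iron), mean = mean(g$iron), sd = sd(g$iron))
})
stats
```

The result is a matrix with rows n, mean, and sd and one column per supplement, so a single value can be read with, for example, stats["mean", "Fe3"].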

Another thing you can do is use tapply to split by a factor and then perform a function:

tapply(ex0331$iron,ex0331$supplement,mean)
        Fe3         Fe4 
-0.15443861 -0.01308835 

The plyr package, which has loads of useful functions, can also be used. For example:

library(plyr)
daply(ex0331, .(supplement), function(x) mean(x$iron))
        Fe3         Fe4 
-0.15443861 -0.01308835 

Edit

In response to the edited question, you can get the log of iron per supplement with:

ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))

tapply(ex0331$iron,ex0331$supplement,log)

Or with plyr:

library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))

Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
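
One way to avoid the wrapper function is to split the iron vector itself rather than the whole data.frame, and then apply log to each piece with lapply. This is a sketch in base R, not necessarily the simplest option; the set.seed call and the name log_by_group are assumptions added here:

```r
set.seed(1)  # assumption: fixed seed, used only to make the run reproducible
ex0331 <- data.frame(iron = abs(rnorm(36)), supplement = c("Fe3", "Fe4"))

# split the numeric vector by supplement, then take the log of each piece;
# no anonymous function is needed because log() is applied to plain vectors
log_by_group <- lapply(split(ex0331$iron, ex0331$supplement), log)
str(log_by_group)
```

Like the tapply and dlply versions, this returns a list with one numeric vector per supplement.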

Solution 2

I'd recommend using the ddply function from the plyr package; detailed documentation is available online:

> require(plyr)
> ddply( ex0331, .(Supplement), summarise, 
         mean = mean(Iron), 
         sd = sd(Iron), 
         len = length(Iron))

  Supplement       mean        sd len
1        Fe3 -0.3749169 0.2827360   4
2        Fe4  0.1953116 0.7128129   6

Update. To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:

> transform(ex0331, LogIron = log(Iron))

         Iron Supplement     LogIron
1  0.07185141        Fe3 -2.63315498
2  1.10367297        Fe3  0.09864368
3  0.48592428        Fe3 -0.72170246
4  0.20286918        Fe3 -1.59519393
5  0.80830682        Fe4 -0.21281357

Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:

> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
  Supplement    meanLog
1        Fe3 -1.0062304
2        Fe4  0.2791507
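
If you would rather not depend on plyr, a similar per-group summary can be sketched in base R with aggregate. The data below is a simulated stand-in for the asker's ex0331 (same column names, 18 rows per level), so the numbers will not match the output shown above; set.seed and the name mean_log are assumptions added for illustration:

```r
set.seed(1)  # assumption: fixed seed, used only to make the run reproducible
# assumption: simulated stand-in for the asker's ex0331 data
ex0331 <- data.frame(
  Iron = abs(rnorm(36)),
  Supplement = factor(rep(c("Fe3", "Fe4"), each = 18))
)

# mean of log(Iron) for each Supplement level, via the formula interface
mean_log <- aggregate(Iron ~ Supplement, data = ex0331,
                      FUN = function(x) mean(log(x)))
names(mean_log)[2] <- "meanLog"  # rename the summary column
mean_log
```

The result is a data.frame with one row per Supplement level, in the same shape as the ddply summary.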

Author: Stephen O'Grady

Updated on December 09, 2020

Comments

  • Stephen O'Grady over 3 years

    Just learning R.

    Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for usage?

    str(ex0331)
    'data.frame':   36 obs. of  2 variables:
    $ Iron      : num  0.71 1.66 2.01 2.16 2.42 ...
    $ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...
    

    Basically, I need to be able to operate on the two factor levels separately; i.e., I need to determine the length, mean, sd, etc. of the Iron retention rate for each Supplement type (Fe3 or Fe4) individually.

    What's the easiest way to accomplish this?

    I'm aware of the by() command. For example, the following gets some of what I need:

    by(ex0331, ex0331$Supplement, summary)
    ex0331$Supplement: Fe3
         Iron       Supplement
    Min.   :0.710   Fe3:18    
    1st Qu.:2.420   Fe4: 0    
    Median :3.475             
    Mean   :3.699             
    3rd Qu.:4.472             
    Max.   :8.240             
    ------------------------------------------------------------ 
    ex0331$Supplement: Fe4
         Iron        Supplement
    Min.   : 2.200   Fe3: 0    
    1st Qu.: 3.892   Fe4:18    
    Median : 5.750             
    Mean   : 5.937             
    3rd Qu.: 6.970             
    Max.   :12.450      
    

    But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.