How do I filter a data.frame in R by categorical variable?
Solution 1
You can get a subset of your data by indexing or by using subset:
ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))
subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")
ex0331[ex0331$supplement=="Fe3",]
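One detail worth knowing when the categorical column is stored as a factor (as it is in the question's str() output): filtering keeps the unused levels around, which is why by() reports "Fe4: 0" for the Fe3 group. A minimal sketch, assuming a made-up data frame named ex rather than the real ex0331:

```r
# Assumed stand-in data: 36 rows, supplement stored as a factor
set.seed(1)
ex <- data.frame(iron = rnorm(36),
                 supplement = factor(rep(c("Fe3", "Fe4"), each = 18)))

fe3 <- ex[ex$supplement == "Fe3", ]   # logical indexing keeps all columns
table(fe3$supplement)                 # the empty Fe4 level is still listed

fe3 <- droplevels(fe3)                # drop levels with no remaining rows
levels(fe3$supplement)

# %in% generalises the filter to several categories at once
both <- ex[ex$supplement %in% c("Fe3", "Fe4"), ]
nrow(both)
```

Whether to call droplevels() depends on what comes next; it mostly matters for tables and plots that would otherwise show empty groups.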
Or all at once with split, resulting in a list:
split(ex0331,ex0331$supplement)
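Because split returns a named list, any per-group work can then go through lapply or sapply. A small sketch under the same assumptions as the example data above (the name ex and the choice of quantile are illustrative only):

```r
set.seed(1)
ex <- data.frame(iron = rnorm(36), supplement = c("Fe3", "Fe4"))

# Split just the numeric column: a named list with one vector per level
groups <- split(ex$iron, ex$supplement)
names(groups)

# Apply an arbitrary function to every group; sapply simplifies the
# result to a matrix with one column per group
quartiles <- sapply(groups, quantile, probs = c(0.25, 0.5, 0.75))
quartiles
```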
Another thing you can do is use tapply to split by a factor and then apply a function to each group:
tapply(ex0331$iron,ex0331$supplement,mean)
Fe3 Fe4
-0.15443861 -0.01308835
The plyr package, which has loads of useful functions, can also be used. For example:
library(plyr)
daply(ex0331, .(supplement), function(x) mean(x$iron))
Fe3 Fe4
-0.15443861 -0.01308835
Edit
In response to the edited question, you could get the log of iron per supplement with:
ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))
tapply(ex0331$iron,ex0331$supplement,log)
Or with plyr:
library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))
Both return their results in a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
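One possibly simpler route, since log() is vectorised: take the log of the whole column first and split the result afterwards, so no wrapper function is needed. This is a sketch in base R using a made-up data frame called ex, not the original data:

```r
set.seed(1)
ex <- data.frame(iron = abs(rnorm(36)), supplement = c("Fe3", "Fe4"))

# log() works on the whole vector; split() then groups the logged values
log_by_group <- split(log(ex$iron), ex$supplement)

str(log_by_group)   # a list with one numeric vector per supplement
```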
Solution 2
I'd recommend using the ddply function from the plyr package; detailed documentation is available online:
> require(plyr)
> ddply( ex0331, .(Supplement), summarise,
mean = mean(Iron),
sd = sd(Iron),
len = length(Iron))
Supplement mean sd len
1 Fe3 -0.3749169 0.2827360 4
2 Fe4 0.1953116 0.7128129 6
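If plyr is not installed, base R's aggregate() with a formula can produce a comparable per-group table. This is a different function from ddply, shown only as a rough equivalent; the data frame ex below is a made-up stand-in that reuses the question's column names:

```r
set.seed(1)
# Hypothetical stand-in for ex0331; the values are random, not the real data
ex <- data.frame(Iron = abs(rnorm(36)),
                 Supplement = factor(rep(c("Fe3", "Fe4"), each = 18)))

# FUN returns a named vector, so the Iron column of the result is a
# matrix with one named column per statistic
agg <- aggregate(Iron ~ Supplement, data = ex,
                 FUN = function(v) c(mean = mean(v), sd = sd(v), len = length(v)))
agg
```

Note that the statistics end up in a matrix column (agg$Iron), not in separate top-level columns as with ddply.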
Update.
To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:
> transform(ex0331, LogIron = log(Iron))
Iron Supplement LogIron
1 0.07185141 Fe3 -2.63315498
2 1.10367297 Fe3 0.09864368
3 0.48592428 Fe3 -0.72170246
4 0.20286918 Fe3 -1.59519393
5 0.80830682 Fe4 -0.21281357
Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:
> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
Supplement meanLog
1 Fe3 -1.0062304
2 Fe4 0.2791507
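The two ideas above (a new column per row, and a summary per Supplement) can also be combined: base R's ave() computes a per-group statistic and repeats it on every row of that group. A minimal sketch, assuming a made-up data frame ex and a hypothetical column name MeanLogIron:

```r
set.seed(1)
# Hypothetical stand-in for ex0331; values are random and strictly positive
ex <- data.frame(Iron = abs(rnorm(36)) + 0.1,
                 Supplement = factor(rep(c("Fe3", "Fe4"), each = 18)))

# ave() applies FUN within each Supplement group and returns a vector
# as long as the input, so it can be stored as a new column
ex$MeanLogIron <- ave(log(ex$Iron), ex$Supplement, FUN = mean)

head(ex, 3)
unique(ex[, c("Supplement", "MeanLogIron")])
```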
Stephen O'Grady
Updated on December 09, 2020
Comments
Stephen O'Grady over 3 years
Just learning R.
Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for usage?
str(ex0331)
'data.frame': 36 obs. of 2 variables:
 $ Iron      : num 0.71 1.66 2.01 2.16 2.42 ...
 $ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...
Basically, I need to be able to operate on the two factors separately; i.e. I need the ability to individually determine length/mean/sd/etc. of the Iron retention rate by Supplement type (Fe3 or Fe4).
What's the easiest way to accomplish this?
I'm aware of the by() command. For example, the following gets some of what I need:
by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
      Iron       Supplement
 Min.   :0.710   Fe3:18
 1st Qu.:2.420   Fe4: 0
 Median :3.475
 Mean   :3.699
 3rd Qu.:4.472
 Max.   :8.240
------------------------------------------------------------
ex0331$Supplement: Fe4
      Iron        Supplement
 Min.   : 2.200   Fe3: 0
 1st Qu.: 3.892   Fe4:18
 Median : 5.750
 Mean   : 5.937
 3rd Qu.: 6.970
 Max.   :12.450
But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.