How to select columns conditionally in a data frame in R

Solution 1

You're looking for aggregate. Here is a formula that returns the median age and weight by sex:

aggregate(cbind(age, weight) ~ sex, data=jalal, FUN=median)
##   sex  age weight
## 1   F 20.5  189.9
## 2   M 21.0  198.1
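To make the formula easier to read, here is a minimal sketch that rebuilds the relevant columns of jalal by hand (values copied from the CSV in the question) and checks the aggregate result for one group against a manual calculation. The names by_sex and women are just illustrative.

```r
# Columns copied from jalal.csv as posted in the question
jalal <- data.frame(
  age    = c(23, 21, 22, 22, 21, 16, 21, 18, 15, 19,
             20, 21, 22, 21, 15, 20, 16, 23, 19, 24),
  sex    = c("F", "M", "F", "M", "M", "F", "F", "M", "M", "M",
             "F", "M", "F", "F", "F", "F", "F", "M", "M", "M"),
  weight = c(93.8, 180.8, 196.5, 256.2, 219.6, 152.1, 183.3, 179.1, 206.1, 211.6,
             209.4, 194.0, 204.1, 157.4, 238.0, 154.8, 245.8, 198.2, 169.1, 198.0)
)

# Left of ~ : the columns to summarise; right of ~ : the grouping column
by_sex <- aggregate(cbind(age, weight) ~ sex, data = jalal, FUN = median)
print(by_sex)

# The same numbers for the women, computed without aggregate
women <- jalal[jalal$sex == "F", ]
median(women$age)     # 20.5
median(women$weight)  # 189.9
```

Any function that reduces a vector to a single value (mean, min, max, length, ...) can be passed as FUN in the same way.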

To get a data frame containing just the women, here is the syntax for [:

jalal[jalal$sex == 'F',]

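Note the quotes around 'F'. A bare F is shorthand for FALSE, so sex == F compares the column against FALSE rather than the string 'F'. That's why your second subset expression returns no rows. (Your first one has the same problem and also uses =, which is assignment, instead of ==.)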

subset(jalal, subset=(sex =='F'))
##    age sex weight eye.color hair.color
## 1   23   F   93.8      blue      black
## 3   22   F  196.5     hazel       gray
## 6   16   F  152.1      blue       gray

...
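The difference between a bare F and the string 'F' can be seen on a small vector. This is only a sketch using a made-up four-element sex vector, not the full data frame:

```r
sex <- c("F", "M", "F", "M")

# Bare F is FALSE; for the comparison it is coerced to the string "FALSE",
# so no element matches.
sex == F
# [1] FALSE FALSE FALSE FALSE

# Quoted "F" is the string we actually want to match.
sex == "F"
# [1]  TRUE FALSE  TRUE FALSE

which(sex == "F")
# [1] 1 3
```

Writing FALSE and TRUE in full instead of F and T makes this kind of mix-up easier to spot.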

A comment asks how to get the mean values for women with blue eyes. A first approach is to filter the data frame to just blue-eyed people:

aggregate(cbind(age, weight) ~ sex, data=jalal[jalal$eye.color == 'blue',], FUN=mean)
##   sex      age   weight
## 1   F 19.66667 151.7667
## 2   M 18.00000 212.8500

But this seems hackish: we are filtering on eye color but not on sex, so the result still includes men. Here is a formula that gives the mean age and weight by sex and eye color. From this, you can read off the means for blue-eyed women, green-eyed men, etc.:

aggregate(cbind(age, weight) ~ sex + eye.color, data=jalal, FUN=mean)
##   sex eye.color      age   weight
## 1   M     amber 21.50000 218.5000
## 2   F      blue 19.66667 151.7667
## 3   M      blue 18.00000 212.8500
## 4   M     brown 19.33333 194.9000
## 5   F      gray 19.00000 194.6333
## 6   M      gray 23.00000 198.2000
## 7   F     green 18.50000 221.0500
## 8   M     green 21.50000 183.5500
## 9   F     hazel 21.50000 176.9500

Note that rows 2 and 3 here match the results of the prior expression.
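If only the blue-eyed women are of interest, one option is to pick that row out of the aggregate result, and another is to filter on both conditions directly. The sketch below assumes the data from the question; by_group and blue_women are illustrative names.

```r
# Columns copied from jalal.csv as posted in the question
jalal <- data.frame(
  age       = c(23, 21, 22, 22, 21, 16, 21, 18, 15, 19,
                20, 21, 22, 21, 15, 20, 16, 23, 19, 24),
  sex       = c("F", "M", "F", "M", "M", "F", "F", "M", "M", "M",
                "F", "M", "F", "F", "F", "F", "F", "M", "M", "M"),
  weight    = c(93.8, 180.8, 196.5, 256.2, 219.6, 152.1, 183.3, 179.1, 206.1, 211.6,
                209.4, 194.0, 204.1, 157.4, 238.0, 154.8, 245.8, 198.2, 169.1, 198.0),
  eye.color = c("blue", "amber", "hazel", "amber", "blue", "blue", "gray",
                "brown", "blue", "brown", "blue", "brown", "green", "hazel",
                "green", "gray", "gray", "gray", "green", "green")
)

# Option 1: aggregate by both factors, then keep the one row of interest
by_group <- aggregate(cbind(age, weight) ~ sex + eye.color, data = jalal, FUN = mean)
by_group[by_group$sex == "F" & by_group$eye.color == "blue", ]

# Option 2: filter on both conditions with &, then take column means
blue_women <- jalal[jalal$sex == "F" & jalal$eye.color == "blue", ]
colMeans(blue_women[, c("age", "weight")])
```

Both options should give the same two numbers as the F/blue row in the table above.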

Solution 2

Here's an alternative solution using the data.table package:

require(data.table)
jalal <- as.data.table(jalal)

To subset on females:

jalal[sex == "F"]

To calculate the mean, median, etc.:

> jalal[sex == "F", mean(weight)]
[1] 183.52
> jalal[sex == "F", list(mean(weight), median(age))]
       V1   V2
1: 183.52 20.5
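The default column names V1 and V2 can be replaced by naming the list elements, and the by argument gives one row per group instead of filtering. A minimal sketch, assuming the data.table package is installed and using only the columns needed here (.() is data.table's shorthand for list()):

```r
library(data.table)

# Columns copied from jalal.csv as posted in the question
jalal <- data.table(
  age    = c(23, 21, 22, 22, 21, 16, 21, 18, 15, 19,
             20, 21, 22, 21, 15, 20, 16, 23, 19, 24),
  sex    = c("F", "M", "F", "M", "M", "F", "F", "M", "M", "M",
             "F", "M", "F", "F", "F", "F", "F", "M", "M", "M"),
  weight = c(93.8, 180.8, 196.5, 256.2, 219.6, 152.1, 183.3, 179.1, 206.1, 211.6,
             209.4, 194.0, 204.1, 157.4, 238.0, 154.8, 245.8, 198.2, 169.1, 198.0)
)

# Named summary columns for the women only
women_stats <- jalal[sex == "F", .(MeanWeight = mean(weight), MedianAge = median(age))]
print(women_stats)

# The same summaries for every level of sex at once
by_sex <- jalal[, .(MeanWeight = mean(weight), MedianAge = median(age)), by = sex]
print(by_sex)
```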
Author by

Mona Jalal

Updated on June 04, 2022

Comments

  • Mona Jalal
    Mona Jalal almost 2 years

    How can I find the mean/median (or any other such statistic) for the women? I have tried a few pieces of code to access the women's data in particular but was unsuccessful. Any help is really appreciated.

    > jalal <- read.csv("jalal.csv", header=TRUE,sep=",")
    > which(jalal$sex==F)
    integer(0)
    > jalal
       age sex weight eye.color hair.color
    1   23   F   93.8      blue      black
    2   21   M  180.8     amber       gray
    3   22   F  196.5     hazel       gray
    4   22   M  256.2     amber      black
    5   21   M  219.6      blue       gray
    6   16   F  152.1      blue       gray
    7   21   F  183.3      gray   chestnut
    8   18   M  179.1     brown      blond
    9   15   M  206.1      blue      white
    10  19   M  211.6     brown      blond
    11  20   F  209.4      blue      white
    12  21   M  194.0     brown     auburn
    13  22   F  204.1     green      black
    14  21   F  157.4     hazel        red
    15  15   F  238.0     green       gray
    16  20   F  154.8      gray       gray
    17  16   F  245.8      gray       gray
    18  23   M  198.2      gray        red
    19  19   M  169.1     green      brown
    20  24   M  198.0     green       gray
    > subset(jalal, subset=(sex =F)) -> females
    > females
    [1] age        sex        weight     eye.color  hair.color
    <0 rows> (or 0-length row.names)
    > subset(jalal, subset=(sex ==F)) -> females
    > females
    [1] age        sex        weight     eye.color  hair.color
    <0 rows> (or 0-length row.names)
    

    Here's what's in jalal.csv:

    "age","sex","weight","eye.color","hair.color"
    23,"F",93.8,"blue","black"
    21,"M",180.8,"amber","gray"
    22,"F",196.5,"hazel","gray"
    22,"M",256.2,"amber","black"
    21,"M",219.6,"blue","gray"
    16,"F",152.1,"blue","gray"
    21,"F",183.3,"gray","chestnut"
    18,"M",179.1,"brown","blond"
    15,"M",206.1,"blue","white"
    19,"M",211.6,"brown","blond"
    20,"F",209.4,"blue","white"
    21,"M",194,"brown","auburn"
    22,"F",204.1,"green","black"
    21,"F",157.4,"hazel","red"
    15,"F",238,"green","gray"
    20,"F",154.8,"gray","gray"
    16,"F",245.8,"gray","gray"
    23,"M",198.2,"gray","red"
    19,"M",169.1,"green","brown"
    24,"M",198,"green","gray"
    
  • Mona Jalal
    Mona Jalal over 10 years
    Also, I was wondering whether FUN can count instead of just computing the mean/median/weighted mean. For example, how can I use aggregate to count the number of people who have brown or black eyes? I couldn't find a function for counting in ?aggregate. Basically, I want to know how to find a list of functions that can be passed as FUN to aggregate.
  • Matthew Lundberg
    Matthew Lundberg over 10 years
    A count is a vector length in R. Pass FUN=length for this. It's easiest to create a column of 1's (jalal$count <- 1) and use count in place of cbind(age, weight) in the formula.
  • Matthew Lundberg
    Matthew Lundberg over 10 years
    You can name the columns: list(MeanWeight=mean(weight), MedianAge=median(age))
  • Scott Ritchie
    Scott Ritchie over 10 years
    Thanks! I'm still in the process of learning the data.table syntax.
  • Mona Jalal
    Mona Jalal over 10 years
    @Matthew Lundberg: Can I find out how old the third-heaviest person is using the aggregate function? I tried this, but it wasn't helpful: > aggregate(age ~ weight, data=jalal, FUN=rank)
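Two of the questions in the comments above can be sketched in a few lines. The counting part follows the FUN=length suggestion with a helper column of 1's. The third-heaviest part is not answered in the thread; the version below uses order instead of aggregate, which is an assumption about what the asker wanted, since aggregate summarises groups rather than ranking individual rows. The names counts and third are illustrative.

```r
# Columns copied from jalal.csv as posted in the question
jalal <- data.frame(
  age       = c(23, 21, 22, 22, 21, 16, 21, 18, 15, 19,
                20, 21, 22, 21, 15, 20, 16, 23, 19, 24),
  weight    = c(93.8, 180.8, 196.5, 256.2, 219.6, 152.1, 183.3, 179.1, 206.1, 211.6,
                209.4, 194.0, 204.1, 157.4, 238.0, 154.8, 245.8, 198.2, 169.1, 198.0),
  eye.color = c("blue", "amber", "hazel", "amber", "blue", "blue", "gray",
                "brown", "blue", "brown", "blue", "brown", "green", "hazel",
                "green", "gray", "gray", "gray", "green", "green")
)

# Counting: a column of 1's summarised with FUN = length gives group sizes
jalal$count <- 1
counts <- aggregate(count ~ eye.color, data = jalal, FUN = length)
print(counts)

# People with brown or black eyes (no one in this data has black eyes)
sum(jalal$eye.color %in% c("brown", "black"))
# [1] 3

# Third-heaviest person: sort row indices by weight, descending, take the 3rd
third <- jalal[order(jalal$weight, decreasing = TRUE)[3], ]
third$age
# [1] 15
```

For a quick one-way count, table(jalal$eye.color) gives the same group sizes without the helper column.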