How do I create a new column based on multiple conditions from multiple columns?

Solution 1

We can use %in% to compare a column against several values, and & to check that all of the conditions are TRUE.

library(dplyr)
df %>%
     mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") & 
           as.character(kids) == "Yes" & 
           as.numeric(as.character(distance)) < 10)+1] )
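
The c("", "Yes")[... + 1] construct above may look cryptic, so here is a minimal sketch of how it works. The vector cond is a hypothetical stand-in for the combined condition; it is not part of the original data.

```r
# Hypothetical logical vector standing in for the combined condition
cond <- c(TRUE, FALSE, TRUE, NA)

# Adding 1 coerces FALSE/TRUE to 1/2; NA stays NA
idx <- cond + 1

# Position 1 selects "", position 2 selects "Yes", NA selects NA
flag <- c("", "Yes")[idx]

print(idx)
print(flag)
```

A condition that evaluates to NA therefore produces NA in the new column rather than "".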

It is better to create the data.frame with stringsAsFactors=FALSE. Before R 4.0.0 the default was TRUE (since 4.0.0 it is FALSE), and with the old default str(df) shows that all the columns are of class factor. Also, if there are missing values, use NA instead of "" so that a numeric column is not converted to another class.
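
To illustrate the point about "" versus NA, here is a small sketch. The vectors with_blank and with_na are made-up examples, not the question's data.

```r
# "" forces every element to character; NA keeps the vector numeric
with_blank <- c(1, 12, "", 8)
with_na    <- c(1, 12, NA, 8)

print(class(with_blank))
print(class(with_na))

# On the character vector, < compares strings, so "8" is not less than "10"
blank_cmp <- with_blank < 10

# On the numeric vector, < compares numbers and NA stays NA
na_cmp <- with_na < 10

print(blank_cmp)
print(na_cmp)
```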

If we rewrite the creation of 'df' with NA for the missing distances:

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)

then the code above can be simplified:

df1 %>%
    mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
        kids == "Yes" &
        distance < 10)+1] )

Some people find ifelse easier to read:

df1 %>% 
   mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
                kids == "Yes" &
                distance < 10, 
                          "Yes", ""))

This can also be done easily with base R:

df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
              kids == "Yes" & 
              distance < 10, 
                       "Yes", ""))

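One caveat with NA: if commute and kids both qualify but distance is missing, the combined condition is NA and ifelse returns NA rather than "". That case does not occur in the question's data (both rows with a missing distance have kids == "No"). The sketch below uses a hypothetical three-row data frame, demo, to show the behaviour and one possible guard with is.na.

```r
# Hypothetical rows: the second one qualifies on commute and kids
# but has a missing distance
demo <- data.frame(commute  = c("walk", "bike", "drive"),
                   kids     = c("Yes", "Yes", "Yes"),
                   distance = c(3, NA, 4),
                   stringsAsFactors = FALSE)
ok_commute <- c("walk", "bike", "subway", "ferry")

# Plain condition: the NA in distance propagates into the result
plain <- with(demo, ifelse(commute %in% ok_commute &
                             kids == "Yes" &
                             distance < 10, "Yes", ""))

# Guarded condition: a missing distance counts as "condition not met"
guarded <- with(demo, ifelse(commute %in% ok_commute &
                               kids == "Yes" &
                               !is.na(distance) & distance < 10, "Yes", ""))

print(plain)
print(guarded)
```

Whether a missing distance should yield NA or "" depends on what the column is used for later, so treat the guard as optional.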
Solution 2

The solution is already pointed out by @akrun. I'd like to present it in a more 'wrapped up' way.

You can use ifelse() to create a column based on one or more conditions. First, though, you have to change how missing values are encoded in the distance column. You used "" to indicate a missing value, but this converts the entire column to character and prevents a numeric comparison (distance < 10 would compare strings, not numbers). The R way of indicating a missing value is NA, so your definition of distance should be:

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df <- data.frame(commute, kids, distance)  # rebuild df so it picks up the numeric distance

The ifelse statement then looks like this:

df$get.flyer <- ifelse(
    ( 
        (df$commute %in% c("walk", "bike", "subway", "ferry")) &
        (df$kids == "Yes")                                     &
        (df$distance < 10)
    ),
    1,  # if condition is met, put 1
    0   # else put 0
)

Optional: Consider encoding your other columns in a different way as well:

  • you could use TRUE and FALSE instead of "Yes" and "No" for the kids variable
  • you could use a factor for commute
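
As a rough sketch of the optional encoding suggested above, the following uses a logical kids column and a factor commute column. The data frame df2 and its four rows are hypothetical. With a logical column, the kids == "Yes" comparison is no longer needed, and %in% works on a factor just as it does on a character vector.

```r
# Hypothetical data with the alternative encodings
commute  <- factor(c("walk", "bike", "drive", "ferry"))
kids     <- c(TRUE, FALSE, TRUE, TRUE)
distance <- c(1, 5, 3, NA)
df2 <- data.frame(commute, kids, distance)

# The logical kids column can be used directly in the condition;
# the result is itself stored as a logical column
df2$get.flyer <- with(df2,
  commute %in% c("walk", "bike", "subway", "ferry") & kids & distance < 10)

print(df2)
```

Storing get.flyer as TRUE/FALSE instead of "Yes"/"" is a design choice that keeps later filtering simple, for example df2[which(df2$get.flyer), ].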

Solution 3

Example: check whether the value of first_column_name is contained in second_column_name, row by row, and write the result to new_column.

df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))

Details:

df$new_column <-   # create a new column named new_column on df
  apply(df, 1,     # apply the following function to df; `1` means once per row
    function(x)    # function applied to each row; x is the current row
      grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))
      # body of the function: grepl checks whether first_column_name occurs in second_column_name;
      # fixed = TRUE means treat first_column_name as plain text, not as a regular expression
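
Here is a small self-contained run of the idea above. The data frame demo and its values are invented for illustration; only the column names follow the answer. The second call drops fixed = TRUE to show why it matters when the search text contains regex metacharacters such as ".".

```r
# Hypothetical data using the column names from the answer
demo <- data.frame(first_column_name  = c("ab", "x.y", "cat"),
                   second_column_name = c("cab", "xzy", "dog"),
                   stringsAsFactors = FALSE)

# Plain-text containment check, one row at a time
plain_match <- apply(demo, 1, function(x)
  grepl(x["first_column_name"], x["second_column_name"], fixed = TRUE))

# Same check with the pattern interpreted as a regular expression:
# "x.y" now matches "xzy" because "." matches any character
regex_match <- apply(demo, 1, function(x)
  grepl(x["first_column_name"], x["second_column_name"]))

demo$new_column <- plain_match
print(demo)
print(regex_match)
```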
Ankie
Author by

Ankie

Updated on July 17, 2022

Comments

  • Ankie
    Ankie almost 2 years

    I'm trying add a new column to a data frame based on several conditions from other columns. I have the following data:

    > commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
    > kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
    > distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
    > 
    > df = data.frame(commute, kids, distance)
    > df
       commute kids distance
    1     walk  Yes        1
    2     bike  Yes       12
    3   subway   No        5
    4    drive   No       25
    5    ferry  Yes        7
    6     walk  Yes        2
    7     bike   No         
    8   subway   No        8
    9    drive  Yes       19
    10   ferry  Yes        7
    11    walk   No         
    12    bike   No        4
    13  subway  Yes       16
    14   drive   No       12
    15   ferry  Yes        7
    

    If the following three conditions are met:

    commute = walk OR bike OR subway OR ferry
    AND
    kids = Yes
    AND
    distance is less than 10
    

    Then I'd like a new column called get.flyer to equal "Yes". The final data frame should look like this:

       commute kids distance get.flyer
    1     walk  Yes        1       Yes
    2     bike  Yes       12          
    3   subway   No        5          
    4    drive   No       25          
    5    ferry  Yes        7       Yes
    6     walk  Yes        2       Yes
    7     bike   No                   
    8   subway   No        8          
    9    drive  Yes       19          
    10   ferry  Yes        7       Yes
    11    walk   No                   
    12    bike   No        4          
    13  subway  Yes       16          
    14   drive   No       12          
    15   ferry  Yes        7       Yes