How do I create a new column based on multiple conditions from multiple columns?
Solution 1
We can use %in% to compare a column against several values at once, and & to check that all of the conditions are TRUE.
library(dplyr)
df %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   as.character(kids) == "Yes" &
                                   as.numeric(as.character(distance)) < 10) + 1])
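The indexing trick used above may not be obvious, so here is a minimal sketch of how it works. Adding 1 to a logical vector turns FALSE into 1 and TRUE into 2, and those numbers then pick the first or second element of c("", "Yes"). The vector `cond` is a made-up example, not part of the original data.

```r
# Hypothetical logical vector standing in for the combined condition
cond <- c(TRUE, FALSE, NA)

# TRUE + 1 is 2, FALSE + 1 is 1, NA + 1 stays NA
idx <- cond + 1

# Index into the two possible labels: position 1 is "", position 2 is "Yes"
labels <- c("", "Yes")[idx]
labels
# "Yes" ""    NA
```

Note that an NA condition produces NA in the result rather than "".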
It is better to create the data.frame with stringsAsFactors=FALSE, because in R versions before 4.0.0 the default is TRUE (since R 4.0.0 the default is FALSE). If we check str(df) on such a version, we find that all the columns are of class factor. Also, if there are missing values, NA should be used instead of "" to avoid converting a numeric column to another class.
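As a small illustration of the class issue described above, the sketch below compares a vector that uses "" for a missing value with one that uses NA. The short vectors are made up for the example and are not the question's full data.

```r
# Hypothetical short vectors, for illustration only
with_blank <- c(1, 12, "", 8)  # "" forces the whole vector to character
with_na    <- c(1, 12, NA, 8)  # NA keeps the vector numeric

class(with_blank)
# "character"
class(with_na)
# "numeric"

# With NA the comparison is numeric and the missing value stays missing
with_na < 10
# TRUE FALSE    NA  TRUE
```

On the character vector, `with_blank < 10` would be evaluated as a string comparison, which is generally not what is wanted here.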
If we rewrite the creation of 'df', using NA for the missing distances,
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)
the above code can be simplified to
df1 %>%
  mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
                                   kids == "Yes" &
                                   distance < 10) + 1])
For readability, some people prefer ifelse:
df1 %>%
  mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                            kids == "Yes" &
                            distance < 10,
                            "Yes", ""))
This can also be done easily with base R methods:
df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") &
                                  kids == "Yes" &
                                  distance < 10,
                                  "Yes", ""))
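One thing to watch with the ifelse approaches above is how NA in distance propagates. The sketch below uses a made-up three-row data frame (`demo` is a hypothetical name, not from the question) to show that a row can end up as NA instead of "", and one possible way to treat NA as "condition not met".

```r
# Hypothetical three-row data frame, for illustration only
demo <- data.frame(
  commute  = c("walk", "bike", "drive"),
  kids     = c("Yes", "Yes", "Yes"),
  distance = c(5, NA, NA),
  stringsAsFactors = FALSE
)

cond <- with(demo, commute %in% c("walk", "bike", "subway", "ferry") &
                   kids == "Yes" &
                   distance < 10)
cond
# TRUE    NA FALSE   (FALSE & NA is FALSE, TRUE & NA is NA)

# Plain ifelse keeps the NA
flyer_na <- ifelse(cond, "Yes", "")

# `%in% TRUE` maps NA to FALSE, so every row gets "Yes" or ""
flyer_strict <- ifelse(cond %in% TRUE, "Yes", "")
```

Whether NA or "" is the better outcome for an unknown distance depends on how the column is used later.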
Solution 2
The solution is already pointed out by @akrun. I'd like to present it in a more 'wrapped up' way.
You can use the ifelse function to create a column based on one (or more) conditions. But first you have to change the 'encoding' of missing values in the distance column. You used "" to indicate a missing value; this, however, converts the entire column to string and prevents numerical comparison (distance < 10 no longer compares numbers). The R way of indicating a missing value is NA, so your column definition of distance should be:
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
The ifelse call then looks like this:
df$get.flyer <- ifelse(
(
(df$commute %in% c("walk", "bike", "subway", "ferry")) &
(df$kids == "Yes") &
(df$distance < 10)
),
1, # if condition is met, put 1
0 # else put 0
)
Optional: Consider encoding your other columns in a different way as well:
- you could use TRUE and FALSE instead of "Yes" and "No" for the kids variable
- you could use a factor for commute
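The optional encoding suggested in the list above could look roughly like the sketch below. It uses a shortened, made-up version of the data, and the name `df2` is hypothetical. With kids stored as a logical, the condition no longer needs a string comparison, and the result itself can be kept as TRUE/FALSE.

```r
# Hypothetical shortened data, for illustration only
commute  <- c("walk", "bike", "drive")
kids     <- c("Yes", "No", "Yes")
distance <- c(1, 12, NA)

df2 <- data.frame(
  commute  = factor(commute),  # categorical variable stored as a factor
  kids     = kids == "Yes",    # logical instead of "Yes"/"No"
  distance = distance
)

# %in% also works on factors; kids can be used directly as a condition
df2$get.flyer <- with(df2, commute %in% c("walk", "bike", "subway", "ferry") &
                           kids &
                           distance < 10)
df2$get.flyer
# TRUE FALSE FALSE
```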
Solution 3
Example: check whether the value of first_column_name is contained in second_column_name, and write the result to new_column.
df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))
Details:
df$new_column <- # create a new column named new_column on df
apply(df, 1, # apply the following function to df; `1` means row by row
function(x) # function applied to each row; `x` is the current row
grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE)) # body of the function: grepl checks whether first_column_name occurs in second_column_name; fixed = TRUE means the pattern is matched as plain text, not as a regular expression
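To try the row-wise grepl call described above, here is a minimal self-contained sketch. The data frame `df_demo` and its values are made up for the example. Keep in mind that apply converts the data frame to a matrix first, so with mixed column types every value is handed to the function as a character string. The mapply line is shown only as one possible alternative that avoids that conversion.

```r
# Hypothetical data, for illustration only
df_demo <- data.frame(
  first_column_name  = c("cat", "dog", "a.b"),
  second_column_name = c("concatenate", "cat", "axb"),
  stringsAsFactors = FALSE
)

# Row-wise check as in the answer
df_demo$new_column <- apply(df_demo, 1, function(x)
  grepl(x["first_column_name"], x["second_column_name"], fixed = TRUE))

# Possible alternative: pair the two columns element by element
alt <- mapply(grepl, df_demo$first_column_name, df_demo$second_column_name,
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

unname(df_demo$new_column)
# TRUE FALSE FALSE
```

The third row is FALSE because fixed = TRUE treats the dot in "a.b" literally; as a regular expression it would match "axb".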
Ankie
Updated on July 17, 2022
I'm trying to add a new column to a data frame based on several conditions from other columns. I have the following data:
> commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
> kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
> distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
>
> df = data.frame(commute, kids, distance)
> df
   commute kids distance
1     walk  Yes        1
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7
6     walk  Yes        2
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7
11    walk   No
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7
If the following three conditions are met:
commute = walk OR bike OR subway OR ferry
AND kids = Yes
AND distance is less than 10
Then I'd like a new column called get.flyer to equal "Yes". The final data frame should look like this:
   commute kids distance get.flyer
1     walk  Yes        1       Yes
2     bike  Yes       12       Yes
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7       Yes
6     walk  Yes        2       Yes
7     bike   No
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7       Yes
11    walk   No
12    bike   No        4
13  subway  Yes       16       Yes
14   drive   No       12
15   ferry  Yes        7       Yes