Merge multiple variables in R

10,722

Solution 1

You can use coalesce from dplyr:

library(dplyr)

df %>%
  mutate(DV_1 = coalesce(DV1_A, DV1_B, DV1_C),
         DV_2 = coalesce(DV2_A, DV2_B, DV2_C))

If you have a lot of DV columns to combine, you might not want to type all the column names. In this case, you can first grep the column names for each DV, parse each name to symbols with rlang::syms, then splice (!!!) the symbols in coalesce (Advice from @hadley):

library(rlang)
var_quo1 = syms(grep("DV1", names(df), value = TRUE))
var_quo2 = syms(grep("DV2", names(df), value = TRUE))

df %>%
  mutate(DV_1 = coalesce(!!! var_quo1),
         DV_2 = coalesce(!!! var_quo2))

If instead, you have a ton of DV's, you might not even want to type all the coalesce lines, in this case, you can create a function that outputs one DV column given an input number and lapply + bind_col all of them together:

DV_combine = function(num_DVs){

  DV_name = sym(paste0("DV", num_DVs))
  DV_syms = syms(grep(paste0("DV", num_DVs), names(df), value = TRUE))

  df %>%
    transmute(!!DV_name := coalesce(!!! DV_syms))
}

bind_cols(df, lapply(1:2, DV_combine))

Result:

  ID DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C FACT DV_1 DV_2
1  1     1    NA    NA     3    NA    NA    A    1    3
2  2    NA     4    NA    NA     3    NA    B    4    3
3  3    NA    NA     5    NA    NA     5    C    5    5

Note:

This method will work for both numeric and character class columns, but not factor's. One should first convert the factor columns to character before using this method.

Data:

df = structure(list(ID = c(1, 2, 3), DV1_A = c(1, NA, NA), DV1_B = c(NA, 
4, NA), DV1_C = c(NA, NA, 5), DV2_A = c(3, NA, NA), DV2_B = c(NA, 
3, NA), DV2_C = c(NA, NA, 5), FACT = structure(1:3, .Label = c("A", 
"B", "C"), class = "factor")), .Names = c("ID", "DV1_A", "DV1_B", 
"DV1_C", "DV2_A", "DV2_B", "DV2_C", "FACT"), row.names = c(NA, 
-3L), class = "data.frame")

Solution 2

The base transform will do this:

d <- transform(d, 
               DV1 = rowSums(d[c("DV1_A", "DV1_B", "DV1_C")], na.rm=T),
               DV2 = rowSums(d[c("DV2_A", "DV2_B", "DV2_C")], na.rm=T)
          )

Solution 3

You could also do this via gather and spread with tidyr and dplyr. Less concise than @useR's solution, but might be useful if you need to do any intermediate manipulation.

library(dplyr)
library(tidyr)

df %>% 
  gather(variable, value, -ID, -FACT, na.rm = TRUE) %>% 
  mutate(variable = gsub("\\_[A-Z]", "", variable)) %>% 
  spread(variable, value) %>% 
  left_join(df)

  ID FACT DV1 DV2 DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C
1  1    A   1   3     1    NA    NA     3    NA    NA
2  2    B   4   3    NA     4    NA    NA     3    NA
3  3    C   5   5    NA    NA     5    NA    NA     5
Share:
10,722

Related videos on Youtube

randmlaber
Author by

randmlaber

Updated on June 04, 2022

Comments

  • randmlaber
    randmlaber almost 2 years

    I have a dataset such that the same variable is contained in difference columns for each subject. I want to merge them to the same columns.

    E.g.:, I have this dataframe, and there are three DVs, but they are in different columns (A,B,C) for different subjects.

    data.frame(ID = c(1,2,3), DV1_A=c(1,NA,NA), DV1_B= c(NA,4,NA), DV1_C = c(NA,NA,5), DV2_A=c(3,NA,NA), DV2_B=c(NA,3,NA), DV2_C=c(NA,NA,5), FACT = c("A","B","C"))
    

    How can I merge them to just two columns? so the result is:

    data.frame(ID = c(1,2,3), DV1_A=c(1,NA,NA), DV1_B= c(NA,4,NA), DV1_C = c(NA,NA,5), DV2_A=c(3,NA,NA), DV2_B=c(NA,3,NA), DV2_C=c(NA,NA,5), FACT = c("A","B","C"), DV_1 = c(1,4,5), DV_2 = c(3,3,5))
    
  • hadley
    hadley over 6 years
    Generally I'd advise against splicing data frames into calls with !!!. It's better to splice in variable names
  • acylam
    acylam over 6 years
    Thanks for the advice @hadley, but do you mean something like greping the colnames, parsing to quosures, then splicing with !!!? I was trying to take advantage of starts_with, but yea, I agree that splicing whole dataframes is probably not a good idea.
  • hadley
    hadley over 6 years
    Yeah, exactly. The main challenge is that the select() syntax is so nice that you want to use it in vectorised functions that take ..., but there's no great way to do so currently.