How to merge two dataframes using multiple columns as key?

21,044

Solution 1

As Akrun mentioned in comments, your lookup table (DF2) needs to be reduced to just its unique A/B combinations. For your current dataframe, this isn't a problem, but you will need additional rules if there are multiple possible values for the same combination. From there, the solution is easy:

DF2.u <- unique(DF2)
DF3 <- merge(DF1, DF2.u, all = T)

Note that this will produce a new dataframe with an empty TEST column (all values NA), and a VAL column assigned from DF2. To do exactly what you wanted (replace TEST with VAL where possible), here is some slightly clunkier code:

DF1$TEST <- merge(DF1, DF2.u, all = T)$VAL

EDIT: in response to your question, you can boil down DF2 if necessary quite simple:

DF2$C <- c(1:12) #now unique() won't work
DF2.u <- unique(DF2[1:3])

 A B VAL
1 A C   1
2 A D   3

Solution 2

# this is your DF1    
DF1 <- data.frame("A" = rep(c("A","B"), 18),
                      "B" = rep(c("C","D","E"), 12),
                      "NUM"= rep(rnorm(36,10,1)),
                      "TEST" = rep(NA,36))

#this is a DF2 i created, with unique A, B, VAL
DF2 <- data.frame("A" = rep(c("A","B"),3),
                  "B" = rep(c("C","D","E"),2),
                  "VAL" = rep(1:6))

# and this is the answer of what i assume you want      
tmp <- merge(DF1,DF2, by=c("A","B"), all.x=TRUE, all.y=FALSE)
DF1[4] <- tmp[5]
Share:
21,044
goldisfine
Author by

goldisfine

Updated on March 12, 2020

Comments

  • goldisfine
    goldisfine about 4 years

    Say I have the following dataframes:

    DF1 <- data.frame("A" = rep(c("A","B"), 18),
                      "B" = rep(c("C","D","E"), 12),
                      "NUM"= rep(rnorm(36,10,1)),
                      "TEST" = rep(NA,36))
    
    DF2 <- data.frame("A" = rep("A",6),
                      "B" = rep(c("C","D"),6),
                      "VAL" = rep(c(1,3),3))
    

    *Note: Each unique combination of variables A and B in DF2 should have a unique VAL.

    For each row, I would like to replace the NA in TEST with the corresponding value of VAL in DF1 if the values in columns A and A match and the values in columns B and B match for that row. Otherwise, I'd leave TEST as NA. How would I do this without looping through each combination using match?

    Ideally, an answer would scale to two data frames with many columns to match upon.

  • goldisfine
    goldisfine about 9 years
    In my actual example, I have columns in DF2 which are numeric, and so if I passed the entire DF then I would have a unique DF that was too large. Can I subset DF2 to rows which are only unique in the factor variables?
  • Joe
    Joe about 9 years
    Of course, the actual indices you take will vary based on what your extra columns are and how they are arranged
  • smci
    smci about 9 years
    In Python pandas, you could just simply set a compound key using variables A,B. Then this is a simple left-join/merge.