How can I merge multiple dataframes with the same column names?

25,971

Solution 1

You can use Reduce() and complete.cases() as follows:

merged <- Reduce(function(x, y) merge(x, y, all=TRUE), 
                 list(master, cond_a, cond_b, cond_c, cond_d))
merged[complete.cases(merged), ]
#    userid condition answer1 answer2
# 1     bar         A       1       2
# 2     bar         B       3       4
# 4     bar         C       5       6
# 6     bar         D       7       8
# 8     foo         A       1       2
# 9     foo         B       3       4
# 11    foo         C       5       6
# 13    foo         D       7       8

Reduce() might take some getting accustomed to. You define your function, and then provide a list of objects to repeatedly apply the function to. Thus, that statement is like doing:

temp1 <- merge(master, cond_a, all=TRUE)
temp2 <- merge(temp1, cond_b, all=TRUE)
temp3 <- merge(temp2, ....)

Or something like:

merge(merge(merge(master, cond_a, all=TRUE), cond_b, all=TRUE), cond_c, all=TRUE)

complete.cases() creates a logical vector of whether the specified columns are "complete" or not; this logical vector can be used to subset from the merged data.frame.

Solution 2

As stated by the OP, given that no explicit relationship with the master data frame, an option is this:

temp <-rbind(cond_a,cond_b,cond_c,cond_d)
temp[order(temp["userid"]),]

Perhaps if any relationship was known, there could be a non-simplistic solution.

Solution 3

You can express this join as a SQL statement, and then use the sqldf library to execute it.

cond_all = rbind(cond_a, cond_b, cond_c, cond_d)

> sqldf('select p.userid as userid, p.condition as condition, answer1, answer2 from master as p join cond_all as q on p.userid=q.userid and p.condition=q.condition order by userid, condition')
  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B       3       4
3    bar         C       5       6
4    bar         D       7       8
5    foo         A       1       2
6    foo         B       3       4
7    foo         C       5       6
8    foo         D       7       8

You mentioned in a comment that the master dataframe has extra columns that do not exist in the cond dataframes. You should be able to modify this SQL query to still work for this case.

Share:
25,971
slhck
Author by

slhck

Video quality guy and researcher, PhD student in computer science. Founder/CEO of AVEQ. I offer personal consulting and help with video encoding, especially with FFmpeg. Send a mail to werner.robitza at gmail.com. More info on my website.

Updated on July 10, 2022

Comments

  • slhck
    slhck almost 2 years

    What I have:

    I have a "master" dataframe that has the following columns:

    userid, condition
    

    Since there are four experiment conditions, I also have four dataframes that carry answer information, with the following columns:

    userid, condition, answer1, answer2
    

    Now, I'd like to join these, so all combinations of user IDs, conditions and their answers to these conditions are merged. Each condition should only have the correct answer in the appropriate column, per row.


    Short, self-contained example:

    master = data.frame(userid=c("foo","foo","foo","foo","bar","bar","bar","bar"), condition=c("A","B","C","D","A","B","C","D"))
    cond_a = data.frame(userid=c("foo","bar"), condition="A", answer1=c("1","1"), answer2=c("2","2"))
    cond_b = data.frame(userid=c("foo","bar"), condition="B", answer1=c("3","3"), answer2=c("4","4"))
    cond_c = data.frame(userid=c("foo","bar"), condition="C", answer1=c("5","5"), answer2=c("6","6"))
    cond_d = data.frame(userid=c("foo","bar"), condition="D", answer1=c("7","7"), answer2=c("8","8"))
    

    How do I merge all conditions into the master, so the master table looks like follows?

      userid condition answer1 answer2
    1    bar         A       1       2
    2    bar         B       3       4
    3    bar         C       5       6
    4    bar         D       7       8
    5    foo         A       1       2
    6    foo         B       3       4
    7    foo         C       5       6
    8    foo         D       7       8
    

    I've tried the following:

    temp = merge(master, cond_a, all.x=TRUE)
    

    Which gives me:

      userid condition answer1 answer2
    1    bar         A       1       2
    2    bar         B    <NA>    <NA>
    3    bar         C    <NA>    <NA>
    4    bar         D    <NA>    <NA>
    5    foo         A       1       2
    6    foo         B    <NA>    <NA>
    7    foo         C    <NA>    <NA>
    8    foo         D    <NA>    <NA>
    

    But as soon as I do this…

    merge(temp, cond_b, all.x=TRUE)
    

    There are no values for condition B. How come?

      userid condition answer1 answer2
    1    bar         A       1       2
    2    bar         B    <NA>    <NA>
    3    bar         C    <NA>    <NA>
    4    bar         D    <NA>    <NA>
    5    foo         A       1       2
    6    foo         B    <NA>    <NA>
    7    foo         C    <NA>    <NA>
    8    foo         D    <NA>    <NA>
    
  • slhck
    slhck over 11 years
    Cool, this works great for my specific example! Now, what would I have to do if master data frame included additional columns that I wanted to keep? Basically, I'd just want to assign the answers to the unique userid and condition pairs.
  • slhck
    slhck over 11 years
    So, to be more precise I now get a row with foo, A, 1, 2, other, but then subsequently foo, B, <NA>, <NA>, blah, which would be removed in the complete.cases() step.
  • slhck
    slhck over 11 years
    What exactly do you mean by "explicit relationship"? Obviously, userid and condition are factors in the experiment data, while the answer scores are simple numeric items that belong to a certain user and treatment.
  • slhck
    slhck over 11 years
    I think I got it. When I rbind all condition dataframes first and then merge(master, all_conditions, all.x=TRUE), this seems to give me everything.
  • A_K
    A_K over 11 years
    What I meant was that even without the master df, one could arrive at the output you want. So just cond_[a-d] suffices. If there was more relationships (like columns in there) between master & cond_[a-d] then this might be insufficient as a solution