merge dataframes based on common columns but keeping all rows from x

20,879

Solution 1

I think you have data tables rather than simple dataframes, and merge works slightly differently between the two. You could try forcing it to use the dataframe method by using NewDataframe <- merge.data.frame(x, y, all.x=TRUE) which should by default merge on all shared column names.

Solution 2

You want to merge based on all common columns. So first you need to find out which column names are common between the two dataframes.

common_col_names <- intersect(names(x), names(y))

Then you use this character vector as your by parameters in the merge function.

merge(x, y, by=common_col_names, all.x=TRUE)

Edit: after reading @Andrew Gustar's answer, I double checked the documentation for the merge function, and this is exactly the default by parameter:

## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)), # <-- Look here
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"),
      incomparables = NULL, ...)

Solution 3

The left_join function from dplyr should give you exactly what you want.

First create some dummy data...

shared <- tibble(a=rnorm(327), b=rnorm(327),c=rnorm(327),d=rnorm(327),e=rnorm(327),f=rnorm(327),g=rnorm(327))
x <- cbind(shared, tibble(v =rnorm(327),w=rnorm(327),x=rnorm(327),y=rnorm(327),z=rnorm(327)))
y <- cbind(shared, tibble(h =rnorm(327),i=rnorm(327),j=rnorm(327),k=rnorm(327),l=rnorm(327)))
rm(shared)

Then just run the join...

dplyr::left_join(x,y)
# Joining, by = c("a", "b", "c", "d", "e", "f", "g")

As you can see, the function figures out which columns are shared between the two and automatically uses them in the join.

Share:
20,879
dede
Author by

dede

Updated on April 20, 2020

Comments

  • dede
    dede about 4 years

    I need to merge two dataframes x and y which have about 50 columns in common and some unique columns, and I need to keep all the rows from x.

    It works if I run:

     NewDataframe <- merge(x, y, by=c("ColumnA", "ColumnB", "ColumnC"),all.x=TRUE)
    

    The issue is that there are more than 50 common columns, and I would rather avoid typing the names of all the common columns.

    I have tried with:

     NewDataframe <- merge(x, y, all.x=TRUE)
    

    But the following error appears:

     Error in merge.data.table(x, y, all.x = TRUE) :
     Elements listed in `by` must be valid column names in x and y
    

    Is there any way of using by with the common columns without typing all of them, but keeping all the rows from x?

    Thank you.