merge dataframes based on common columns but keeping all rows from x
Solution 1
I think you have data tables rather than simple dataframes, and merge works slightly differently between the two. You could try forcing it to use the dataframe method by using NewDataframe <- merge.data.frame(x, y, all.x=TRUE)
which should by default merge on all shared column names.
Solution 2
You want to merge based on all common columns. So first you need to find out which column names are common between the two dataframes.
common_col_names <- intersect(names(x), names(y))
Then you use this character vector as your by
parameters in the merge function.
merge(x, y, by=common_col_names, all.x=TRUE)
Edit: after reading @Andrew Gustar's answer, I double checked the documentation for the merge function, and this is exactly the default by
parameter:
## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)), # <-- Look here
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"),
incomparables = NULL, ...)
Solution 3
The left_join
function from dplyr
should give you exactly what you want.
First create some dummy data...
shared <- tibble(a=rnorm(327), b=rnorm(327),c=rnorm(327),d=rnorm(327),e=rnorm(327),f=rnorm(327),g=rnorm(327))
x <- cbind(shared, tibble(v =rnorm(327),w=rnorm(327),x=rnorm(327),y=rnorm(327),z=rnorm(327)))
y <- cbind(shared, tibble(h =rnorm(327),i=rnorm(327),j=rnorm(327),k=rnorm(327),l=rnorm(327)))
rm(shared)
Then just run the join...
dplyr::left_join(x,y)
# Joining, by = c("a", "b", "c", "d", "e", "f", "g")
As you can see, the function figures out which columns are shared between the two and automatically uses them in the join.
dede
Updated on April 20, 2020Comments
-
dede about 4 years
I need to merge two dataframes x and y which have about 50 columns in common and some unique columns, and I need to keep all the rows from x.
It works if I run:
NewDataframe <- merge(x, y, by=c("ColumnA", "ColumnB", "ColumnC"),all.x=TRUE)
The issue is that there are more than 50 common columns, and I would rather avoid typing the names of all the common columns.
I have tried with:
NewDataframe <- merge(x, y, all.x=TRUE)
But the following error appears:
Error in merge.data.table(x, y, all.x = TRUE) : Elements listed in `by` must be valid column names in x and y
Is there any way of using
by
with the common columns without typing all of them, but keeping all the rows from x?Thank you.