Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2
Solution 1
This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:
library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c
The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):
difference <- data.frame(lapply(seq_len(ncol(a1)), function(i) setdiff(a1[, i], comparison$tM[, i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
Solution 2
sqldf provides a nice solution:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
require(sqldf)
a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')
And the rows which are in both data frames:
a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')
The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons:
require(dplyr)
anti_join(a1,a2)
And semi_join to filter rows in a1 that are also in a2:
semi_join(a1,a2)
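Called without a by argument, these joins match on every column name the two data frames share. A minimal sketch, assuming dplyr is installed and that a match should be defined by both a and b, names the join columns explicitly (the variable names only_in_a1 and in_both are illustrative):

```r
library(dplyr)

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Rows of a1 that have no match in a2 on both key columns
only_in_a1 <- anti_join(a1, a2, by = c("a", "b"))

# Rows of a1 that do have a match in a2 on both key columns
in_both <- semi_join(a1, a2, by = c("a", "b"))
```

Spelling out by also keeps the comparison stable if either data frame later gains an extra column.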
Solution 3
In dplyr:
setdiff(a1,a2)
Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.
In the SQLverse this is called a
For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins
But back to this question - here are the results for the setdiff() code when using the OP's data:
> a1
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
> a2
a b
1 1 a
2 2 b
3 3 c
> setdiff(a1,a2)
a b
1 4 d
2 5 e
Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Solution 4
It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)
Missing values in included_a1 mark the rows that are missing from a1; similarly, missing values in included_a2 mark the rows that are missing from a2.
One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get, for free, all the error checking that is necessary for a good solution.
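To read the result of the merge, filter on the indicator columns with is.na(). A minimal sketch in base R, assuming the rows should be matched on both a and b (the names only_in_a1 and only_in_a2 are illustrative):

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Flag each row with the data frame it came from
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE

# Full outer join on the shared data columns
res <- merge(a1, a2, by = c("a", "b"), all = TRUE)

# A row that exists only in a1 has no flag from a2, and vice versa
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
only_in_a2 <- res[is.na(res$included_a1), c("a", "b")]
```

With the data from the question, only_in_a1 holds the rows with a equal to 4 and 5, and only_in_a2 is empty.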
Solution 5
I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.
> df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")
> df_compare$comparison_df
row chng_type a b
1 4 + 4 d
2 5 + 5 e
A more complicated example:
library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Duster 360", "Merc 240D"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
hp = c(110, 110, 181, 110, 245, 62),
cyl = c(6, 6, 4, 6, 8, 4),
qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))
df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
hp = c(110, 110, 93, 110, 175, 105),
cyl = c(6, 6, 4, 6, 8, 6),
qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))
> df_compare$comparison_df
grp chng_type id1 id2 hp cyl qsec
1 1 - Hornet Sportabout Dus 175 8 17.02
2 2 + Datsun 710 Dat 181 4 33.00
3 2 - Datsun 710 Dat 93 4 18.61
4 3 + Duster 360 Dus 245 8 15.84
5 7 + Merc 240D Mer 62 4 20.00
6 8 - Valiant Val 105 6 20.22
The package also has an html_output command for quick checking
Tal Galili
Updated on June 23, 2021
Comments
-
Tal Galili almost 3 years
I have the following 2 data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
I want to find the rows that a1 has and a2 doesn't.
Is there a built-in function for this type of operation?
(P.S.: I did write a solution for it; I am simply curious whether someone has already written more polished code.)
Here is my solution:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
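Pasting the columns of a row together with no separator can make two different rows look identical. The sketch below shows one such collision and one possible way to harden the idea in base R. The helper name rows_not_in and the choice of "\r" as separator are assumptions for illustration only, and the approach still assumes the separator never occurs in the data:

```r
# Hypothetical helper, not part of any package: rows of x that do not occur in y.
rows_not_in <- function(x, y, sep = "\r") {
  # Build one key per row, column by column, with a separator between fields.
  # unname() stops column names from being mistaken for arguments of paste().
  x_key <- do.call(paste, c(unname(as.list(x)), sep = sep))
  y_key <- do.call(paste, c(unname(as.list(y)), sep = sep))
  x[!x_key %in% y_key, , drop = FALSE]
}

# Two different rows that collapse to the same string without a separator
x <- data.frame(a = "ab", b = "c")
y <- data.frame(a = "a", b = "bc")
collide <- apply(x, 1, paste, collapse = "") == apply(y, 1, paste, collapse = "")

# With a separator, the row of x is reported as missing from y
x_only <- rows_not_in(x, y)

# The data from the question
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
a1_only <- rows_not_in(a1, a2)
```

Here collide is TRUE, so the separator-free version would treat the two rows as equal, while x_only keeps the row and a1_only contains the rows with a equal to 4 and 5.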
-
Hendy almost 11 years
I find this function confusing. I thought it would work for me, but it seems to only work as shown above if one set contains identically matching rows of the other set. Consider this case: a2 <- data.frame(a = c(1:3, 1), b = c(letters[1:3], "c")). Leave a1 the same. Now try the comparison. It's not clear to me, even after reading the options, what the proper way is to list only common elements.
-
Louis Maddox almost 10 years
So... in looking for a missing value, you create another missing value... How do you find the missing value(s) in included_a1? :-/
-
drastega over 8 years
Thanks for anti_join and semi_join!
-
Eduardo Leoni over 8 years
Use is.na() and subset, or dplyr::filter.
-
David Arenburg over 7 years
How is this different from what the OP already tried? You've used the exact same code as Tal to compare a single column instead of the whole row (which was the requirement).
-
steveb over 7 years
Since the OP asks for items in a1 that are not in a2, don't you want to use something like semi_join(a1, a2, by = c('a','b'))? In the answer by "Rickard", I see that semi_join was suggested.
-
leerssej about 7 years
Sure! Another great choice, too; particularly if you have dataframes with only a join key and differing column names.
-
3pitt over 6 years
Is there a reason why anti_join would return a null DF, as would sqldf, but the functions identical(a1,a2) and all.equal() would contradict that?
-
Akshay Gaur over 6 years
Just wanted to add here that anti_join and semi_join would not work in some cases like mine. I was getting "Error: Columns must be 1d atomic vectors or lists" for my data frame. Maybe I could process my data so that these functions work. sqldf worked right out of the gate!
-
Bryan F over 5 years
This answer works for the OP's scenario. What about the more general case when the variable "a" does match between the two data.frames ("a1" and "a2"), but the variable "b" does not?
-
mtelesha over 5 years
setdiff is from lubridate::setdiff and not from library(dplyr)
-
leerssej over 5 years
@mtelesha - Hmm, the docs and source code for dplyr show it being there: (dplyr.tidyverse.org/reference/setops.html , github.com/tidyverse/dplyr/blob/master/R/sets.). Additionally, when the dplyr library is loaded it even reports masking the base setdiff() function that works on two vectors: stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html. Maybe you have loaded the lubridate library after dplyr and it is suggesting it as the source in the tabcomplete listing?
-
Rodrigo over 5 years
Thank you for teaching a way without installing a new library!
-
Deep over 5 years
Your compareDF is exactly what I need, and it has done a good job with small sets. However: 1) It does not work with a set of 50 million rows and 3 columns (say); it reports out of memory with 32 GB RAM. 2) I also see that the HTML takes some time to write; can the same output be sent to a TEXT file?
-
Alex Joseph over 5 years
1) Yeah, 50 million rows is A LOT OF data just to hold in memory ;). I'm aware that it is not great with large datasets, so you might have to do some sort of chunking. 2) You can give the argument limit_html = 0 to avoid printing to HTML. The same output is in compare_output$comparison_df, which you can write to a CSV/TEXT file using native R functions.
-
Deep about 5 years
Thanks for your reply @Alex Joseph, I will give it a try and let you know how it goes.
-
Deep about 5 years
Hi @Alex Joseph, thanks for the input. The text format did work, but I found an issue and raised it under: stackoverflow.com/questions/54880218/…
-
slhck almost 5 years
There is a conflict between lubridate and dplyr, see github.com/tidyverse/lubridate/issues/693
-
stucash over 4 years
@AkshayGaur it should just be a data format or data cleaning problem; sqldf is just SQL, and everything is pre-processed to look like a normal DB so that we can just run SQL on the data.
-
PM0087 almost 4 years
It can't handle different numbers of columns. I got an error: The two data frames have different columns!
-
Alex Joseph almost 4 years
@PeyM87 - If the columns are different, it's very easily visible from names(df), right? What is the behaviour you're expecting? If you can create an issue with a reprex on GitHub, I can take a look at it.
-
PM0087 almost 4 years
@AlexJoseph: I have dataframe1 with X number of columns. After some time, new data comes in and I have dataframe2 with Y number of columns, in which some columns are always common. I thought this compare would, for instance, SUM the common columns and add columns if there are any new ones.
-
Alex Joseph almost 4 years
If you have columns changing, doing a setdiff(names(df1), names(df2)) is probably the best approach.
-
bmc about 3 years
Also consider dplyr::intersection