Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2


Solution 1

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
   data.frame(lapply(seq_along(a1), function(i) setdiff(a1[, i], comparison$tM[, i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e
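
One caveat with the column-by-column setdiff above: it compares each column on its own, so it only reconstructs whole rows when every value in the missing rows is unique to them. A minimal base R sketch of the difference follows. The data and the helper row_key are made up for illustration, and a2 is used directly in place of comparison$tM:

```r
# Hypothetical data: column a matches in every row, column b differs in row 3
a1 <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
a2 <- data.frame(a = 1:3, b = c("x", "y", "w"), stringsAsFactors = FALSE)

# Column-wise differences, each column compared independently
col_diff <- lapply(names(a1), function(nm) setdiff(a1[[nm]], a2[[nm]]))
lengths(col_diff)  # 0 1: the columns disagree on how many "rows" are missing

# Row-wise difference: one key per row (row_key is an illustrative helper).
# The "\r" separator is assumed not to occur in the data.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))
row_diff <- a1[!row_key(a1) %in% row_key(a2), ]
row_diff  # the single row a = 3, b = "z"
```

Because the per-column results have different lengths here, wrapping them in data.frame() as in the snippet above would fail, whereas the row-wise version returns the one row that really differs.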

Solution 2

sqldf provides a nice solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

library(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons:

library(dplyr)
anti_join(a1, a2)

And semi_join keeps the rows in a1 that are also in a2:

semi_join(a1, a2)

Solution 3

In dplyr:

setdiff(a1,a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a Left Excluding Join.

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the setdiff() code when using the OP's data:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Solution 4

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

Missing values (NA) in included_a1 mark the rows that are missing from a1, and likewise missing values in included_a2 mark the rows that are missing from a2.
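
To pull those rows out of the merged result, filter on the NA indicators. Here is a small base R sketch using the question's data (subset() or dplyr::filter would do the same job; the names only_in_a1 and only_in_a2 are just for illustration):

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Tag each data.frame, then do a full outer merge on the shared columns a and b
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all = TRUE)

# Rows of a1 with no match in a2 have NA in included_a2
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
only_in_a1  # the rows a = 4, b = "d" and a = 5, b = "e"

# Rows of a2 with no match in a1 have NA in included_a1 (none in this example)
only_in_a2 <- res[is.na(res$included_a1), c("a", "b")]
```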

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get, for free, all the error checking that is necessary for a good solution.
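
As an illustration of the "coded as the same" problem, here is a contrived sketch (the data is made up) in which pasting the fields together without a separator makes two different rows look identical, while merge compares column by column:

```r
# Two different rows in a1 that concatenate to the same string "112"
a1 <- data.frame(a = c("1", "11"), b = c("12", "2"), stringsAsFactors = FALSE)
a2 <- data.frame(a = "11", b = "2", stringsAsFactors = FALSE)

# Paste-based keys, as in the question's function
key1 <- apply(a1, 1, paste, collapse = "")  # "112" "112"
key2 <- apply(a2, 1, paste, collapse = "")  # "112"
naive <- a1[!key1 %in% key2, ]
nrow(naive)  # 0: the row a = "1", b = "12" is wrongly treated as present in a2

# Indicator columns plus merge keep the columns separate
a1$in_a1 <- TRUE
a2$in_a2 <- TRUE
res <- merge(a1, a2, all = TRUE)
kept <- res[is.na(res$in_a2), c("a", "b")]
kept  # the row a = "1", b = "12"
```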

Solution 5

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

A more complicated example:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

df_compare = compare_df(df1, df2, c("id1", "id2"))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking:

df_compare$html_output

Author: Tal Galili

Updated on June 23, 2021

Comments

  • Tal Galili
    Tal Galili almost 3 years

    I have the following 2 data.frames:

    a1 <- data.frame(a = 1:5, b=letters[1:5])
    a2 <- data.frame(a = 1:3, b=letters[1:3])
    

    I want to find the rows that a1 has and a2 doesn't.

    Is there a built-in function for this type of operation?

    (P.S.: I did write a solution for it; I am simply curious whether someone has already written more polished code.)

    Here is my solution:

    a1 <- data.frame(a = 1:5, b=letters[1:5])
    a2 <- data.frame(a = 1:3, b=letters[1:3])
    
    rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
    {
        a1.vec <- apply(a1, 1, paste, collapse = "")
        a2.vec <- apply(a2, 1, paste, collapse = "")
        a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
        return(a1.without.a2.rows)
    }
    rows.in.a1.that.are.not.in.a2(a1,a2)
    
  • Hendy
    Hendy almost 11 years
    I find this function confusing. I thought it would work for me, but it seems to only work as shown above if one set contains identically matching rows of the other set. Consider this case: a2 <- data.frame(a = c(1:3, 1), b = c(letters[1:3], "c")). Leave a1 the same. Now try the comparison. It's not clear to me even in reading the options what the proper way is to list only common elements.
  • Louis Maddox
    Louis Maddox almost 10 years
    So... in looking for a missing value, you create another missing value... How do you find the missing value(s) in included_a1? :-/
  • drastega
    drastega over 8 years
    Thanks for anti_join and semi_join!
  • Eduardo Leoni
    Eduardo Leoni over 8 years
    use is.na() and subset, or dplyr::filter
  • David Arenburg
    David Arenburg over 7 years
    How is this different from what OP already tried? You've used the exact same code like Tal to compare a single column instead of the whole row (which was the requirement)
  • steveb
    steveb over 7 years
    Since the OP asks for items in a1 that are not in a2, don't you want to use something like semi_join(a1, a2, by = c('a','b')) ? In the answer by "Rickard", I see that semi_join was suggested.
  • leerssej
    leerssej about 7 years
    Sure! Another great choice, too; particularly if you have dataframes with only a join key and differing column names.
  • 3pitt
    3pitt over 6 years
    is there a reason why anti_join would return a null DF, as would sqldf, but the functions identical(a1,a2) and all.equal() would contradict that?
  • Akshay Gaur
    Akshay Gaur over 6 years
    Just wanted to add here that anti_join and semi_join would not work in some cases like mine. I was getting "Error: Columns must be 1d atomic vectors or lists" for my data frame. Maybe I could process my data so that these functions work. Sqldf worked right out of the gate!
  • Bryan F
    Bryan F over 5 years
    This answer works for the OP's scenario. What about the more general case when the variable "a" does match between the two data.frames("a1" and "a2"), but the variable "b" does not?
  • mtelesha
    mtelesha over 5 years
    setdiff is from lubridate::setdiff and not from library(dplyr)
  • leerssej
    leerssej over 5 years
    @mtelesha - Hmm, the docs and source code for dplyr show it being there: (dplyr.tidyverse.org/reference/setops.html , github.com/tidyverse/dplyr/blob/master/R/sets.). Additionally, when the dplyr library is loaded it even reports masking the base setdiff() function that works on two vectors: stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html. Maybe you have loaded the lubridate library after dplyr and it is suggesting it as the source in the tabcomplete listing?
  • Rodrigo
    Rodrigo over 5 years
    Thank you for teaching a way without installing a new library!
  • Deep
    Deep over 5 years
    Your compareDF is exactly what I need, and it has done a good job with small sets. However: 1) It does not work with a set of 50 million rows and 3 columns (say); it reports out of memory with 32 GB RAM. 2) I also see that the HTML takes some time to write; can the same output be sent to a text file?
  • Alex Joseph
    Alex Joseph over 5 years
    1) Yeah, 50 million rows is A LOT OF data just to hold in memory ;). I'm aware that it is not great with large datasets, so you might have to do some sort of chunking. 2) You can give the argument limit_html = 0 to avoid printing to HTML. The same output is in compare_output$comparison_df, which you can write to a CSV/text file using native R functions.
  • Deep
    Deep about 5 years
    Thanks for your reply @Alex Joseph , I will give it a try and let you know how it goes.
  • Deep
    Deep about 5 years
    Hi @Alex Joseph, thanks for the input the text format did work but found an issue , raised it under: stackoverflow.com/questions/54880218/…
  • slhck
    slhck almost 5 years
    There is a conflict between lubridate and dplyr, see github.com/tidyverse/lubridate/issues/693
  • stucash
    stucash over 4 years
    @AkshayGaur it should just be a data format or data cleaning problem; sqldf is just SQL, and everything is pre-processed to look like a normal DB so that we can run SQL on the data.
  • PM0087
    PM0087 almost 4 years
    It can't handle different numbers of columns. I got an error The two data frames have different columns!
  • Alex Joseph
    Alex Joseph almost 4 years
    @PeyM87 - If the columns are different, it's very easily visible from names(df), right? What is the behaviour you're expecting? If you can create an issue with a reprex on GitHub, I can take a look at it.
  • PM0087
    PM0087 almost 4 years
    @AlexJoseph: I have dataframe1 with X number of columns. After some time, new data comes in and I have dataframe2 with Y number of columns, in which some columns are always common. I thought this compare would, for instance, SUM the common columns and add columns if there are any new ones.
  • Alex Joseph
    Alex Joseph almost 4 years
    If you have columns changing, doing a setdiff(names(df1), names(df2)) is probably the best approach
  • bmc
    bmc about 3 years
    Also consider dplyr::intersect
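
For the case raised in the comments where the two data frames do not have the same columns, the setdiff(names(df1), names(df2)) suggestion can be sketched in base R as follows. The data frames and the indicator column in_df2 are hypothetical, and restricting the row comparison to the shared columns is one possible choice rather than what compareDF does:

```r
# Hypothetical data frames that share only the columns id and x
df1 <- data.frame(id = 1:3, x = c(10, 20, 30), old = c("a", "b", "c"))
df2 <- data.frame(id = 2:4, x = c(20, 30, 40), new = c(TRUE, FALSE, TRUE))

# Which columns exist on only one side, and which are shared
only_in_df1 <- setdiff(names(df1), names(df2))  # "old"
only_in_df2 <- setdiff(names(df2), names(df1))  # "new"
common <- intersect(names(df1), names(df2))     # "id" "x"

# Compare rows on the shared columns only, using an indicator and a left merge
df2_common <- df2[common]
df2_common$in_df2 <- TRUE
res <- merge(df1[common], df2_common, all.x = TRUE)
rows_only_in_df1 <- res[is.na(res$in_df2), common]
rows_only_in_df1  # the row id = 1, x = 10
```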