How to check if two data frames are equal

95,752

Solution 1

Look up all.equal(). It comes with some caveats, but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
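
One caveat worth knowing: when the objects differ, all.equal() does not return FALSE but a character vector describing the differences, so it should not be used directly as an if() condition. A minimal sketch, reusing the df1 and df3 definitions from the question (the names res and same are just illustrative), wraps the call in isTRUE():

```r
# Data frames as defined in the question
df1 <- data.frame(num = 1:5, let = letters[1:5])
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])

# all.equal() describes the differences instead of returning FALSE
res <- all.equal(df1, df3)
is.character(res)

# isTRUE() collapses that result to a single TRUE/FALSE
same <- isTRUE(all.equal(df1, df3))
if (same) message("data frames are equal") else message("data frames differ")
```

With isTRUE() the result is always a single logical value, which is what if() and stopifnot() expect.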

Solution 2

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or give hints about the differences between the objects. For instance, consider the following:

> identical(df1, df3)
[1] FALSE

> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"                                
[2] "Component 1: Numeric: lengths (5, 6) differ"                                                
[3] "Component 2: Lengths: 5, 6"                                                                 
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"   

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
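
Speed is not the only difference: identical() requires an exact match, while all.equal() compares numeric values up to a small tolerance and treats integer and double columns with the same values as equal. The sketch below uses made-up data frames (a, b, ints, dbls are illustrative names, not from the question) to show the two functions disagreeing, which may also be why identical() can return FALSE on real data where isTRUE(all.equal()) returns TRUE:

```r
# Floating-point noise: 0.1 + 0.2 is not exactly 0.3 in double precision
a <- data.frame(x = 0.1 + 0.2)
b <- data.frame(x = 0.3)
exact_float  <- identical(a, b)          # FALSE: values differ in the last bits
approx_float <- isTRUE(all.equal(a, b))  # TRUE: difference is within tolerance

# Storage type: integer column versus double column with the same values
ints <- data.frame(x = 1:3)
dbls <- data.frame(x = c(1, 2, 3))
exact_type  <- identical(ints, dbls)          # FALSE: integer is not double
approx_type <- isTRUE(all.equal(ints, dbls))  # TRUE: values are numerically equal
```

Which behaviour is preferable depends on whether such differences matter for the comparison at hand.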

Author: Waldir Leoncio

Perpetual student.

Updated on July 08, 2022

Comments

  • Waldir Leoncio
    Waldir Leoncio almost 2 years

    Say I have large datasets in R and I just want to know whether two of them are the same. I use this often when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:

    df1 <- data.frame(num = 1:5, let = letters[1:5])
    df2 <- df1
    df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
    df4 <- df3
    

    So this is what I do to compare them:

    table(x == y, useNA = 'ifany')
    

    Which works great when the datasets have no NAs:

    > table(df1 == df2, useNA = 'ifany')
    TRUE 
      10 
    

    But not so much when they have NAs:

    > table(df3 == df4, useNA = 'ifany')
    TRUE <NA> 
      11    1 
    

    In the example, it's easy to dismiss the NA as not a problem, since we know that both data frames are equal. The problem is that NA == <anything> yields NA. So whenever one of the datasets has an NA, it doesn't matter what the other one has in that same position: the result is always going to be NA.

    So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?

    P.S.: Note that this is not a duplicate of "R - comparing several datasets", "Comparing 2 datasets in R" or "Compare datasets in R".

  • Waldir Leoncio
    Waldir Leoncio over 10 years
    I just got to know this function and will further test it to see if it really works for this particular task, but so far, so good. Thanks!
  • Ricardo Saporta
    Ricardo Saporta over 10 years
    It's important to note that if the items being compared are NOT equal, then all.equal will not return FALSE. Instead, you have to use isTRUE( all.equal(df2,df1) ) to get a TRUE/FALSE output from all.equal
  • Waldir Leoncio
    Waldir Leoncio over 10 years
    @RicardoSaporta, you're right, but in that case I believe it is better to just go ahead and use identical(), as @Metrics suggested above. The thing about all.equal() is that it returns a vector "describing the differences between target and current", which can be good or bad depending on what kind of output you're looking for.
  • sbha
    sbha almost 6 years
    dplyr::all_equal() is another option. By default it ignores column and row order, and is sensitive to variable classes, but those defaults can be overridden: dplyr::all_equal(target, current, ignore_col_order = FALSE, ignore_row_order = FALSE, convert = TRUE)
  • Dan Chaltiel
    Dan Chaltiel almost 6 years
    For my two big data frames, identical(df2, df1) returns FALSE but isTRUE(all.equal(df2, df1)) returns TRUE (with all_equal() also). Any idea why?
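
The NA behaviour described in the question can be seen in isolation. The short sketch below reuses df3 and df4 from the question (the result names are illustrative) and shows that == propagates NA, whereas identical() and all.equal() treat NAs in matching positions as equal:

```r
# Data frames with an NA, as defined in the question
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

# Element-wise comparison propagates NA, even for NA against NA
na_vs_value <- NA == 1
na_vs_na    <- NA == NA
any_na      <- anyNA(df3 == df4)

# Whole-object comparison is not affected by the NA
same_identical <- identical(df3, df4)
same_all_equal <- isTRUE(all.equal(df3, df4))
```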