Check equality for two Spark DataFrames in Scala


Solution 1

import org.scalatest.{BeforeAndAfterAll, FeatureSpec, Matchers}

outDf.collect() should contain theSameElementsAs dfComparable.collect()

// or (NB: except is a set difference, so duplicate rows and row ordering are ignored):
outDf.except(dfComparable).count should be(0)
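The first check compares the collected rows as a multiset: same elements with the same multiplicities, regardless of order. A Spark-free sketch of that semantics (the row tuples here are hypothetical data, not from the question):

```scala
object MultisetEquality {
  // Multiset equality: same elements with the same multiplicities,
  // regardless of order. This mirrors what ScalaTest's
  // `contain theSameElementsAs` asserts on the collected rows.
  def sameElements[A](left: Seq[A], right: Seq[A]): Boolean = {
    def counts(xs: Seq[A]): Map[A, Int] =
      xs.groupBy(identity).map { case (k, v) => (k, v.size) }
    counts(left) == counts(right)
  }

  def main(args: Array[String]): Unit = {
    val expected = Seq(("a", 1), ("b", 2), ("b", 2))
    val shuffled = Seq(("b", 2), ("a", 1), ("b", 2))
    val missing  = Seq(("a", 1), ("b", 2))
    println(sameElements(expected, shuffled)) // true: order is ignored
    println(sameElements(expected, missing))  // false: multiplicities differ
  }
}
```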

Solution 2

If you want to check whether two DataFrames are equal for testing purposes, you can use the subtract() method of DataFrame (supported in version 1.3 and above).

You can check whether the diff of both DataFrames is empty, i.e. has count 0, e.g. df1.subtract(df2).count() == 0
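A plain-Scala sketch of that check, assuming distinct set-difference semantics (like SQL's EXCEPT DISTINCT); the sample data is hypothetical:

```scala
object SubtractCheck {
  // Distinct set difference: the distinct elements of `left` that never
  // appear in `right`. If this is empty, every row of `left` also
  // occurs in `right` (duplicates and ordering are ignored).
  def subtract[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.distinct.filterNot(right.toSet)

  def main(args: Array[String]): Unit = {
    val df1 = Seq(("a", 1.0), ("b", 2.0))
    val df2 = Seq(("b", 2.0), ("a", 1.0))
    println(subtract(df1, df2).isEmpty) // true: same rows, different order
  }
}
```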

Author: codeinstyle

Updated on June 27, 2022

Comments

  • codeinstyle almost 2 years

    I'm new to Scala and am having problems writing unit tests.

    I'm trying to compare and check equality for two Spark DataFrames in Scala for unit testing, and realized that there is no easy way to check equality for two Spark DataFrames.

    The C++ equivalent code would be (assuming that the DataFrames are represented as double arrays in C++):

        int expected[10][2];
        int result[10][2];
        for (int row = 0; row < 10; row++) {
            for (int col = 0; col < 2; col++) {
                if (expected[row][col] != result[row][col]) return false;
            }
        }
    

    The actual test would involve testing for equality based on the data types of the columns of the DataFrames (testing with precision tolerance for floats, etc).

    It seems like there's no easy way to iteratively loop over all the elements in the DataFrames using Scala, and the other solutions for checking equality of two DataFrames, such as df1.except(df2), do not work in my case because I need to support testing equality with tolerance for floats and doubles.

    Of course, I could try to round all the elements beforehand and compare the results afterwards, but I would like to see if there are any other solutions that would allow me to iterate through the DataFrames to check for equality.
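    One way to get tolerance-aware equality is to collect both DataFrames and compare the resulting arrays element-wise, much like the C++ loop above. A sketch over plain 2-D arrays of doubles (the collect step and the epsilon value are assumptions, not part of the question):

    ```scala
    object ApproxEquality {
      // Element-wise comparison of two equally-shaped 2-D arrays with an
      // absolute tolerance, for floating-point columns.
      def approxEqual(expected: Array[Array[Double]],
                      result: Array[Array[Double]],
                      eps: Double = 1e-6): Boolean =
        expected.length == result.length &&
          expected.zip(result).forall { case (eRow, rRow) =>
            eRow.length == rRow.length &&
              eRow.zip(rRow).forall { case (e, r) => math.abs(e - r) <= eps }
          }

      def main(args: Array[String]): Unit = {
        val a = Array(Array(1.0, 2.0), Array(3.0, 4.0))
        val b = Array(Array(1.0000001, 2.0), Array(3.0, 4.0))
        println(approxEqual(a, b))        // true within the default 1e-6
        println(approxEqual(a, b, 1e-9))  // false at a tighter tolerance
      }
    }
    ```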

  • codeinstyle over 7 years
    Thanks for the suggestion, but df1.except(df2), which I mentioned in my question, has the same functionality as df1.subtract(df2) and does not really work in this situation, where I am hoping to compare the values with precision tolerance.
  • bogdan.rusu over 4 years
    The except function already returns a DataFrame, so there is no need for toDF.
  • Theo over 3 years
    outDf.except(dfComparable).count should be(0) is not a good choice, because .except returns the rows from the left side that are not in the right side. If rows are missing from the left side, the test will not fail. assertSmallDataFrameEquality is a better alternative; see stackoverflow.com/questions/31197353/…
  • Theo over 3 years
    assertDataFrameEquals from spark-testing-base is also an alternative.
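Theo's caveat about the one-directional except check can be seen with plain collections: a difference taken in only one direction misses rows that exist only on the right side, so a symmetric check needs both directions. A sketch with hypothetical data:

```scala
object SymmetricDiff {
  // A one-directional difference misses extra rows on the right side;
  // checking both directions gives a symmetric equality test.
  def symmetricEqual[A](left: Seq[A], right: Seq[A]): Boolean =
    left.diff(right).isEmpty && right.diff(left).isEmpty

  def main(args: Array[String]): Unit = {
    val left  = Seq("a", "b")
    val right = Seq("a", "b", "c")
    println(left.diff(right).isEmpty)    // true: the one-way check passes...
    println(symmetricEqual(left, right)) // false: ...but the datasets differ
  }
}
```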