PySpark - Compare DataFrames

13,541

Solution 1

So I create a third DataFrame, joining DataFrame1 and DataFrame2, and then filter by the counts fields to check if they are equal or not:

Mismatch:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count != df3.df2_count).show()

Match:

df3 = df1.join(df2, [df1.name == df2.name] , how = 'inner' )
df3.filter(df3.df1_count == df3.df2_count).show()

Hope this comes in useful for someone

Solution 2

For small DataFrame comparisons, you can use the chispa library. This is particularly useful when performing DataFrame comparisons in a test suite. For big datasets, the accepted answer that uses a join is the best approach.

In this example, chispa.assert_df_equality(df1, df2), will output this error message:

enter image description here

The rows that mismatch are red and the rows that match are blue. This post has more info on testing PySpark code.

There's a cool library called deequ that is good for "data unit tests", but I'm not sure if there is a PySpark implementation.

Solution 3

The easy way is to use the diff transformation from the spark-extension package:

from gresearch.spark.diff import *

left = spark.createDataFrame([("Alice", 1500), ("Bob", 1000), ("Charlie", 150), ("Dexter", 100)], ["name", "count"])
right = spark.createDataFrame([("Alice", 1500), ("Bob", 200), ("Charlie", 150), ("Dexter", 10)], ["name", "count"])

diff = left.diff(right, 'name')
diff.show()
+----+-------+----------+-----------+                                           
|diff|   name|left_count|right_count|
+----+-------+----------+-----------+
|   N|  Alice|      1500|       1500|
|   C|    Bob|      1000|        200|
|   N|Charlie|       150|        150|
|   C| Dexter|       100|         10|
+----+-------+----------+-----------+

This shows you mismatch (C) and match (N) in one DataFrame.

And, of course, you can filter to get mismatches and matches only:

diff.where(diff['diff'] == 'C').show()
+----+------+----------+-----------+                                            
|diff|  name|left_count|right_count|
+----+------+----------+-----------+
|   C|   Bob|      1000|        200|
|   C|Dexter|       100|         10|
+----+------+----------+-----------+

diff.where(diff['diff'] == 'N').show()
+----+-------+----------+-----------+                                           
|diff|   name|left_count|right_count|
+----+-------+----------+-----------+
|   N|  Alice|      1500|       1500|
|   N|Charlie|       150|        150|
+----+-------+----------+-----------+

While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved, which is fully supported by this solution.

Share:
13,541
Admin
Author by

Admin

Updated on August 17, 2022

Comments

  • Admin
    Admin over 1 year

    I'm new to PySpark, So apoloigies if this is a little simple, I have found other questions that compare dataframes but not one that is like this, therefore I do not consider it to be a duplicate. I'm trying to compare two dateframes with similar structure. The 'name' will be unique, yet the counts could be different.

    So if the count is different I would like it to produce a dataframe or a python dictionary. just like below. Any ideas on how I would achieved something like this?

    DF1

    +-------+---------+
    |name   | count_1 |
    +-------+---------+
    |  Alice|   1500  |
    |    Bob|   1000  |
    |Charlie|   150   |
    | Dexter|   100   |
    +-------+---------+
    

    DF2

    +-------+---------+
    |name   | count_2 |
    +-------+---------+
    |  Alice|   1500  |
    |    Bob|   200   |
    |Charlie|   150   |
    | Dexter|   10    |
    +-------+---------+
    

    To produce the outcome:

    Mismatch

    +-------+-------------+--------------+
    |name   | df1_count   | df2_count    |
    +-------+-------------+--------------+
    |    Bob|   1000      |    200       |
    | Dexter|   100       |     10       |
    +-------+-------------+--------------+
    

    Match

    +-------+-------------+--------------+
    |name   | df1_count   | df2_count    |
    +-------+-------------+--------------+
    |  Alice|   1500      |   1500       |
    |Charlie|   150       |    150       |
    +-------+-------------+--------------+