Comparing two tables for equality in HIVE

Solution 1

The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null, because `<>` evaluates to NULL (not true) when either operand is NULL. That means that you are effectively doing an inner join.

The second one will find rows that exist in t1 but not in t2.
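As a hedged sketch of how the first query could treat NULL like an ordinary value, the comparisons can be written with the null-safe equality operator `<=>`. This assumes Hive 0.9.0 or later, where that operator is available:

```sql
-- Sketch: the first query with NULL-safe comparisons.
-- a <=> b is true when both sides are NULL and false when only one is,
-- so not (a <=> b) never evaluates to NULL.
select count(*) from table1 t1
left outer join table2 t2
on t1.key = t2.key
where t2.key is null
   or not (t1.c1 <=> t2.c1)
   or not (t1.c2 <=> t2.c2)
   or not (t1.c3 <=> t2.c3);
```

Like the original, this only looks from table1's side, so rows that exist only in table2 are not counted.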

To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:

select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
   or t2.key is null /* this condition matches rows that only exist in t1 */
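If c1, c2, or c3 can contain NULL, one possible variation is to compare those columns with the null-safe operator `<=>` in the join condition. This is only a sketch: it assumes Hive 0.9.0 or later, and it assumes that key itself is never NULL, because the where clause relies on a NULL key to detect unmatched rows.

```sql
-- Sketch: full outer join that treats NULL = NULL as a match for c1..c3.
-- Assumption: key is NOT NULL in both tables.
select count(*) from table1 t1
full outer join table2 t2
on t1.key = t2.key
   and t1.c1 <=> t2.c1 and t1.c2 <=> t2.c2 and t1.c3 <=> t2.c3
where t1.key is null  -- rows that only exist in table2
   or t2.key is null; -- rows that only exist in table1
```

A zero count means every row in each table has at least one matching row in the other. It does not compare how many times a row is repeated.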

Solution 2

If you want to check for rows that appear in both tables, and the tables have exactly the same structure and contain no duplicates within themselves, then you can do:

select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
      (select t2.*, 2 as which from table2 t2)
     ) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;

There are various ways that you can relax the conditions in the first paragraph, if necessary.

Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
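One way the "no duplicates within a table" condition might be relaxed is to count each row's occurrences per table and keep only the groups where the two counts differ. The sketch below wraps that in a count(*), so a single number comes back. It lists the columns explicitly and repeats the aggregate expressions in the having clause rather than relying on aliases; the aliases u, d, and mismatched_groups are illustrative.

```sql
-- Sketch: multiset comparison that tolerates duplicate rows.
-- Returns 0 when every distinct row occurs equally often in both tables.
select count(*) as mismatched_groups
from (
  select u.key, u.c1, u.c2, u.c3
  from (
    select t1.key, t1.c1, t1.c2, t1.c3, 1 as which from table1 t1
    union all
    select t2.key, t2.c1, t2.c2, t2.c3, 2 as which from table2 t2
  ) u
  group by u.key, u.c1, u.c2, u.c3
  having sum(case when u.which = 1 then 1 else 0 end)
      <> sum(case when u.which = 2 then 1 else 0 end)
) d;
```

Because group by puts NULL values into the same group, this variation keeps the NULL-friendly behaviour of the query above.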

Solution 3

Well, the best way is to calculate a hash sum for each table and compare the two sums. No matter how many columns there are or what data types they have, as long as the two tables have the same schema, you can use the following queries to do the comparison:

select sum(hash(*)) from t1;
select sum(hash(*)) from t2;

And you just need to compare the return values.
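Equal sums do not strictly prove that the tables are equal: hash() returns an int, so different rows can collide or offset each other in the sum. A cheap extra safeguard is to compare row counts as well. The following is a sketch under a few assumptions: the columns are listed explicitly (key, c1, c2, c3, as in the question) instead of using `*`, cross join is available (Hive 0.10 or later), and the alias names are illustrative.

```sql
-- Sketch: compare row counts and hash sums in a single query.
-- coalesce handles empty tables, where sum() would return NULL.
select (a.cnt = b.cnt and a.hash_sum = b.hash_sum) as probably_equal
from (select count(*) as cnt,
             coalesce(sum(hash(key, c1, c2, c3)), 0) as hash_sum
      from table1) a
cross join
     (select count(*) as cnt,
             coalesce(sum(hash(key, c1, c2, c3)), 0) as hash_sum
      from table2) b;
```

A result of false shows that the tables differ; true only makes equality likely.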

Solution 4

I would recommend not using any JOINs to compare tables:

  • it is quite an expensive operation when the tables are big (which is often the case in Hive)
  • it can cause problems when some rows/"join keys" are repeated

(and it can also be impractical when the data are in different clusters/datacenters/clouds).

Instead, I think using a checksum approach and comparing the checksums of both tables is best.

I have developed a Python script that lets you easily do such a comparison and see the differences in a web browser:

https://github.com/bolcom/hive_compared_bq

I hope that can help you!

Author: Danzo

Updated on July 09, 2022

Comments

  • Danzo
    Danzo almost 2 years

    I have two tables, table1 and table2. Each with the same columns:

    key, c1, c2, c3
    

    I want to check whether these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):

    select count(*) from table1 t1 
    left outer join table2 t2
    on t1.key=t2.key
    where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
    

    And

    select count(*) from table1 t1
    left outer join table2 t2
    on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
    where t2.key is null
    

    So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.

  • Danzo
    Danzo almost 9 years
    So this checks for duplicates, but how does it ensure that the tables are matching? Say table 1 has row (1,2,3,4), and table 2 has row (1,2,3,5). Is this query going to return both of these rows because cnt=1? @GordonLinoff
  • Danzo
    Danzo almost 9 years
    Any idea how you would do this in HIVEQL? @AHocevar
  • Gordon Linoff
    Gordon Linoff almost 9 years
    @Danzo . . . Yes. This query will return all rows that have no match in the other table. You can make this a subquery and do a count(*) to see if there are any such rows.
  • A Hocevar
    A Hocevar almost 9 years
    And this is why I need more coffee... There is no MINUS operator in HiveQL, you will have to go with a full outer join as suggested by @Klas Lindbäck
  • Danzo
    Danzo almost 9 years
    For the first one, if I know that none of the columns are null, does that imply that it checks if the tables are equal? I'm not entirely sure of the implications of your first statement. @KlasLindback
  • Klas Lindbäck
    Klas Lindbäck almost 9 years
    @Danzo Yes, it is sufficient that one of the tables has no null values.
  • Randy
    Randy over 6 years
    counting not only if the key matches but also the value in c1
  • mr.dev.null
    mr.dev.null about 4 years
    It fails when one of the columns of an entry is null in both tables. Is there any way to solve this?
  • LLL
    LLL almost 4 years
    Is there any way I can use this in PySpark?