Comparing two tables for equality in HIVE
Solution 1
The first query excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null, because a <> comparison with NULL never evaluates to true. That means that you are effectively doing an inner join.
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1, you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
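The difference between the two queries from the question can be reproduced on a tiny data set. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example runs anywhere), and the sample rows are made up. NULL comparisons behave the same way in both engines for these queries.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    -- Hypothetical sample rows: key 1 differs in c2 (NULL vs 'x').
    INSERT INTO table1 VALUES (1, 'a', NULL, 'c'), (2, 'd', 'e', 'f');
    INSERT INTO table2 VALUES (1, 'a', 'x',  'c'), (2, 'd', 'e', 'f');
""")

# First query from the question: join on key, compare columns in WHERE.
# NULL <> 'x' evaluates to NULL, so the differing row is filtered out.
first_query = """
    select count(*) from table1 t1
    left outer join table2 t2 on t1.key = t2.key
    where t2.key is null
       or t1.c1 <> t2.c1 or t1.c2 <> t2.c2 or t1.c3 <> t2.c3
"""

# Second query from the question: compare columns in the join condition.
# NULL = 'x' is not true, so the row finds no match and t2.key is NULL.
second_query = """
    select count(*) from table1 t1
    left outer join table2 t2
      on t1.key = t2.key and t1.c1 = t2.c1
     and t1.c2 = t2.c2 and t1.c3 = t2.c3
    where t2.key is null
"""

first_count = conn.execute(first_query).fetchone()[0]
second_count = conn.execute(second_query).fetchone()[0]
print(first_count, second_count)
```

If NULLs in the same position should count as equal, Hive's null-safe equality operator <=> can be used in place of = in the join condition.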
Solution 2
If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
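As a rough check of the query above, here is a small runnable sketch. It again uses Python's sqlite3 module in place of Hive (an assumption for illustration only), hypothetical sample rows, and it leaves out the unused "which" column. One row differs between the tables, and one identical row contains a NULL.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER);
    CREATE TABLE table2 (key INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER);
    -- Hypothetical rows: key 1 differs in c3, key 2 is identical and has a NULL.
    INSERT INTO table1 VALUES (1, 2, 3, 4), (2, NULL, 5, 6);
    INSERT INTO table2 VALUES (1, 2, 3, 5), (2, NULL, 5, 6);
""")

# Stack both tables, group on every column, and keep groups that do not
# appear exactly twice. GROUP BY puts NULLs in the same group, so the
# identical row with a NULL is counted twice and is not reported.
query = """
    select t.key, t.c1, t.c2, t.c3, count(*) as cnt
    from (select key, c1, c2, c3 from table1
          union all
          select key, c1, c2, c3 from table2) t
    group by t.key, t.c1, t.c2, t.c3
    having count(*) <> 2
    order by t.key, t.c3
"""

unmatched = conn.execute(query).fetchall()
for row in unmatched:
    print(row)
```

Both versions of the key 1 row come back with cnt = 1, while the key 2 row is absent from the result. An empty result would mean the tables hold the same rows, under the no-duplicates condition stated above.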
Solution 3
Well, the best way is to calculate a hash sum for each table and compare the two sums. No matter how many columns there are or what data types they have, as long as the two tables have the same schema, you can use the following queries to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
And you just need to compare the return values.
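The idea can be sketched outside Hive as follows. This is only an illustration under stated assumptions: SQLite (via Python's sqlite3) stands in for Hive, row_hash is a hypothetical user-defined function built on CRC32 and is not Hive's hash(), and the sample tables are made up.

```python
import sqlite3
import zlib

def row_hash(*values):
    # Hypothetical per-row hash: CRC32 of the row's Python repr.
    # This is NOT the function Hive uses; it only plays the same role.
    return zlib.crc32(repr(values).encode("utf-8"))

conn = sqlite3.connect(":memory:")
# Register the function so SQL can call it with any number of arguments.
conn.create_function("row_hash", -1, row_hash)
conn.executescript("""
    CREATE TABLE t1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE t2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE t3 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO t1 VALUES (1, 'a', 'b', 'c'), (2, 'd', NULL, 'f');
    -- t2 holds the same rows as t1, inserted in a different order.
    INSERT INTO t2 VALUES (2, 'd', NULL, 'f'), (1, 'a', 'b', 'c');
    -- t3 differs from t1 in one value of one row.
    INSERT INTO t3 VALUES (1, 'a', 'b', 'c'), (2, 'd', NULL, 'g');
""")

def table_checksum(table):
    # Table names come from the fixed list above, so formatting is safe here.
    sql = "select sum(row_hash(key, c1, c2, c3)) from {}".format(table)
    return conn.execute(sql).fetchone()[0]

sum_t1 = table_checksum("t1")
sum_t2 = table_checksum("t2")
sum_t3 = table_checksum("t3")
print(sum_t1 == sum_t2, sum_t1 == sum_t3)
```

Because the per-row hashes are summed, row order does not matter. Different sums prove the tables differ. Equal sums make equality very likely but do not strictly prove it, since different rows can produce the same hash or the same total.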
Solution 4
I would recommend not using any JOINs to compare tables:
- it is quite an expensive operation when tables are big (which is often the case in Hive)
- it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when the data are in different clusters/datacenters/clouds).
Instead, I think using a checksum approach and comparing the checksums of both tables is best.
I have developed a Python script that lets you easily run such a comparison and see the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!
Danzo
Updated on July 09, 2022
Comments
-
Danzo almost 2 years
I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check to see if these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1 left outer join table2 t2 on t1.key=t2.key where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1 left outer join table2 t2 on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
-
Danzo almost 9 years
So this checks for duplicates, but how does it ensure that the tables are matching? Say table 1 has row (1,2,3,4), and table 2 has row (1,2,3,5). Is this query going to return both of these rows because cnt=1? @GordonLinoff
-
Danzo almost 9 years
Any idea how you would do this in HIVEQL? @AHocevar
-
Gordon Linoff almost 9 years
@Danzo . . . Yes. This query will return all rows that have no match in the other table. You can make this a subquery and do a count(*) to see if there are any such rows.
-
A Hocevar almost 9 years
And this is why I need more coffee... There is no MINUS operator in HiveQL, you will have to go with a full outer join as suggested by @Klas Lindbäck
-
Danzo almost 9 years
For the first one, if I know that none of the columns are null, does that imply that it checks if the tables are equal? I'm not entirely sure of the implications of your first statement. @KlasLindback
-
Klas Lindbäck almost 9 years
@Danzo Yes, it is sufficient that one of the tables has no null values.
-
Randy over 6 years
counting not only if the key matches but also the value in c1
-
mr.dev.null about 4 years
It fails when one of the columns is null in both tables. Any way to solve this?
-
LLL almost 4 years
Is there any way I can use this in PySpark?