HDFS File Comparison


Solution 1

Hadoop does not provide a diff command, but you can use process substitution in a shell that supports it (such as bash) to feed both files to the local diff command:

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)

If you only need to know whether two files are identical, without seeing the differences, I would suggest a checksum-based approach: get the checksum of each file and compare them. I think Hadoop doesn't need to compute the checksums from the file data because block checksums are already stored, so it should be fast, but I may be wrong. Recent Hadoop versions expose this on the command line as hadoop fs -checksum, and you can also do it with the Java API in a small app:

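// Needs org.apache.hadoop.fs.FileSystem, FileChecksum and Path; conf is a Configuration
FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// getFileChecksum may return null; compare with equals(), not ==, which only compares references
return chksum1 != null && chksum1.equals(chksum2);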

Solution 2

Well, the simplest answer is probably:

diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)

This runs entirely on your local machine, streaming both files from HDFS. If that's too slow, then yes, you'd have to do something with Hive and MapReduce, but that's a little trickier and won't exactly match the in-order comparison that diff does.
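To illustrate the difference between diff's in-order comparison and the order-insensitive comparison that a join-based approach would give, here is a minimal Java sketch that treats each input as a multiset of lines. The names LineMultisetCompare, countLines and sameLines are hypothetical, and the sketch keeps every distinct line in memory, so it only shows the idea on small inputs and is not a substitute for a distributed job. With HDFS, the readers would presumably wrap streams obtained from FileSystem.open(...).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: compares two text inputs as multisets of lines,
// ignoring line order (unlike diff, which compares lines in order).
class LineMultisetCompare {

    // Counts how many times each line occurs in the input.
    static Map<String, Long> countLines(Reader in) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            counts.merge(line, 1L, Long::sum);
        }
        return counts;
    }

    // True if both inputs contain the same lines the same number of times.
    static boolean sameLines(Reader a, Reader b) throws IOException {
        return countLines(a).equals(countLines(b));
    }
}
```

Two files holding the same records in a different order compare as equal here, whereas diff would report differences.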

Author: ftw

Updated on August 04, 2022

Comments

  • ftw (almost 2 years ago)

    How can I compare two HDFS files since there is no diff?

    I was thinking of using Hive tables and loading data from HDFS and then using join statements on 2 tables. Is there any better approach?

  • vinayak_narune (over 4 years ago)
    I want to compare two HDFS directories: one has compacted data (4 files) and the other has 50 uncompacted files. How can I compare the directories...
  • Omar Khan (almost 4 years ago)
    The checksums may differ if the two copies were written with different block sizes, even when the file contents are the same.
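
One way to sidestep the block-size caveat in the last comment is to compare the bytes themselves rather than the stored checksums. The following is a minimal Java sketch of that idea for any two input streams; the names StreamCompare and contentEquals are hypothetical. On HDFS the streams would presumably come from FileSystem.open(path), and since this reads both files completely through the client it will be slower than comparing stored checksums.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: byte-by-byte comparison of two streams,
// independent of how the underlying storage splits the data into blocks.
class StreamCompare {

    // Returns true only if both streams yield exactly the same bytes
    // and end at the same position. Stops at the first mismatch.
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        InputStream x = new BufferedInputStream(a);
        InputStream y = new BufferedInputStream(b);
        int bx;
        do {
            bx = x.read();
            int by = y.read();
            if (bx != by) {
                return false; // differing byte, or one stream ended early
            }
        } while (bx != -1);
        return true;
    }
}
```

The caller is responsible for closing the streams, for example with try-with-resources.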