Find all duplicate files by MD5 hash


Solution 1

From "man xargs": -I implies -L 1 So this is not most efficient. It would be more efficient, if you just give as many filenames to md5sum as possible, which would be:

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Then you won't have the file size, of course. If you really need the file size, create a shell script that runs md5sum and du -h and merges the lines with join.
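A minimal sketch of that idea (assuming GNU md5sum, du, sort, join and uniq, filenames without embedded spaces or newlines, and purely illustrative temp file paths):

#!/bin/sh
# Rough sketch: produce "hash  path" and "size<TAB>path" listings, then merge them on the path.
find . -type f -print0 | xargs -0 md5sum | sort -k 2 > /tmp/hashes
find . -type f -print0 | xargs -0 du -h | sort -k 2 > /tmp/sizes
# Join on the path (field 2 in both files), emit "hash size path",
# then group identical hashes exactly as before.
join -1 2 -2 2 -o 1.1,2.1,0 /tmp/hashes /tmp/sizes | sort | uniq -w32 --all-repeated=separate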

Solution 2

Sometimes we are working with a reduced set of Linux commands, such as BusyBox or whatever else ships with NAS boxes and other embedded Linux hardware (IoT devices). In these cases we can't use options like -print0, and filenames with unusual characters can cause trouble. So we may prefer instead:

find . -type f | while IFS= read -r file; do md5sum "$file"; done > /destination/file

Then our /destination/file is ready for any further processing, such as sort and uniq, as usual.
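For example, to get the same duplicate grouping as in Solution 1 (note that -w and --all-repeated are GNU uniq extensions; a stripped-down uniq may only offer something like -d):

sort /destination/file | uniq -w32 --all-repeated=separate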



Author: Jamie Curran

Updated on September 18, 2022

Comments

  • Jamie Curran
    Jamie Curran almost 2 years

    I'm trying to find all duplicate files (based on MD5 hash) and ordered by file size. So far I have this:

    find . -type f -print0 | xargs -0 -I "{}" sh -c 'md5sum "{}" | cut -f1 -d " " | tr "\n" " "; du -h "{}"' | sort -h -k2 -r | uniq -w32 --all-repeated=separate
    

    The output of this is:

    1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture.sh
    1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.sh
    1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.s
    
    d41d8cd98f00b204e9800998ecf8427e 0      ./test(1).log
    

    Is this the most efficient way?

    • Jamie Curran
      Jamie Curran over 11 years
      Ok, that's a fair point. But looking at this as a learning exercise for the Linux command line, can it be improved? For instance, I originally started off with -exec 'md5sum.....', but research (via Google) suggested xargs was more efficient (see the note on -exec after these comments).
    • Paulo Scardine
      Paulo Scardine over 11 years
      If you want to learn new techniques, I suggest looking at how these tools solve the problem; you will get a lot of clever ideas (the source, Luke, use the source).
  • Jody Bruchon
    Jody Bruchon almost 4 years
    XFS and APFS now have CoW features as well. jdupes does block-level/CoW dedupe on Linux BTRFS/XFS, and Apple APFS.
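
A note on the -exec vs. xargs comment above: POSIX find also accepts -exec ... {} +, which batches many filenames into a single md5sum invocation much like xargs does, so only the ';' terminator pays the one-process-per-file cost. A sketch for comparison:

# One md5sum process per large batch of files (roughly as efficient as xargs):
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

# One md5sum process per file (the slower form):
# find . -type f -exec md5sum {} \; | sort | uniq -w32 --all-repeated=separate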