Find all duplicate files by MD5 hash
Solution 1
From "man xargs": -I implies -L 1, so xargs starts a separate command for every single file, which is not the most efficient approach. It is more efficient to pass as many file names as possible to each md5sum invocation:
find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
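The difference between the two xargs modes can be seen with a small sketch that uses echo as a stand-in for md5sum. The input values a, b and c are made up for illustration.

```shell
# With -I (which implies -L 1), xargs runs one echo process per input line,
# so this prints three separate lines: a, b, c.
printf '%s\n' a b c | xargs -I {} echo {}

# Without -I, xargs packs all arguments into a single echo process,
# so this prints one line: a b c
printf '%s\n' a b c | xargs echo
```

With thousands of files, the first form pays the process start-up cost once per file, while the second pays it once per batch.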
The batched md5sum pipeline does not report file sizes, of course. If you really need the file size, create a shell script that runs md5sum and du -h and merges the lines with join.
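One way such a script might look is sketched below. It is not the answerer's own script: the function name find_dupes_with_size is hypothetical, and the sketch assumes GNU findutils and coreutils and file names that contain no tabs or newlines.

```shell
#!/bin/sh
# Sketch only: assumes GNU find/xargs/coreutils and no tabs or newlines in file names.
# find_dupes_with_size is a hypothetical helper name, not an existing tool.
find_dupes_with_size() {
  dir=${1:-.}
  tab=$(printf '\t')
  tmp=$(mktemp -d) || return 1

  # Build "path<TAB>hash", hashing many files per md5sum process.
  find "$dir" -type f -print0 | xargs -0 -r md5sum |
    sed "s/^\([0-9a-f]*\)  \(.*\)\$/\2${tab}\1/" |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/hashes"

  # Build "path<TAB>size"; -l makes du report hard-linked files every time.
  find "$dir" -type f -print0 | xargs -0 -r du -h -l |
    awk -F '\t' -v OFS='\t' '{ print $2, $1 }' |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/sizes"

  # Merge both lists on the path, print "hash size path",
  # then keep only the lines whose 32-character hash repeats.
  LC_ALL=C join -t "$tab" -o 1.2,2.2,0 "$tmp/hashes" "$tmp/sizes" |
    sort | uniq -w32 --all-repeated=separate

  rm -rf "$tmp"
}
```

Calling find_dupes_with_size . would print tab-separated hash, size and path for every duplicated file below the current directory, with a blank line between groups.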
Solution 2
Sometimes we work with a reduced set of Linux commands, such as BusyBox or the tools shipped with NAS devices and other embedded Linux hardware (IoT). In these cases options like -print0 may not be available, which causes trouble with file names that contain spaces. So we may prefer instead:
find . -type f | while IFS= read -r file; do md5sum "$file"; done > /destination/file
Then /destination/file is ready for any further processing with sort and uniq, as usual.
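On a reduced system, uniq may lack the GNU-only --all-repeated option, so the grouping step can also be done with sort and awk. The following is a minimal sketch under that assumption; list_dupes is a hypothetical helper name, and it expects a file of md5sum output lines such as /destination/file.

```shell
# Sketch only: list_dupes is a hypothetical helper, not a standard command.
# Input: a file of md5sum lines ("<32 hex chars>  <path>").
# Output: only the lines whose hash occurs more than once,
# with a blank line between groups of identical hashes.
list_dupes() {
  sort "$1" | awk '
    { hash = substr($0, 1, 32) }
    hash == prev {
      # Second or later line of a group: emit the held first line once.
      if (!shown) {
        if (any) print ""
        print prevline
        any = 1
      }
      print
      shown = 1
      next
    }
    # First line of a new hash: hold it until a duplicate shows up.
    { prev = hash; prevline = $0; shown = 0 }
  '
}
```

For example, list_dupes /destination/file prints every duplicated file grouped by hash and silently drops files whose hash is unique.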
Jamie Curran
Updated on September 18, 2022
Comments
-
Jamie Curran almost 2 years
I'm trying to find all duplicate files (based on MD5 hash) and ordered by file size. So far I have this:
find . -type f -print0 | xargs -0 -I "{}" sh -c 'md5sum "{}" | cut -f1 -d " " | tr "\n" " "; du -h "{}"' | sort -h -k2 -r | uniq -w32 --all-repeated=separate
The output of this is:
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K ./picture.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K ./picture2.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K ./picture2.s

d41d8cd98f00b204e9800998ecf8427e 0 ./test(1).log
Is this the most efficient way?
-
Jamie Curran over 11 years
OK, that's a fair point. But looking at this as a learning exercise for the Linux command line, can this be improved? For instance, I originally started off with -exec 'md5sum.....', but research (using Google) found that xargs was more efficient.
-
Paulo Scardine over 11 years
If you want to learn new techniques, I suggest looking at how these tools solve the problem; you will get a lot of clever ideas (the source, Luke, use the source).
-
artfulrobot about 9 years
See also unix.stackexchange.com/a/71178/23542
-
Jody Bruchon almost 4 years
XFS and APFS now have CoW features as well. jdupes does block-level/CoW dedupe on Linux BTRFS/XFS and on Apple APFS.