How to remove duplicate files using bash

Solution 1

I'm working on Linux, which means the command is md5sum, which outputs:

> md5sum *
d41d8cd98f00b204e9800998ecf8427e  file_1
d41d8cd98f00b204e9800998ecf8427e  file_10
d41d8cd98f00b204e9800998ecf8427e  file_2
d41d8cd98f00b204e9800998ecf8427e  file_3
d41d8cd98f00b204e9800998ecf8427e  file_4
d41d8cd98f00b204e9800998ecf8427e  file_5
d41d8cd98f00b204e9800998ecf8427e  file_6
d41d8cd98f00b204e9800998ecf8427e  file_7
d41d8cd98f00b204e9800998ecf8427e  file_8
d41d8cd98f00b204e9800998ecf8427e  file_9
b026324c6904b2a9cb4b88d6d61c81d1  other_file_1
31d30eea8d0968d6458e0ad0027c9f80  other_file_10
26ab0db90d72e28ad0ba1e22ee510510  other_file_2
6d7fce9fee471194aa8b5b6e47267f03  other_file_3
48a24b70a0b376535542b996af517398  other_file_4
1dcca23355272056f04fe8bf20edfce0  other_file_5
9ae0ea9e3c9c6e1b9b6252c8395efdc1  other_file_6
84bc3da1b3e33a18e8d5e1bdd7a18d7a  other_file_7
c30f7472766d25af1dc80b3ffc9a58c7  other_file_8
7c5aba41f53293b712fd86d08ed5b36e  other_file_9

Now using awk and xargs the command would be:

md5sum * | \
sort | \
awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | \
xargs rm

The awk part initializes lasthash with the empty string, which will not match any hash, and then checks for each line whether the hash in lasthash is the same as the hash (first column) of the current file (whose name is in the second column). If it is, it prints the filename. At the end of every step it sets lasthash to the hash of the current file (you could limit this to only happen when the hashes differ, but that is a minor optimization, especially if you do not have many matching files). The filenames awk spits out are fed to rm via xargs, which calls rm with whatever the awk part gives us.
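
To preview what would be removed before deleting anything, a simple dry-run sketch is to let xargs call echo rm instead of rm, so the command line is only printed:

md5sum * | \
sort | \
awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | \
xargs echo rm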

You probably need to filter directories before md5sum *.
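
One way to do that, sketched here with find instead of the shell glob (restricted to regular files in the current directory; -maxdepth is not POSIX but exists in both GNU and BSD find):

find . -maxdepth 1 -type f -exec md5sum {} + | \
sort | \
awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | \
xargs rm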

Edit:

Using Marcin's method you could also use this one:

comm -2 -3 \
  <(ls | sort) \
  <(md5sum * | \
    sort -k1 | \
    uniq -w 32 | \
    awk '{print $2}' | \
    sort) | \
xargs rm

This subtracts from the file list obtained by ls the first filename of each unique hash obtained by md5sum * | sort -k1 | uniq -w 32 | awk '{print $2}', so only the duplicate filenames are passed on to xargs rm.
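
As a toy illustration of the set subtraction comm performs here (the file names all_files and keepers are made up for the example, and comm needs both inputs sorted):

printf 'dupe_1\nkeep_1\nkeep_2\n' > all_files
printf 'keep_1\nkeep_2\n' > keepers
comm -2 -3 all_files keepers   # prints only dupe_1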

Solution 2

You can identify duplicate files using the following command:

md5sum * | sort -k1 | uniq -w 32 -d
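
If you would rather see every member of each duplicate group instead of one representative line per group, GNU uniq can print all repeated lines, separated by blank lines; a sketch assuming GNU coreutils:

md5sum * | sort -k1 | uniq -w 32 --all-repeated=separate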

Solution 3

I ran across fdupes as an answer to this similar question: https://superuser.com/questions/386199/how-to-remove-duplicated-files-in-a-directory

I was able to apt-get install fdupes on Ubuntu. You will definitely want to read the man page. In my case, I was able to get the desired results like so:

fdupes -qdN -r /ops/backup/

Which says "look recursively through /ops/backup and find all duplicate files: keep the first copy of any given file, and quietly remove the rest." This makes it very easy to keep several dumps of an infrequent-write database.
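
Since the original question asks for a scheduled job, this could be dropped into cron. A sketch, where the 03:00 schedule and the /usr/bin/fdupes path are assumptions (check the output of which fdupes on your system):

# m h dom mon dow  command
0 3 * * * /usr/bin/fdupes -qdN -r /ops/backup/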

Solution 4

md5sum * | sort -k1 | uniq -w 32 -d | cut -d' ' -f3 | xargs -I{} sh -c 'rm {}'
  1. take all the md5 values
  2. sort them so dupes are sequential for uniq
  3. run uniq to output dupes only
  4. cut the filename from the line with the md5 value
  5. repeatedly call delete on the filenames
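
A variant of the same pipeline that copes with spaces in filenames, assuming GNU xargs: cut -f3- keeps everything after the hash and the two separating spaces, and xargs -d '\n' treats each input line as a single argument:

md5sum * | sort -k1 | uniq -w 32 -d | cut -d' ' -f3- | xargs -d '\n' rm --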

Solution 5

If you are in a hurry (or you have lots of files) and want to avoid the overhead of a sort (it takes time), but don't mind the memory overhead of a hash table (or you have plenty of RAM for your lots of files), you can use:

find . -type f -print0 | xargs -0 md5sum | awk 'h[$1]{ printf "%s\0", $2; next }{ h[$1] = $2 }' | xargs -0 rm

find . -type f -print0: Find all files and output them with null-terminated names

xargs -0 md5sum: calculate the hashes (optionally in parallel; adjust -n max-args and add -P max-procs as desired, see man xargs)

awk 'h[$1]{ printf "%s\0", $2; next }{ h[$1] = $2 }': if there is an entry in the awk hashtable containing the md5sum we are currently looking at, then print out the filename we are currently looking at, null-terminated. Otherwise, add the filename to the hash table.

xargs -0 rm: take the piped-in null terminated strings and send them to rm.

This is way faster than fdupes.

If you have a filename that contains spaces or a newline, awk will probably truncate it, since $2 stops at the first whitespace and md5sum separates its records by newlines too.
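
A sketch of the same idea that tolerates spaces (though still not newlines or backslashes, which GNU md5sum escapes): instead of printing $2, save the hash, strip the hash and the two separating spaces from the whole record, and keep the same seen-hash table:

find . -type f -print0 | xargs -0 md5sum | \
awk '{ hash = $1; sub(/^[^ ]+  /, ""); if (hash in h) printf "%s\0", $0; else h[hash] = 1 }' | \
xargs -0 rm --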

This is based on https://stackoverflow.com/questions/11532157/remove-duplicate-lines-without-sorting and https://stackoverflow.com/questions/9133315/how-can-i-output-null-terminated-strings-in-awk

Comments

  • warren
    warren over 1 year

    I have a folder with duplicate (by md5sum (md5 on a Mac)) files, and I want to have a job scheduled to remove any found.

    However, I'm stuck on how to do this. What I have so far:

    md5 -r * | sort
    

    Which outputs something like this:

    04c5d52b7acdfbecd5f3bdd8a39bf8fb gordondam_en-au11915031300_1366x768.jpg
    1e88c6899920d2c192897c886e764fc2 fortbourtange_zh-cn9788197909_1366x768.jpg
    266ea304b15bf4a5650f95cf385b16de nebraskasupercell_fr-fr11286079811_1366x768.jpg
    324735b755c40d332213899fa545c463 grossescheidegg_en-us10868142387_1366x768.jpg
    3993028fcea692328e097de50b26f540 Soyuz Spacecraft Rolled Out For Launch of One Year Crew.png
    677bcd6006a305f4601bfb27699403b0 lechaustria_zh-cn7190263094_1366x768.jpg
    80d03451b88ec29bff7d48f292a25ce6 ontariosunrise_en-ca10284703762_1366x768.jpg
    b6d9d24531bc62d2a26244d24624c4b1 manateeday_row10617199289_1366x768.jpg
    ca1486dbdb31ef6af83e5a40809ec561 Grueling Coursework.jpg
    cdf26393577ac2a61b6ce85d22daed24 Star trails over Mauna Kea.jpg
    dc3ad6658d8f8155c74054991910f39c smoocave_en-au10358472670_1366x768.jpg
    dc3ad6658d8f8155c74054991910f39c smoocave_en-au10358472670_1366x7682.jpg
    

    How can I process based on the MD5 of the file to remove duplicates? I don't really care which "original" I keep - but I only want to keep one.

    Should I be approaching this in a different manner?

    • derobert
      derobert about 9 years
There is already an fdupes command which will do this... Not sure which distro you're on, but it's in a Debian package with the same name. Also, MD5 is a fairly weak hash nowadays; you probably want to use sha256sum or, better, sha512sum (which should actually be faster on normal PC hardware)
    • warren
      warren about 9 years
      @derobert - I know MD5 is a weak hash, but it's simple and good enough for comparing small files :)
    • derobert
      derobert about 9 years
well, MD5 collisions are known and can be generated with fairly modest amounts of CPU power, and there is a command you can substitute in to use SHA-512 instead, so...
    • warren
      warren about 9 years
      @derobert - I understand there are potential negatives to using md5 .. but it's more than adequate in this scenario (wanting to ensure I don't have duplicate files in a given directory)
    • derobert
      derobert about 9 years
      Fair enough. Main point of the comment was to point out fdupes.
  • warren
    warren about 9 years
    thanks: md5 -r * | sort -t ' ' -k 4 -r | awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | xargs rm is the call on OS X (per @Stephen Kitt's suggestion, I put an echo 'will remove ' at the end before trying it with rm)
  • Stephen Kitt
    Stephen Kitt about 9 years
    You should probably edit your question then because the given solution doesn't match the example output you gave.
  • warren
    warren about 9 years
    @TristanStorch - good spot :)
  • meso_2600
    meso_2600 over 6 years
    @TristanStorch I don't get your comment "And you should change sort -t ' ' -k 4 -r to sort"
  • golimar
    golimar over 4 years
    This outputs only one file for each set of duplicates. If you want to delete all duplicates and keep only one, you can keep the ones you get with this command and delete the rest
  • warren
    warren about 3 years
    does this work with files that have spaces?
  • warren
    warren about 3 years
    unfortunately, -w is not supported on BSD/macOS
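
    Since -w is a GNU extension, a rough portable sketch that avoids it, assuming only POSIX sort and awk (use md5 -r in place of md5sum on macOS), is to compare the hash column in awk and list all but the first file of each identical-hash group:

    md5sum * | sort | awk 'seen[$1]++'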