Search and delete duplicate files with different names

Solution 1

There is such a program, and it's called rdfind:

SYNOPSIS
   rdfind [ options ] directory1 | file1 [ directory2 | file2 ] ...

DESCRIPTION
   rdfind  finds duplicate files across and/or within several directories.
   It calculates checksum only if necessary.  rdfind  runs  in  O(Nlog(N))
   time with N being the number of files.

   If  two  (or  more) equal files are found, the program decides which of
   them is the original and the rest are considered  duplicates.  This  is
   done  by  ranking  the  files  to each other and deciding which has the
   highest rank. See section RANKING for details.

It can delete the duplicates, or replace them with symbolic or hard links.
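
In practice an invocation looks something like this (a sketch based on the options documented in the man page; do a dry run and check the results.txt report it writes before deleting anything):

# preview only: report the duplicates and write results.txt
rdfind -dryrun true /path/to/music

# delete the duplicates, keeping the highest-ranked file in each set
rdfind -deleteduplicates true /path/to/music

# or keep every path but turn the duplicates into hard links
rdfind -makehardlinks true /path/to/music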

Solution 2

Hmmph. I just developed a one-liner to list all duplicates, for a question that turned out to be a duplicate of this. How meta. Well, shame to waste it, so I'll post it, though rdfind sounds like a better solution.

This at least has the advantage of being the "real" Unix way to do it ;)

find -name '*.mp3' -print0 | xargs -0 md5sum | sort | uniq -Dw 32

Breaking the pipeline down:

find -name '*.mp3' -print0 finds all mp3 files in the subtree starting at the current directory, printing the names NUL-separated.

xargs -0 md5sum reads the NUL-separated list and computes a checksum on each file.

You know what sort does.

uniq -Dw 32 compares the first 32 characters of the sorted lines and prints only the ones that have the same hash.

So you end up with a list of all duplicates. You can then whittle that down manually to the ones you want to delete, strip off the hashes, and feed the list to rm (via xargs, for example).
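
If you want to automate that last step, here is a rough sketch built on the same pipeline (it assumes GNU uniq, awk and xargs, and it breaks on filenames containing newlines): it keeps the first file in each duplicate group and writes the rest to a list you can review before deleting.

# list every duplicate except the first file in each group
find . -name '*.mp3' -print0 | xargs -0 md5sum | sort \
    | uniq --all-repeated=separate -w 32 \
    | awk 'BEGIN { RS = ""; FS = "\n" } { for (i = 2; i <= NF; i++) print substr($i, 35) }' \
    > dupes-to-delete.txt

# inspect dupes-to-delete.txt, then:
# xargs -d '\n' rm -- < dupes-to-delete.txt

The substr($i, 35) strips the 32-character hash plus the two separator spaces that md5sum prints, leaving just the file path on each line.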

Solution 3

I'm glad you got the job done with rdfind.

Next time you could also consider rmlint. It's extremely fast and offers a few different options to help determine which file is the original in each set of duplicates.
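
For anyone landing here later, a basic rmlint run goes roughly like this (my sketch, not a full workflow; see the rmlint docs for the options that control which file counts as the original):

rmlint /path/to/music
# rmlint writes its findings to rmlint.sh (and rmlint.json) in the current
# directory; review that script, then run it to actually remove the dupes
sh ./rmlint.sh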

Solution 4

I'd be thinking of using Perl:

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use Digest::SHA qw ( sha1_hex );

# checksum => first path seen with that checksum
my %seen;

sub find_dupes {
    return unless -f;    # only hash regular files

    # slurp the whole file and hash its contents; File::Find chdirs into
    # each directory, so open $_ (the bare name) and report $File::Find::name
    open( my $input, "<", $_ ) or do {
        warn "cannot open $File::Find::name: $!";
        return;
    };
    binmode $input;
    my $data = do { local $/; <$input> } // '';
    close($input);
    my $sha1sum = sha1_hex($data);

    if ( exists $seen{$sha1sum} ) {
        print "$File::Find::name is probably a dupe of $seen{$sha1sum} - both have $sha1sum\n";
    }
    $seen{$sha1sum} //= $File::Find::name;    # remember the first occurrence
}

find( \&find_dupes, "/path/to/search", "/another/path/to/search" );
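
Assuming you save it as find-dupes.pl (the name is arbitrary) and adjust the paths passed to find(), you can capture the report and deal with the listed duplicates by hand:

perl find-dupes.pl > dupes.txt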

Comments

  • Cestarian, almost 2 years ago

    I have a large music collection stored on my hard drive, and while browsing through it I found that I have a lot of duplicate files in some album directories. Usually the duplicates exist alongside the original in the same directory.

    Usually the original is filename.mp3 and the duplicate is filename 1.mp3. Sometimes there may be more than one duplicate, and I have no idea whether there are duplicate files across folders (for example, duplicates of entire album directories).

    Is there any way I can scan for these duplicate files (for example by comparing filesize, or comparing the entire files to check if they are identical), review the results, and then delete the duplicates? The ones that have a longer name, or the ones that have a more recent modified/created date would usually be the targets of deletion.

    Is there a program out there that can do this on Linux?

    • Admin, about 9 years ago
      @VincentNivoliers Thanks, I guess my question is a duplicate in the end. Although I wasn't asking for a program specifically for music files (I just used music as an example), that question does have the answers I need to solve my specific problem.
    • Admin, about 9 years ago
      I would say that if your files are bit-for-bit identical but have different names, then the question holds, and I would suggest using a hashing program combined with a hash table to propose duplicates. For music collections this is probably not the case if the equivalent files come from different sources.
    • Admin, about 9 years ago
      Yes, that's what I meant; music files were just my example, but since my exact scenario does involve music files, the other thread probably has a good solution for me already. A hashing program sounds like it might be a good solution independent of file types - know of any?
  • Cestarian, about 9 years ago
    I'm trying this program out now.
  • Cestarian, about 9 years ago
    This worked pretty well: rdfind /mnt/stash/music told me a total of 1 GB could be cleared up and created a results.txt file listing all the duplicates. rdfind -deleteduplicates true /mnt/stash/music then deleted 2104 duplicate files for me. Thanks! The program is very fast - it took only a minute to do the initial scan through my 200+ GB music folder and only a few seconds to delete the duplicates on its second run. It would have been nice if it also deleted the emptied folders, though.
  • Cestarian, about 9 years ago
    Yeah, I did not like rdfind's approach to picking the originals; it has happened that files I would have considered the duplicates were not the ones deleted (i.e. the original was deleted). It doesn't really bother me much, though it would if I were more particular about it...
  • Cestarian, over 8 years ago
    For the record, this is what I was originally hoping to see as an answer :P A one-liner with no need to download any extra software (not that I have anything against that, I just like keeping things "clean"). Then again, rdfind involves less legwork: manually sorting out and removing files with matching checksums is a bit of work, and ideally it would be done automatically, say by deleting all files with the same checksum except the one with the shortest filename.
  • golimar, over 6 years ago
    Nice one-liner. One of the things that rdfind does better is that it checks file sizes first in order to exclude unique files from the list.
  • ashleedawg, almost 4 years ago
    Didn't work for me (command not found), but the other answer did.