How to delete all duplicate hardlinks to a file?

Solution 1

In the end it wasn't too hard to do this manually, based on Stéphane's and xenoid's hints and some prior experience with find.
I had to adapt a few commands to work with FreeBSD's non-GNU tools — GNU find has the -printf option that could have replaced the -exec stat, but FreeBSD's find doesn't have that.

# create a list of "<inode number> <tab> <full file path>"
find rsnapshots -type f -links +1 -exec stat -f '%i%t%R' {} + > inodes.txt

# sort the list by inode number (to have consecutive blocks of duplicate files)
sort -n inodes.txt > inodes.sorted.txt

# remove the first file from each block (we want to keep one link per inode)
awk -F'\t' 'BEGIN {lastinode = 0} {inode = 0+$1; if (inode == lastinode) {print $2}; lastinode = inode}' inodes.sorted.txt > inodes.to-delete.txt

# delete duplicates (IFS= and -r keep leading whitespace and backslashes in filenames intact;
# filenames that contain newlines would still break this step)
while IFS= read -r line; do rm -f "$line"; done < inodes.to-delete.txt
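
As noted above, GNU find's -printf could replace the -exec stat in the first step. A minimal sketch of that variant, for anyone not on FreeBSD:

# GNU find only: print "<inode number> <tab> <full file path>" directly
find rsnapshots -type f -links +1 -printf '%i\t%p\n' > inodes.txt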

Solution 2

To find the inodes that have more than one link:

 find . -exec stat -c '%i' {} \; | sort -n | uniq -d

Then you can iterate that list with

 find -inum {inode_number}

to list the files that share the inode. Which to remove is up to you.
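
If you want to inspect the groups in one pass, the two steps can be combined into a small loop. This is only a sketch (it limits itself to regular files and re-runs find once per duplicated inode, so it is slow on large trees):

find . -type f -exec stat -c '%i' {} \; | sort -n | uniq -d |
while read -r inode; do
    echo "inode $inode:"
    find . -type f -inum "$inode"
done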

Solution 3

If you know that all the files have hardlinks within a single directory hierarchy only, you can simply do

find inputdir -type f -links +1 -exec rm {} \;

The reason this works is that rm {} \; removes exactly one file as soon as stat() reports a link count greater than 1. Removing that name decreases the link count of the remaining links to the same inode by one, so once a file is down to its last link, its count is no longer greater than 1 by the time find stat()s it, and that last name is kept.

Note that if any file has hardlinked copies outside inputdir, this command will remove all copies within the inputdir hierarchy!
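
You can watch the mechanism at work in a scratch directory. A minimal demo with made-up file names:

mkdir demo && cd demo
date > a
ln a b
ln a c
ls -li                                    # one inode, three names, link count 3
find . -type f -links +1 -exec rm {} \;
ls -li                                    # exactly one name survives (whichever find visited last), link count now 1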

Solution 4

rmlint will find and remove duplicates, including hardlinks. At the moment it has no options to remove only hardlinks. Removal is done via an autogenerated shell script, so you can review that script prior to deletion.
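
The usual workflow looks roughly like this (a sketch, assuming rmlint's default output script name rmlint.sh):

rmlint rsnapshots        # scan the tree and generate a removal script plus a report
less rmlint.sh           # review exactly what would be deleted
sh rmlint.sh             # run the removals once you are happy with the script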

In general, be careful when using duplicate file detectors in hardlink mode (e.g. fdupes -H), since they can sometimes mistakenly identify a file as its own duplicate (a problem sometimes described as "path doubles").

Solution 5

I think you are mistaken in assuming that deleting all the "other" links to a file will save space. The only space you will save is a directory entry, and even that is questionable.

All hard links to a file are equal. There are no "duplicates". A file on Linux is really identified by which filesystem it is on and by its inode number within that filesystem.

So when you create a file, you create an inode, which is where the data blocks actually live, and you create a link to that inode in some directory. That link just points at the inode. If you then hard link it from another place, you just create a second directory entry pointing to that same inode.

If you run ls -i on a file, you will see its inode number. If you want to find other hard links to that same inode, simply run:

find /TOP-OF-FILESYSTEM -type f -inum INODE-NUMBER

Where TOP-OF-FILESYSTEM is the mount point of that filesystem and INODE-NUMBER is the inode number of the file in question. Note that "-type f" is not mandatory; it just speeds up the search since you are only looking for regular files.

Note that running ls -il on a file also shows its inode number (it is the first column of the output).

You can test all of this by going to a scratch directory and creating a file, then creating another link to it:

cd ~/tmp
date > temp1
ln temp1 temp2
ls -l temp*
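
To tie this back to the find command above, you can then look the file up by inode and see both names (a small follow-on sketch using the same scratch files):

ls -i temp1                                                     # first column is the inode number
find ~/tmp -type f -inum "$(ls -i temp1 | awk '{print $1}')"    # prints both temp1 and temp2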

Comments

  • n.st
    n.st almost 2 years

    I've got a directory tree created by rsnapshot, which contains multiple snapshots of the same directory structure with all identical files replaced by hardlinks.

    I would like to delete all those hardlink duplicates and keep only a single copy of every file (so I can later move all files into a sorted archive without having to touch identical files twice).

    Is there a tool that does that?
    So far I've only found tools that find duplicates and create hardlinks to replace them…
    I guess I could list all files and their inode numbers and implement the deduplicating and deleting myself, but I don't want to reinvent the wheel here.

    • Stéphane Chazelas
      Stéphane Chazelas about 8 years
find . ! -type d -links +1 finds non-directory files that have more than one hard link.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' about 8 years
    This is all true but it doesn't answer the question. I think you missed the point: this isn't about disk space, it's about processing the files without processing the same file twice under different names.
  • n.st
    n.st about 8 years
Thanks for the explanation, but I'm already familiar with how hardlinks work and know that they don't consume any noteworthy amount of disk space. Like I said, I just wanted to prune my old rsnapshot directory to keep only a single link to each inode, so I won't end up looking at the same file twice when I go through the data and sort it into my new archive.
  • Lee-Man
    Lee-Man about 8 years
    Okay, that was not very clear. :) But most backup programs (even tar) understand hard links and don't waste space. But I'm glad you solved your problem.
  • Alessio
    Alessio about 8 years
    find rsnapshots -type f -exec stat -f '%i%t%R' {} + | sort -k1,1 -u | cut -f2- will give you a sorted list of ALL filenames under the rsnapshots directory, with duplicate inodes removed. You can feed that in to archiving programs (e.g. tar). BTW, many archiving or backup programs (like tar, or rsync with the -H option) already know how to handle hardlinks (i.e. storing only the hardlink, rather than another copy of the file), so this isn't even necessary for them.
  • Alessio
    Alessio about 8 years
NOTE: the FreeBSD version of cut doesn't (yet?) support NUL-separated input, so the find pipeline above is only safe for filenames that don't contain newlines.
  • Admin
    Admin about 2 years
    Either I don’t understand your answer, or you don’t understand the question. The OP wants to “keep only a single copy of every file” — with your answer, they would end up with N copies of every file that was previously linked.
  • Admin
    Admin about 2 years
Sorry, I was under the impression that copying just one snapshot folder would solve the issue. I guessed they wanted a snapshot that was unlinked.