Deleting millions of files


Solution 1

Quicker is not necessarily what you want. You may actually want to run the deletion more slowly, so it chews up fewer resources while it's running.

Use nice(1) to lower the priority of a command.

nice find . -name "*.gif" -delete

For I/O-bound processes nice(1) might not be sufficient. The Linux scheduler does take I/O into account, not just CPU, but you may want finer control over I/O priority.

ionice -c 2 -n 7 find . -name "*.gif" -delete

If that doesn't do it, you could also add a sleep to really slow it down.

find . -name "*.gif" -exec sleep 0.01 \; -delete

Solution 2

No.

There is no quicker way, apart from a soft format of the disk. The files are passed to rm in batches (up to the command-line length limit, which can also be tuned via xargs), which is much better than calling rm once per file. So no, there is definitely no faster way.
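To see what that limit looks like in practice, here is a small sketch (the path, pattern, and batch size are illustrative, not taken from the question); xargs can also cap how many names go to each rm invocation:

# Show the kernel's limit on the combined length of command-line arguments:
getconf ARG_MAX

# Illustrative only: cap each rm invocation at 1000 file names,
# trading a few extra rm processes for shorter command lines.
find . -name "*.gif" -print0 | xargs -0 -n 1000 rm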

Using nice (or renice on a running process) helps only partially, because it schedules the CPU, not the disk, and the CPU usage here will be very low. This is a Linux weakness: if one process "eats up" the disk (i.e. works a lot with it), the whole machine can get stuck. A kernel modified for real-time use could be a solution.
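As Solution 1 and the comments suggest, the disk side can be addressed with I/O scheduling classes. A possible variant is the idle class, so the deletion only gets the disk when nothing else wants it (how much effect this has depends on the I/O scheduler in use):

ionice -c 3 find . -name "*.gif" -delete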

What I would do on the server is to manually let other processes do their job, by including pauses so the server can "breathe":

find . -name "*.gif" > files
split -l 100 files files.
for F in files.*; do
    cat "$F" | xargs rm
    sleep 5
done

This will wait 5 seconds after every 100 files. It will take much longer but your customers shouldn't notice any delays.

Solution 3

If the number of files to be deleted vastly outnumbers the files left behind, it may not be the most efficient approach to walk the tree of files to be deleted and do all those filesystem updates. (It is analogous to doing clumsy reference-counted memory management, visiting every object in a large tree to drop its reference, instead of making everything unwanted into garbage in one step and then sweeping through what is reachable to clean up.)

That is to say, clone the parts of the tree that are to be kept to another volume. Re-create a fresh, blank filesystem on the original volume. Copy the retained files back to their original paths. This is vaguely similar to copying garbage collection.
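A minimal sketch of that workflow, assuming the GIFs are the files to drop; the device name, mount points, and filesystem type below are placeholders, not taken from the question, so adapt (and back up) before running anything like this:

# 1. Copy everything you intend to keep to another volume.
rsync -a --exclude='*.gif' /srv/data/ /mnt/spare/

# 2. Unmount the original volume and re-create a blank filesystem on it.
umount /srv/data
mkfs.ext4 /dev/sdX1

# 3. Mount it again and copy the retained files back to their original paths.
mount /dev/sdX1 /srv/data
rsync -a /mnt/spare/ /srv/data/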

There will be some downtime, but it could be better than continuous bad performance and service disruption.

It may be impractical in your system and situation, but it's easy to imagine obvious cases where this is the way to go.

For instance, suppose you wanted to delete all files in a filesystem. What would be the point of recursing and deleting one by one? Just unmount it and do a "mkfs" over top of the partition to make a blank filesystem.

Or suppose you wanted to delete all files except for half a dozen important ones? Get the half a dozen out of there and ... "mkfs" over top.

Eventually there is a break-even point: when enough files have to stay, it becomes cheaper to do the recursive deletion, taking into account other costs like any downtime.

Solution 4

Have you tried:

find . -name "*.gif" -exec rm {} +

The + sign at the end causes find to batch multiple file names into each invocation of rm. Check this question for more details.
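For comparison, the difference between the two -exec terminators looks like this (same pattern as above, shown only to illustrate the batching):

find . -name "*.gif" -exec rm {} \;   # spawns one rm process per matching file
find . -name "*.gif" -exec rm {} +    # batches many file names into each rm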


Comments

  • Corepuncher
    Corepuncher over 1 year

I had a dir fill up with millions of gif images. Too many for the rm command.

    I have been trying the find command like this:

    find . -name "*.gif" -print0 | xargs -0 rm
    

Problem is, it bogs down my machine really badly, and causes timeouts for customers since it's a server.

    Is there any way that is quicker to delete all these files...without locking up the machine?

  • Clint C.
    Clint C. over 10 years
It executes much faster than the -print0 | xargs solution because the rm process is not invoked for every file but for a large set of them, and therefore it causes a lower load.
  • CodeGnome
    CodeGnome over 10 years
    @JohnKugelman You are correct, but it's a GNU extension that isn't always available with the native find command.
  • Tomas
    Tomas over 10 years
OK, interesting, but this is quite a new thing (as is -delete), so it isn't always available.
  • Tomas
    Tomas over 10 years
    However this certainly brings nothing better compared to the OP's solution.
  • MSalters
    MSalters over 10 years
@Joker_vD: Compatibility with a Unix decision from 1970 or so. Windows doesn't do it. There, programs can pass wildcards to FindFirstFile/FindNextFile, so they get the results one at a time.
  • John Kugelman
    John Kugelman over 10 years
    @Joker_vD There are a lot of good things about the shell doing wildcard expansion. It's different from Windows, but don't jump to the conclusion that it's incredibly stupid merely because it's different from what you're used to. If you want to know more, I encourage you to Google it or post a question on the relevant Stack Exchange site. It's a huge derail for this comment area.
  • Christopher Lörken
    Christopher Lörken almost 9 years
@Ola I completely agree. It should be ionice -c 3. Everything else chokes up a production server. Anyone: see the answer of @user2719058 and the comments for the optimal solution in a high-load environment.