Efficiently delete large directory containing thousands of files

Solution 1

Using rsync is surprisingly fast and simple.

mkdir empty_dir
rsync -a --delete empty_dir/    yourdirectory/
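
For reference, here is the same trick as a minimal end-to-end sketch (yourdirectory is a placeholder; the trailing slash on the source matters, because rsync treats empty_dir/ as "the contents of empty_dir"):

# create a scratch empty directory to mirror from
mkdir empty_dir

# make yourdirectory/ identical to empty_dir/, i.e. delete everything inside it
rsync -a --delete empty_dir/ yourdirectory/

# clean up: both directories should now be empty
rmdir empty_dir yourdirectory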

@sarath's answer mentioned another fast choice: Perl! Its benchmarks show it to be even faster than rsync -a --delete.

cd yourdirectory
perl -e 'for(<*>){((stat)[9]<(unlink))}'

or, without the stat call (whether it is needed is debatable: some say it may be faster with it, others that it is faster without it):

cd yourdirectory
perl -e 'for(<*>){unlink}'
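
Note that both one-liners only unlink plain files in the top level of the directory; they do not recurse into subdirectories. If you want a recursive delete from Perl, one option is the remove_tree function from the core File::Path module (a sketch, assuming Perl 5.10+, and not benchmarked against the one-liners above):

# run from the parent of yourdirectory; removes it and everything beneath it
perl -MFile::Path=remove_tree -e 'remove_tree(shift)' yourdirectory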

Sources:

  1. https://stackoverflow.com/questions/1795370/unix-fast-remove-directory-for-cleaning-up-daily-builds
  2. http://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux
  3. https://www.quora.com/Linux-why-stat+unlink-can-be-faster-than-a-single-unlink/answer/Kent-Fredric?srid=O9EW&share=1

Solution 2

Someone on Twitter suggested using -delete instead of -exec rm -f {} \;

This improved the efficiency of the command, though it still uses recursion to go through everything.
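
Spelled out against the command from the question, the suggestion looks something like this (assuming GNU find, or another implementation that supports -delete; note that -delete implies -depth):

# delete matching regular files without forking one rm per file
find /path/to/folder -name "filenamestart*" -type f -delete

# or, to empty the directory entirely while keeping the directory itself (GNU find's -mindepth)
find /path/to/folder -mindepth 1 -delete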

Solution 3

A clever trick:

rsync -a --delete empty/ your_folder/

It's super CPU intensive, but really really fast. See https://web.archive.org/web/20130929001850/http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html
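
If the machine needs to stay responsive while this runs, one option (a sketch, assuming the util-linux ionice utility is available; how much it helps depends on the I/O scheduler in use) is to run the deletion at idle I/O and low CPU priority:

# run the rsync delete with idle I/O priority and minimal CPU priority
mkdir empty
ionice -c3 nice -n 19 rsync -a --delete empty/ your_folder/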

Solution 4

What about something like:

find /path/to/folder -name "filenamestart*" -type f -print0 | xargs -0rn 20 rm -f

You can limit the number of files deleted at once by changing the argument to the -n parameter. File names containing blanks are handled correctly as well.
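
A variation on the same pipeline, assuming GNU xargs, that keeps the small batches but runs several rm processes in parallel (whether this actually helps depends on the filesystem and the underlying disk):

# -n 100 limits each batch to 100 files, -P 4 runs up to 4 rm processes at a time
find /path/to/folder -name "filenamestart*" -type f -print0 | xargs -0 -n 100 -P 4 rm -f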

Solution 5

Expanding on one of the comments, I do not think you're doing what you think you're doing.

First I created a huge amount of files, to simulate your situation:

$ mkdir foo
$ cd foo/
$ for X in $(seq 1 1000); do touch {1..1000}_$X; done

Then I tried what I expected to fail, and what it sounds like you're doing in the question:

$ rm -r foo/*
bash: /bin/rm: Argument list too long

But this does work:

$ rm -r foo/
$ ls foo
ls: cannot access foo: No such file or directory
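
The "Argument list too long" error is the kernel refusing the oversized argument list at execve() time, not a limit inside rm itself. You can inspect the limit (the exact value is system-dependent, as noted in the comments below):

# maximum combined size, in bytes, of arguments plus environment for a new process
getconf ARG_MAX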

Comments

  • Joshua
    Joshua over 1 year

    We have an issue with a folder becoming unwieldy with hundreds of thousands of tiny files.

    There are so many files that performing rm -rf returns an error; instead, what we need to do is something like:

    find /path/to/folder -name "filenamestart*" -type f -exec rm -f {} \;

    This works but is very slow and constantly fails from running out of memory.

    Is there a better way to do this? Ideally I would like to remove the entire directory without caring about the contents inside it.

    • Joshua
      Joshua about 12 years
      From memory that is what I was doing, I think because it recurses in to build out the list of files to delete before it deletes them?
    • bbaja42
      bbaja42 about 12 years
      Instead of deleting it manually, I suggest having the folder on a separate partition and simply unmount && format && remount.
    • jw013
      jw013 about 12 years
      Just out of curiosity - how many files does it take to break rm -rf?
    • jw013
      jw013 about 12 years
      You should probably rename the question to something more accurate, like "Efficiently delete large directory containing thousands of files." In order to delete a directory and its contents, recursion is necessary by definition. You could manually unlink just the directory inode itself (probably requires root privileges), unmount the file system, and run fsck on it to reclaim the unused disk blocks, but that approach seems risky and may not be any faster. In addition, the file system check might involve recursively traversing the file system tree anyways.
    • frostschutz
      frostschutz almost 11 years
      Once I had a ccache file tree so huge, and rm was taking so long (and making the entire system sluggish), it was considerably faster to copy all other files off the filesystem, format, and copy them back. Ever since then I give such massive small file trees their own dedicated filesystem, so you can mkfs directly instead of rm.
    • evilsoup
      evilsoup almost 11 years
      @jw013 see this question on SO -- it varies from system to system (and it's a bash limitation rather than an rm limitation), you can find out what your limit is with echo "$(getconf ARG_MAX)/4-1" | bc (mine comes to 524287 arguments, which I've tested and found to be correct).
    • 200_success
      200_success almost 11 years
      I find it implausible that find would fail due to running out of memory, since it executes rm immediately for each matching file, rather than building up a list. (Even if your command ended with + rather than \;, it would run rm in reasonably sized batches.) You would have to have a ridiculously deep directory structure to exhaust memory; the breadth shouldn't matter much.
    • Marki555
      Marki555 almost 9 years
      The reason it is always quite slow with millions of files is that the filesystem must update its directory metadata and linked lists after each file is removed. It would be much faster if you could tell the filesystem that you don't need the entire directory, so it would throw out entire metadata at once.
    • SDsolar
      SDsolar almost 7 years
      Use the perl script in one of the answers, then rm to get the rest of it. WAY fast.
    • RonJohn
      RonJohn about 6 years
      Note that at some point you're going to run into the physical limit of disk speed. Both rsync -a --delete and find ... -type f -delete run at the same speed for me on an old RHEL 5.10 system for that reason.
  • Joshua
    Joshua about 12 years
    ls won't work because of the amount of files in the folder. This is why I had to use find, thanks though.
  • enzotib
    enzotib about 12 years
    This is non-standard. GNU find has -delete; other find implementations may too.
  • jw013
    jw013 about 12 years
    -delete should always be preferred to -exec rm when available, for reasons of safety and efficiency.
  • Useless
    Useless about 12 years
    You probably don't need the -n 20 bit, since xargs should limit itself to acceptable argument-list sizes anyway.
  • monkeyhouse
    monkeyhouse about 12 years
    Yes, you are right. Here is a note from man xargs: (...) max-chars characters per command line (...). The largest allowed value is system-dependent, and is calculated as the argument length limit for exec. So the -n option is for cases where xargs cannot determine the CLI buffer size, or where the executed command has some limits.
  • Admin
    Admin almost 11 years
    shred will not work with many modern filesystems.
  • xenoterracide
    xenoterracide over 10 years
    if you're going to use exec you almost certainly don't want to use -ls; instead do find . -type f -exec rm '{}' +. The + is faster because it gives as many arguments to rm as it can handle at once.
  • xenoterracide
    xenoterracide over 10 years
    using + instead of \; would make this faster as it passes more arguments to rm at once, less forking
  • derobert
    derobert over 10 years
    I think you should go ahead and edit this into its own answer… it's really too long for a comment. Also, it sound like your filesystem has fairly expensive deletes, curious which one it is? You can run that find … -delete through nice or ionice, that may help. So might changing some mount options to less-crash-safe settings. (And, of course, depending on what else is on the filesystem, the quickest way to delete everything is often mkfs.)
  • maxschlepzig
    maxschlepzig over 10 years
    Does not work on filenames that contain newlines.
  • erik
    erik about 10 years
    This is the only solution that worked: Run rm -Rf bigdirectory several times. I had a directory with thousands of millions of subdirectories and files. I couldn’t even run ls or find or rsync in that directory, because it ran out of memory. The command rm -Rf quit many times (out of memory) only deleting part of the billions of files. But after many retries it finally did the job. Seems to be the only solution if running out of memory is the problem.
  • Score_Under
    Score_Under almost 10 years
    Load average is not always CPU, it's just a measure of the number of blocked processes over time. Processes can block on disk I/O, which is likely what is happening here.
  • John Powell
    John Powell almost 10 years
    Thanks, very useful. I use rsync all the time, I had no idea you could use it to delete like this. Vastly quicker than rm -rf
  • Marki555
    Marki555 almost 9 years
    Using find -exec executes the rm command for every file separately, that's why it is so slow.
  • Marki555
    Marki555 almost 9 years
    rsync can be faster than plain rm, because it guarantees the deletes in the correct order, so less btree recomputation is needed. See this answer serverfault.com/a/328305/105902
  • Marki555
    Marki555 almost 9 years
    Also note that load average does not account for number of logical CPUs. So loadavg 1 for single-core machine is the same as loadavg 64 on 64-core system - meaning each CPU is busy 100% of time.
  • Marki555
    Marki555 almost 9 years
    @camh that's true. But removing files in sorted order is faster than in unsorted (because of recalculating the btree of the directory after each deletion). See this answer for an example serverfault.com/a/328305/105902
  • Marki555
    Marki555 almost 9 years
    @maxschlepzig for such files you can use find . -print0 | xargs -0 rm, which will use the NULL char as filename separator.
  • Tal Ben Shalom
    Tal Ben Shalom over 8 years
    Can anyone modify the perl expression to recursively delete all directories and files inside a directory_to_be_deleted ?
  • Drasill
    Drasill over 8 years
    Notes: add the -P option to rsync for some more display; also, be careful about the syntax, as the trailing slashes are mandatory. Finally, you can run the rsync command with the -n option first to do a dry run.
  • Franck Dernoncourt
    Franck Dernoncourt over 8 years
    If you're looking at iotop while removing the files, you might find this interesting: iotop showing 1.5 MB/s of disk write, but all programs have 0.00 B/s
  • Hastur
    Hastur over 8 years
    @Marki555: in the Edit of the question it is reported 60 seconds for rsync -a --delete vs 43 for lsdent. The ratio 10x was for time ls -1 | wc -l vs time ./dentls bigfolder >out.txt (that is a partially fair comparison because of > file vs wc -l).
  • Joel Davey
    Joel Davey over 7 years
    @mtk use the -P option to see what's going on, it does seem to do nothing for a while , when it builds the file list.
  • Kent Fredric
    Kent Fredric over 7 years
    Warning: Attached Perl code is possibly suboptimal, as some of the operations used don't have any reasonable justification. The cited article doesn't know why either, and the "stat" call demonstrably slows things down a small amount under testing: quora.com/…
  • SDsolar
    SDsolar almost 7 years
    Yay for Sarah. That perl one is very surprising how fast it is. I then clean it up with a final rm -rf <target> to get all the directories. PLUS, I then have to check the trash from previous efforts. Thank you for this. Way beyond what I could have written. I do not understand the [9] part. And I had to catch that you used real apostrophes instead of reverse ticks. And it works. Thank you from August 2017 - Ubuntu 16.04 LTS
  • Terry
    Terry over 6 years
    This is especially helpful when you are deleting an entire kernel source code directory. With rm -rf *, it would take more than 10 minutes (I actually never reached the end), but with the rsync example it was a few seconds of total time for an entire Linux kernel source code directory. Good job on the answer.
  • RonJohn
    RonJohn about 6 years
    GNU is the de facto standard.
  • Svartalf
    Svartalf over 5 years
    The problem there is that NONE of the commands over there actually DO the desired traversal operation for deletion. The code they give? DOES NOT WORK as described by Marki555.
  • jtgd
    jtgd over 5 years
    Why not ionice -c3 find <dir> -type f -delete
  • EvgenyKolyakov
    EvgenyKolyakov over 4 years
    You can even shorten it to perl -e 'for(</path/to/your/dir/*>){((stat)[9]<(unlink))}'
  • Paul_Pedant
    Paul_Pedant over 4 years
    The problem is that * does a shell expansion, which means: (a) it reads the entire directory, and then (b) sorts all the filenames, even before the find is invoked. Using ls -1 -U reads the directory in serial order. You can head -n 10000 and get a list to send to xargs rm. And because those names are all serial in the first part of the directory, they get deleted efficiently too. Just put that in a loop until no files are left, and it works pretty well.
  • Joshua Pinter
    Joshua Pinter over 4 years
    Thanks for the reasoning @Paul_Pedant!
  • codenamezero
    codenamezero over 4 years
    That perl command doesn't work
  • dannysauer
    dannysauer over 4 years
    With GNU find, this is where -exec rm {} \+ comes in handy (specifically the \+ in place of \;), as it works like a built-in xargs without the minimal pipe and fork overhead. Still slower than other options, though.
  • schily
    schily over 4 years
    @dannysauer execplus has been invented in 1988 by David Korn at AT&T and GNU find was the last find implementation to add support - more than 25 years later. BTW: the speed difference between the standard execplus and the nonstandard -delete is minimal.
  • dannysauer
    dannysauer over 4 years
    @schily, that's interesting, and I'm a huge fan of Korn's work. However, the answer we're commenting on suggests that testing was happening on Linux. "GNU find" was specified to distinguish from other possible minimal Linux implementations, like busybox. :)
  • Dave
    Dave over 3 years
    Just a warning - adding -delete to gnu find implicitly enables -depth, which takes you back to the problem of running out of memory during the scan.
  • Rehmat
    Rehmat over 3 years
    I started a tar command last night and it is still stuck on a PHP sessions folder. It would be better to remove the sessions folder. But I am using OpenVZ, where the rm -rf command takes a lot of CPU and gets terminated for abuse. Using rsync is lightweight and does not go above 1% CPU usage. The IUsed inode count from df -i shows 24710900.
  • Andrey
    Andrey about 3 years
    stat() call in the Perl one-liner seems to be useless (its return value is compared to return value of unlink() which is pointless), so even more optimal version is perl -e 'for(<*>){unlink}'
  • Stephen Kitt
    Stephen Kitt about 3 years
    @Andrey re your suggested edit, for some people at least combining stat and unlink results in faster deletion; have you had a chance to benchmark the two approaches?
  • Andrey
    Andrey about 3 years
    @StephenKitt, no, I haven't had a chance to benchmark it yet, but I'll give it a try. If I read it correctly, Kent in his post found only one case when using stat+unlink is faster than solely unlink.
  • Rodrigo
    Rodrigo almost 3 years
    Woooohooo the perl line deleted ~500K files in 15 seconds!! That one goes directly to my toolbox
  • Philippe Remy
    Philippe Remy over 2 years
    Not the fastest option: yonglhuang.com/rm-file
  • Freedo
    Freedo over 2 years
    On Ubuntu 18.04, perl just seems to run and do nothing
  • kaveh eyni
    kaveh eyni over 2 years
    This doesn't work; that command is not for deleting files. Just use this instead: rm -rf path/to/directory
  • Asad Aizaz
    Asad Aizaz over 2 years
    On Ubuntu 20.04, that perl command does nothing. Does anybody have a recursive perl variant? And is there any way to get a progress bar for rsync? I tried -P and --info=progress2 but got no progress bar.
  • Admin
    Admin almost 2 years
    rsync method also helped me with Value too large for defined data type error