Efficiently delete large directory containing thousands of files

Solution 1

Using rsync is surprisingly fast and simple.

mkdir empty_dir
rsync -a --delete empty_dir/    yourdirectory/
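
For reference, here is the same trick as a minimal end-to-end sketch (yourdirectory is a placeholder; the trailing slash on the source matters, because rsync treats empty_dir/ as "the contents of empty_dir"):

# create a scratch empty directory to mirror from
mkdir empty_dir

# make yourdirectory/ identical to empty_dir/, i.e. delete everything inside it
rsync -a --delete empty_dir/ yourdirectory/

# clean up: both directories should now be empty
rmdir empty_dir yourdirectory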

@sarath's answer mentioned another fast choice: Perl! Its benchmarks show it to be even faster than rsync -a --delete.

cd yourdirectory
perl -e 'for(<*>){((stat)[9]<(unlink))}'

or, without the stat call (whether it is needed is debatable: some say it may be faster with it, others that it is faster without it):

cd yourdirectory
perl -e 'for(<*>){unlink}'
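
Note that both one-liners only unlink plain files in the top level of the directory; they do not recurse into subdirectories. If you want a recursive delete from Perl, one option is the remove_tree function from the core File::Path module (a sketch, assuming Perl 5.10+, and not benchmarked against the one-liners above):

# run from the parent of yourdirectory; removes it and everything beneath it
perl -MFile::Path=remove_tree -e 'remove_tree(shift)' yourdirectory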

Sources:

  1. https://stackoverflow.com/questions/1795370/unix-fast-remove-directory-for-cleaning-up-daily-builds
  2. http://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux
  3. https://www.quora.com/Linux-why-stat+unlink-can-be-faster-than-a-single-unlink/answer/Kent-Fredric?srid=O9EW&share=1

Solution 2

Someone on Twitter suggested using -delete instead of -exec rm -f {} \;

This improved the efficiency of the command, though it still uses recursion to go through everything.
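
Spelled out against the command from the question, the suggestion looks something like this (assuming GNU find, or another implementation that supports -delete; note that -delete implies -depth):

# delete matching regular files without forking one rm per file
find /path/to/folder -name "filenamestart*" -type f -delete

# or, to empty the directory entirely while keeping the directory itself (GNU find's -mindepth)
find /path/to/folder -mindepth 1 -delete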

Solution 3

A clever trick:

rsync -a --delete empty/ your_folder/

It's super CPU intensive, but really really fast. See https://web.archive.org/web/20130929001850/http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html
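
If the machine needs to stay responsive while this runs, one option (a sketch, assuming the util-linux ionice utility is available; how much it helps depends on the I/O scheduler in use) is to run the deletion at idle I/O and low CPU priority:

# run the rsync delete with idle I/O priority and minimal CPU priority
mkdir empty
ionice -c3 nice -n 19 rsync -a --delete empty/ your_folder/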

Solution 4

What about something like:

find /path/to/folder -name "filenamestart*" -type f -print0 | xargs -0rn 20 rm -f

You can limit the number of files deleted at once by changing the argument to the -n parameter. File names containing blanks are handled correctly as well.
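
A variation on the same pipeline, assuming GNU xargs, that keeps the small batches but runs several rm processes in parallel (whether this actually helps depends on the filesystem and the underlying disk):

# -n 100 limits each batch to 100 files, -P 4 runs up to 4 rm processes at a time
find /path/to/folder -name "filenamestart*" -type f -print0 | xargs -0 -n 100 -P 4 rm -f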

Solution 5

Expanding on one of the comments, I do not think you're doing what you think you're doing.

First I created a huge amount of files, to simulate your situation:

$ mkdir foo
$ cd foo/
$ for X in $(seq 1 1000); do touch {1..1000}_$X; done

Then I tried what I expected to fail, and what it sounds like you're doing in the question:

$ rm -r foo/*
bash: /bin/rm: Argument list too long

But this does work:

$ rm -r foo/
$ ls foo
ls: cannot access foo: No such file or directory
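
The "Argument list too long" error is the kernel refusing the oversized argument list at execve() time, not a limit inside rm itself. You can inspect the limit (the exact value is system-dependent, as noted in the comments below):

# maximum combined size, in bytes, of arguments plus environment for a new process
getconf ARG_MAX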

Comments

  • Joshua
    Joshua over 1 year

    We have an issue with a folder becoming unwieldy with hundreds of thousands of tiny files.

    There are so many files that performing rm -rf returns an error; instead, what we need to do is something like:

    find /path/to/folder -name "filenamestart*" -type f -exec rm -f {} \;

    This works but is very slow and constantly fails from running out of memory.

    Is there a better way to do this? Ideally I would like to remove the entire directory without caring about the contents inside it.

    • Joshua
      Joshua about 12 years
      From memory that is what I was doing, I think because it recurses in to build out the list of files to delete before it deletes them?
    • bbaja42
      bbaja42 about 12 years
      Instead of deleting it manually, I suggest having the folder on a separate partition and simply unmount && format && remount.
    • jw013
      jw013 about 12 years
      Just out of curiosity - how many files does it take to break rm -rf?
    • jw013
      jw013 about 12 years
      You should probably rename the question to something more accurate, like "Efficiently delete large directory containing thousands of files." In order to delete a directory and its contents, recursion is necessary by definition. You could manually unlink just the directory inode itself (probably requires root privileges), unmount the file system, and run fsck on it to reclaim the unused disk blocks, but that approach seems risky and may not be any faster. In addition, the file system check might involve recursively traversing the file system tree anyways.
    • frostschutz
      frostschutz almost 11 years
      Once I had a ccache file tree so huge, and rm was taking so long (and making the entire system sluggish), it was considerably faster to copy all other files off the filesystem, format, and copy them back. Ever since then I give such massive small file trees their own dedicated filesystem, so you can mkfs directly instead of rm.
    • evilsoup
      evilsoup almost 11 years
      @jw013 see this question on SO -- it varies from system to system (and it's a bash limitation rather than an rm limitation), you can find out what your limit is with echo "$(getconf ARG_MAX)/4-1" | bc (mine comes to 524287 arguments, which I've tested and found to be correct).
    • 200_success
      200_success almost 11 years
      I find it implausible that find would fail due to running out of memory, since it executes rm immediately for each matching file, rather than building up a list. (Even if your command ended with + rather than \;, it would run rm in reasonably sized batches.) You would have to have a ridiculously deep directory structure to exhaust memory; the breadth shouldn't matter much.
    • Marki555
      Marki555 almost 9 years
      The reason it is always quite slow with millions of files is that the filesystem must update its directory metadata and linked lists after each file is removed. It would be much faster if you could tell the filesystem that you don't need the entire directory, so it would throw out entire metadata at once.
    • SDsolar
      SDsolar almost 7 years
      Use the perl script in one of the answers, then rm to get the rest of it. WAY fast.
    • RonJohn
      RonJohn about 6 years
      Note that at some point you're going to run into the physical limit of disk speed. Both rsync -a --delete and find ... -type f -delete run at the same speed for me on an old RHEL 5.10 system for that reason.
  • Joshua
    Joshua about 12 years
    ls won't work because of the amount of files in the folder. This is why I had to use find, thanks though.
  • enzotib
    enzotib about 12 years
    This is non-standard. GNU find has -delete; other find implementations may too.
  • jw013
    jw013 about 12 years
    -delete should always be preferred to -exec rm when available, for reasons of safety and efficiency.
  • Useless
    Useless about 12 years
    You probably don't need the -n 20 bit, since xargs should limit itself to acceptable argument-list sizes anyway.
  • monkeyhouse
    monkeyhouse about 12 years
    Yes, you are right. Here is a note from man xargs: (...) max-chars characters per command line (...). The largest allowed value is system-dependent, and is calculated as the argument length limit for exec. So the -n option is for cases where xargs cannot determine the CLI buffer size, or where the executed command has some limits.
  • Admin
    Admin almost 11 years
    shred will not work with many modern filesystems.
  • xenoterracide
    xenoterracide over 10 years
    if you're going to use exec you almost certainly don't want to use -ls; instead do find . -type f -exec rm '{}' +. The + is faster because it gives as many arguments to rm as it can handle at once.
  • xenoterracide
    xenoterracide over 10 years
    using + instead of \; would make this faster as it passes more arguments to rm at once, less forking
  • derobert
    derobert over 10 years
    I think you should go ahead and edit this into its own answer… it's really too long for a comment. Also, it sound like your filesystem has fairly expensive deletes, curious which one it is? You can run that find … -delete through nice or ionice, that may help. So might changing some mount options to less-crash-safe settings. (And, of course, depending on what else is on the filesystem, the quickest way to delete everything is often mkfs.)
  • maxschlepzig
    maxschlepzig over 10 years
    Does not work on filenames that contain newlines.
  • erik
    erik about 10 years
    This is the only solution that worked: Run rm -Rf bigdirectory several times. I had a directory with thousands of millions of subdirectories and files. I couldn’t even run ls or find or rsync in that directory, because it ran out of memory. The command rm -Rf quit many times (out of memory) only deleting part of the billions of files. But after many retries it finally did the job. Seems to be the only solution if running out of memory is the problem.
  • Score_Under
    Score_Under almost 10 years
    Load average is not always CPU, it's just a measure of the number of blocked processes over time. Processes can block on disk I/O, which is likely what is happening here.
  • John Powell
    John Powell almost 10 years
    Thanks, very useful. I use rsync all the time, I had no idea you could use it to delete like this. Vastly quicker than rm -rf
  • Marki555
    Marki555 almost 9 years
    Using find -exec executes the rm command for every file separately, that's why it is so slow.
  • Marki555
    Marki555 almost 9 years
    rsync can be faster than plain rm, because it guarantees the deletes in the correct order, so less btree recomputation is needed. See this answer serverfault.com/a/328305/105902
  • Marki555
    Marki555 almost 9 years
    Also note that load average does not account for number of logical CPUs. So loadavg 1 for single-core machine is the same as loadavg 64 on 64-core system - meaning each CPU is busy 100% of time.
  • Marki555
    Marki555 almost 9 years
    @camh that's true. But removing files in sorted order is faster than in unsorted (because of recalculating the btree of the directory after each deletion). See this answer for an example serverfault.com/a/328305/105902
  • Marki555
    Marki555 almost 9 years
    @maxschlepzig for such files you can use find . -print0 | xargs -0 rm, which will use the NULL char as filename separator.
  • Tal Ben Shalom
    Tal Ben Shalom over 8 years
    Can anyone modify the perl expression to recursively delete all directories and files inside a directory_to_be_deleted ?
  • Drasill
    Drasill over 8 years
    Notes: add the -P option to rsync for some more display; also, be careful about the syntax, as the trailing slashes are mandatory. Finally, you can run the rsync command with the -n option first to do a dry run.
  • Franck Dernoncourt
    Franck Dernoncourt over 8 years
    If you're looking at iotop while removing the files, you might find this interesting: iotop showing 1.5 MB/s of disk write, but all programs have 0.00 B/s
  • Hastur
    Hastur over 8 years
    @Marki555: in the Edit of the question it is reported 60 seconds for rsync -a --delete vs 43 for lsdent. The ratio 10x was for time ls -1 | wc -l vs time ./dentls bigfolder >out.txt (that is a partially fair comparison because of > file vs wc -l).
  • Joel Davey
    Joel Davey over 7 years
    @mtk use the -P option to see what's going on, it does seem to do nothing for a while , when it builds the file list.
  • Kent Fredric
    Kent Fredric over 7 years
    Warning: Attached Perl code is possibly suboptimal, as some of the operations used don't have any reasonable justification. The cited article doesn't know why either, and the "stat" call demonstrably slows things down a small amount under testing: quora.com/…
  • SDsolar
    SDsolar almost 7 years
    Yay for Sarah. That perl one is very surprising how fast it is. I then clean it up with a final rm -rf <target> to get all the directories. PLUS, I then have to check the trash from previous efforts. Thank you for this. Way beyond what I could have written. I do not understand the [9] part. And I had to catch that you used real apostrophes instead of reverse ticks. And it works. Thank you from August 2017 - Ubuntu 16.04 LTS
  • Terry
    Terry over 6 years
    This is especially helpful when you are deleting an entire kernel source code directory. With rm -rf *, it would take more than 10 minutes (I actually never reached the end), but with the rsync example it was a few seconds of total time for an entire Linux kernel source code directory. Good job on the answer.
  • RonJohn
    RonJohn about 6 years
    GNU is the de facto standard.
  • Svartalf
    Svartalf over 5 years
    The problem there is that NONE of the commands over there actually DO the desired traversal operation for deletion. The code they give? DOES NOT WORK as described by Marki555.
  • jtgd
    jtgd over 5 years
    Why not ionice -c3 find <dir> -type f -delete
  • EvgenyKolyakov
    EvgenyKolyakov over 4 years
    You can even shorten it to perl -e 'for(</path/to/your/dir/*>){((stat)[9]<(unlink))}'
  • Paul_Pedant
    Paul_Pedant over 4 years
    The problem is that * does a shell expansion, which means: (a) it reads the entire directory, and then (b) sorts all the filenames, even before the find is invoked. Using ls -1 -U reads the directory in serial order. You can head -n 10000 and get a list to send to xargs rm. And because those names are all serial in the first part of the directory, they get deleted efficiently too. Just put that in a loop until no files are left, and it works pretty well.
  • Joshua Pinter
    Joshua Pinter over 4 years
    Thanks for the reasoning @Paul_Pedant!
  • codenamezero
    codenamezero over 4 years
    That perl command doesn't work
  • dannysauer
    dannysauer over 4 years
    With GNU find, this is where -exec rm {} \+ comes in handy (specifically the \+ in place of \;), as it works like a built-in xargs without the minimal pipe and fork overhead. Still slower than other options, though.
  • schily
    schily over 4 years
    @dannysauer execplus has been invented in 1988 by David Korn at AT&T and GNU find was the last find implementation to add support - more than 25 years later. BTW: the speed difference between the standard execplus and the nonstandard -delete is minimal.
  • dannysauer
    dannysauer over 4 years
    @schily, that's interesting, and I'm a huge fan of Korn's work. However, the answer we're commenting on suggests that testing was happening on Linux. "GNU find" was specified to distinguish from other possible minimal Linux implementations, like busybox. :)
  • Dave
    Dave over 3 years
    Just a warning - adding -delete to gnu find implicitly enables -depth, which takes you back to the problem of running out of memory during the scan.
  • Rehmat
    Rehmat over 3 years
    I started a tar command last night and it is still stuck on a PHP sessions folder. It would be better to remove the sessions folder. But I am using OpenVZ, where the rm -rf command takes a lot of CPU and gets terminated for abuse. Using rsync is lightweight and does not go above 1% CPU usage. The IUsed inode count from df -i shows 24710900.
  • Andrey
    Andrey about 3 years
    stat() call in the Perl one-liner seems to be useless (its return value is compared to return value of unlink() which is pointless), so even more optimal version is perl -e 'for(<*>){unlink}'
  • Stephen Kitt
    Stephen Kitt about 3 years
    @Andrey re your suggested edit, for some people at least combining stat and unlink results in faster deletion; have you had a chance to benchmark the two approaches?
  • Andrey
    Andrey about 3 years
    @StephenKitt, no, I haven't had a chance to benchmark it yet, but I'll give it a try. If I read it correctly, Kent in his post found only one case when using stat+unlink is faster than solely unlink.
  • Rodrigo
    Rodrigo almost 3 years
    Woooohooo the perl line deleted ~500K files in 15 seconds!! That one goes directly to my toolbox
  • Philippe Remy
    Philippe Remy over 2 years
    Not the fastest option: yonglhuang.com/rm-file
  • Freedo
    Freedo over 2 years
    On Ubuntu 18.04, perl just seems to run and do nothing
  • kaveh eyni
    kaveh eyni over 2 years
    This doesn't work; that command is not for deleting files. Just use this instead: rm -rf path/to/directory
  • Asad Aizaz
    Asad Aizaz over 2 years
    On Ubuntu 20.04, that perl command does nothing. Does anybody have a recursive perl variant? And is there any way to get a progress bar for rsync? I tried -P and --info=progress2 but got no progress bar.
  • Admin
    Admin almost 2 years
    rsync method also helped me with Value too large for defined data type error