Efficiently delete large directory containing thousands of files
Solution 1
Using rsync is surprisingly fast and simple.
mkdir empty_dir
rsync -a --delete empty_dir/ yourdirectory/
@sarath's answer mentioned another fast choice: Perl!
In benchmarks it is faster than rsync -a --delete.
cd yourdirectory
perl -e 'for(<*>){((stat)[9]<(unlink))}'
or, without the stat (it's debatable whether it is needed; some say it may be faster with it, and others say it's faster without it):
cd yourdirectory
perl -e 'for(<*>){unlink}'
Sources:
- https://stackoverflow.com/questions/1795370/unix-fast-remove-directory-for-cleaning-up-daily-builds
- http://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux
- https://www.quora.com/Linux-why-stat+unlink-can-be-faster-than-a-single-unlink/answer/Kent-Fredric?srid=O9EW&share=1
Solution 2
Someone on Twitter suggested using -delete
instead of -exec rm -f {} \;
This improves the efficiency of the command, though it still uses recursion to go through everything.
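A sketch of that variant (-delete and -mindepth are GNU find extensions, as later comments point out), using a throwaway directory so it is safe to run as-is:

```shell
# Create a throwaway tree, then delete its contents with find -delete,
# avoiding one rm fork per file.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b"
mkdir "$demo/sub" && touch "$demo/sub/c"
# -mindepth 1 deletes the contents but keeps the top directory itself;
# drop it to remove the directory too.
find "$demo" -mindepth 1 -delete
```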
Solution 3
A clever trick:
rsync -a --delete empty/ your_folder/
It's super CPU intensive, but really really fast. See https://web.archive.org/web/20130929001850/http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html
Solution 4
What about something like:
find /path/to/folder -name "filenamestart*" -type f -print0 | xargs -0rn 20 rm -f
You can limit the number of files deleted at once by changing the argument to the -n parameter. File names containing blanks are handled correctly as well.
Solution 5
Expanding on one of the comments, I do not think you're doing what you think you're doing.
First I created a huge amount of files, to simulate your situation:
$ mkdir foo
$ cd foo/
$ for X in $(seq 1 1000);do touch {1..1000}_$X; done
Then I tried what I expected to fail, and what it sounds like you're doing in the question:
$ rm -r foo/*
bash: /bin/rm: Argument list too long
But this does work:
$ rm -r foo/
$ ls foo
ls: cannot access foo: No such file or directory
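The "Argument list too long" failure above comes from the kernel's limit on the combined size of arguments passed to exec(), not from rm itself. As one commenter notes, you can inspect the limit with getconf (the division by 4 is the rough per-argument estimate quoted in the comments):

```shell
# ARG_MAX is the maximum combined size, in bytes, of the argument list
# and environment that exec() will accept.
getconf ARG_MAX
# Rough estimate of the maximum number of arguments.
echo $(( $(getconf ARG_MAX) / 4 - 1 ))
```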
Joshua
Updated on September 18, 2022
Comments
-
Joshua over 1 year
We have an issue with a folder becoming unwieldy with hundreds of thousands of tiny files.
There are so many files that performing rm -rf returns an error; instead what we need to do is something like:
find /path/to/folder -name "filenamestart*" -type f -exec rm -f {} \;
This works but is very slow and constantly fails from running out of memory.
Is there a better way to do this? Ideally I would like to remove the entire directory without caring about the contents inside it.
-
Joshua about 12 years: From memory that is what I was doing. I think it's because it recurses in to build out the list of files to delete before it deletes them?
-
bbaja42 about 12 years: Instead of deleting it manually, I suggest having the folder on a separate partition and simply unmount && format && remount.
-
jw013 about 12 years: Just out of curiosity, how many files does it take to break rm -rf?
jw013 about 12 years: You should probably rename the question to something more accurate, like "Efficiently delete large directory containing thousands of files." In order to delete a directory and its contents, recursion is necessary by definition. You could manually unlink just the directory inode itself (probably requires root privileges), unmount the file system, and run fsck on it to reclaim the unused disk blocks, but that approach seems risky and may not be any faster. In addition, the file system check might involve recursively traversing the file system tree anyways.
frostschutz almost 11 years: Once I had a ccache file tree so huge, and rm was taking so long (and making the entire system sluggish), it was considerably faster to copy all other files off the filesystem, format, and copy them back. Ever since then I give such massive small file trees their own dedicated filesystem, so you can mkfs directly instead of rm.
evilsoup almost 11 years: @jw013 see this question on SO -- it varies from system to system (and it's a bash limitation rather than an rm limitation); you can find out what your limit is with echo "$(getconf ARG_MAX)/4-1" | bc (mine comes to 524287 arguments, which I've tested and found to be correct).
200_success almost 11 years: I find it implausible that find would fail due to running out of memory, since it executes rm immediately for each matching file, rather than building up a list. (Even if your command ended with + rather than \;, it would run rm in reasonably sized batches.) You would have to have a ridiculously deep directory structure to exhaust memory; the breadth shouldn't matter much.
Marki555 almost 9 years: The reason it is always quite slow with millions of files is that the filesystem must update its directory metadata and linked lists after each file is removed. It would be much faster if you could tell the filesystem that you don't need the entire directory, so it would throw out the entire metadata at once.
-
SDsolar almost 7 years: Use the perl script in one of the answers, then rm to get the rest of it. WAY fast.
-
RonJohn about 6 years: Note that at some point you're going to run into the physical limit of disk speed. Both rsync -a --delete and find ... -type f -delete run at the same speed for me on an old RHEL 5.10 system for that reason.
-
Joshua about 12 years: ls won't work because of the amount of files in the folder. This is why I had to use find, thanks though.
enzotib about 12 years: This is non-standard. GNU find has -delete, and other find implementations may too.
jw013 about 12 years: -delete should always be preferred to -exec rm when available, for reasons of safety and efficiency.
Useless about 12 years: You probably don't need the -n 20 bit, since xargs should limit itself to acceptable argument-list sizes anyway.
monkeyhouse about 12 years: Yes, you are right. Here is a note from man xargs: "(...) max-chars characters per command line (...). The largest allowed value is system-dependent, and is calculated as the argument length limit for exec." So the -n option is for cases where xargs cannot determine the CLI buffer size, or where the executed command has some limits.
Admin almost 11 years: shred will not work with many modern filesystems.
xenoterracide over 10 years: if you're going to use exec you almost certainly want to not use -ls and do find . -type f -exec rm '{}' + instead. + is faster because it will give as many arguments to rm as it can handle at once.
xenoterracide over 10 years: using + instead of \; would make this faster, as it passes more arguments to rm at once: less forking.
derobert over 10 years: I think you should go ahead and edit this into its own answer… it's really too long for a comment. Also, it sounds like your filesystem has fairly expensive deletes; curious which one it is? You can run that find … -delete through nice or ionice, that may help. So might changing some mount options to less-crash-safe settings. (And, of course, depending on what else is on the filesystem, the quickest way to delete everything is often mkfs.)
maxschlepzig over 10 years: Does not work on filenames that contain newlines.
-
erik about 10 years: This is the only solution that worked: Run rm -Rf bigdirectory several times. I had a directory with thousands of millions of subdirectories and files. I couldn't even run ls or find or rsync in that directory, because it ran out of memory. The command rm -Rf quit many times (out of memory), only deleting part of the billions of files. But after many retries it finally did the job. Seems to be the only solution if running out of memory is the problem.
Score_Under almost 10 years: Load average is not always CPU, it's just a measure of the number of blocked processes over time. Processes can block on disk I/O, which is likely what is happening here.
-
John Powell almost 10 years: Thanks, very useful. I use rsync all the time, I had no idea you could use it to delete like this. Vastly quicker than rm -rf.
-
Marki555 almost 9 years: Using find -exec executes the rm command for every file separately, that's why it is so slow.
Marki555 almost 9 years: rsync can be faster than plain rm, because it guarantees the deletes are in the correct order, so less btree recomputation is needed. See this answer serverfault.com/a/328305/105902
Marki555 almost 9 years: Also note that load average does not account for the number of logical CPUs. So loadavg 1 for a single-core machine is the same as loadavg 64 on a 64-core system, meaning each CPU is busy 100% of the time.
Marki555 almost 9 years: @camh that's true. But removing files in sorted order is faster than in unsorted (because of recalculating the btree of the directory after each deletion). See this answer for an example serverfault.com/a/328305/105902
-
Marki555 almost 9 years: @maxschlepzig for such files you can use find . -print0 | xargs -0 rm, which will use the NUL char as filename separator.
Tal Ben Shalom over 8 years: Can anyone modify the perl expression to recursively delete all directories and files inside a directory_to_be_deleted?
-
Drasill over 8 years: Notes: add the -P option to rsync for some more display. Also, be careful about the syntax: the trailing slashes are mandatory. Finally, you can run the rsync command a first time with the -n option to launch a dry run.
Franck Dernoncourt over 8 years: If you're looking at iotop while removing the files, you might find this interesting: iotop showing 1.5 MB/s of disk write, but all programs have 0.00 B/s.
Hastur over 8 years: @Marki555: in the Edit of the question it is reported 60 seconds for rsync -a --delete vs 43 for lsdent. The 10x ratio was for time ls -1 | wc -l vs time ./dentls bigfolder >out.txt (that is a partially fair comparison because of > file vs wc -l).
Joel Davey over 7 years: @mtk use the -P option to see what's going on; it does seem to do nothing for a while, when it builds the file list.
-
Kent Fredric over 7 years: Warning: the attached Perl code is possibly suboptimal, as some of the operations used don't have any reasonable justification. The cited article doesn't know why either, and the "stat" call demonstrably slows things down a small amount under testing: quora.com/…
-
SDsolar almost 7 years: Yay for Sarah. That perl one is very surprising how fast it is. I then clean it up with a final rm -rf <target> to get all the directories. PLUS, I then have to check the trash from previous efforts. Thank you for this. Way beyond what I could have written. I do not understand the [9] part. And I had to catch that you used real apostrophes instead of backticks. And it works. Thank you from August 2017 - Ubuntu 16.04 LTS
Terry over 6 years: This is especially helpful when you are deleting an entire kernel source code directory. With rm -rf *, it would take more than 10 minutes (I actually never reached the end), but with the rsync example it was a few seconds of total time for an entire Linux kernel source code directory. Good job on the answer.
RonJohn about 6 years: GNU is the de facto standard.
-
Svartalf over 5 years: The problem there is that NONE of the commands over there actually DO the desired traversal operation for deletion. The code they give? DOES NOT WORK as described by Marki555.
-
jtgd over 5 years: Why not ionice -c3 find <dir> -type f -delete?
EvgenyKolyakov over 4 years: You can even shorten it to perl -e 'for(</path/to/your/dir/*>){((stat)[9]<(unlink))}'
Paul_Pedant over 4 years: The problem is that * does a shell expansion, which means: (a) it reads the entire directory, and then (b) sorts all the filenames, even before the find is invoked. Using ls -1 -U reads the directory in serial order. You can head -n 10000 and get a list to send to xargs rm. And because those names are all serial in the first part of the directory, they get deleted efficiently too. Just put that in a loop until no files are left, and it works pretty well.
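A runnable sketch of the loop described in that comment (the throwaway directory and the batch size are stand-ins for illustration; a real run would use the huge directory and a batch of around 10000):

```shell
# Batched deletion: read directory entries in on-disk order (ls -U skips
# sorting) and remove them in fixed-size chunks until none remain.
dir=$(mktemp -d)                          # stand-in for the huge directory
for i in $(seq 1 50); do touch "$dir/f$i"; done
cd "$dir" || exit 1
while [ -n "$(ls -1 -U | head -n 1)" ]; do
    # NOTE: plain xargs breaks on names containing spaces or newlines;
    # that is usually fine for machine-generated files like build artifacts.
    ls -1 -U | head -n 10 | xargs rm -f
done
```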
-
Joshua Pinter over 4 years: Thanks for the reasoning @Paul_Pedant!
-
codenamezero over 4 years: That perl command doesn't work.
-
dannysauer over 4 years: With GNU find, this is where -exec rm {} \+ comes in handy (specifically the \+ in place of \;), as it works like a built-in xargs without the minimal pipe and fork overhead. Still slower than other options, though.
schily over 4 years: @dannysauer execplus was invented in 1988 by David Korn at AT&T, and GNU find was the last find implementation to add support, more than 25 years later. BTW: the speed difference between the standard execplus and the nonstandard -delete is minimal.
dannysauer over 4 years: @schily, that's interesting, and I'm a huge fan of Korn's work. However, the answer we're commenting on suggests that testing was happening on Linux. "GNU find" was specified to distinguish from other possible minimal Linux implementations, like busybox. :)
-
Dave over 3 years: Just a warning: adding -delete to GNU find implicitly enables -depth, which takes you back to the problem of running out of memory during the scan.
Rehmat over 3 years: I started a tar command last night and it is still stuck on a PHP sessions folder. It is better to remove the sessions folder. But I am using OpenVZ, where the rm -rf command takes a lot of CPU and gets terminated for abuse. Using rsync is lightweight and does not go above 1% CPU usage. IUsed nodes from df -i shows 24710900.
Andrey about 3 years: The stat() call in the Perl one-liner seems to be useless (its return value is compared to the return value of unlink(), which is pointless), so an even more optimal version is perl -e 'for(<*>){unlink}'
Stephen Kitt about 3 years: @Andrey re your suggested edit, for some people at least combining stat and unlink results in faster deletion; have you had a chance to benchmark the two approaches?
Andrey about 3 years: @StephenKitt, no, I haven't had a chance to benchmark it yet, but I'll give it a try. If I read it correctly, Kent in his post found only one case when using stat+unlink is faster than solely unlink.
-
Rodrigo almost 3 years: Woooohooo, the perl line deleted ~500K files in 15 seconds!! That one goes directly to my toolbox.
-
Philippe Remy over 2 years: Not the fastest option: yonglhuang.com/rm-file
-
Freedo over 2 years: On Ubuntu 18.04, perl just seems to run and do nothing.
-
kaveh eyni over 2 years: This doesn't work; this command is not for deleting files. Just use: rm -rf path/to/directory
-
Asad Aizaz over 2 years: On Ubuntu 20.04, that perl command does nothing. Does anybody have a recursive perl variant? And is there any way to get a progress bar for rsync? I tried -P and --info=progress2 but no progress bar.
-
Admin almost 2 years: The rsync method also helped me with the "Value too large for defined data type" error.