Most efficient way to batch delete S3 Files

Solution 1

AWS supports bulk deletion of up to 1000 objects per request using the S3 REST API and its various wrappers. This method assumes you know the S3 object keys you want to remove (that is, it's not designed to handle something like a retention policy, files that are over a certain size, etc).

The S3 REST API can specify up to 1000 files to be deleted in a single request, which is much quicker than making individual requests. Remember, each request is an HTTP (thus TCP) request, so each request carries overhead. You just need to know the objects' keys and create an HTTP request (or use a wrapper in your language of choice). AWS provides great information on this feature and its usage. Just choose the method you're most comfortable with!
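
For example, here's a minimal sketch of a single bulk-delete request through the AWS CLI wrapper for this API (the bucket name and keys are placeholders):

# Build a JSON payload listing the keys to remove (up to 1000 per request).
cat > delete.json <<'EOF'
{
  "Objects": [
    { "Key": "photos/img-0001.jpg" },
    { "Key": "photos/img-0002.jpg" }
  ],
  "Quiet": true
}
EOF

# Send one request that deletes every listed key in a single round trip.
aws s3api delete-objects --bucket MY_BUCKET_NAME --delete file://delete.json

With "Quiet": true, S3 only reports the keys that failed to delete, which keeps the response small for large batches.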

I'm assuming your use case involves end users specifying a number of specific files to delete at once, rather than initiating a task such as "purge all objects that refer to picture files" or "purge all files older than a certain date" (which I believe is easy to configure separately in S3).

If so, you'll know the keys that you need to delete. It also means the user will likely want more real-time feedback about whether their files were deleted successfully or not. Lookups by exact key are supposed to be very quick, since S3 was designed to scale efficiently despite handling an extremely large amount of data.

If not, you can look into asynchronous API calls. You can read a bit about how they'd work in general from this blog post, or search for how to do it in the language of your choice. This would allow the deletion request to take up its own thread, and the rest of the code can execute without making a user wait. Or, you could offload the request to a queue... But both of these options needlessly complicate either your code (asynchronous code can be annoying) or your environment (you'd need a service/daemon/container/server to handle the queue), so I'd avoid this scenario if possible.
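
Purely as an illustration of the "don't make the user wait" idea (not a real queue), you could push the bulk delete into a background process from the shell, reusing the delete.json payload sketched above:

# Run the bulk delete in the background and capture the result for later inspection.
nohup aws s3api delete-objects --bucket MY_BUCKET_NAME --delete file://delete.json \
    > delete-result.log 2>&1 &

A proper asynchronous call or queue worker gives you error handling and retries, which this one-liner does not.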

Edit: I don't have the reputation to post more than 2 links, but you can see Amazon's comments on request rate and performance here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html And the S3 FAQ notes that bulk deletion is the way to go if possible.

Solution 2

The excruciatingly slow option is s3 rm --recursive if you actually like waiting.

Running parallel s3 rm --recursive with differing --include patterns is slightly faster but a lot of time is still spent waiting, as each process individually fetches the entire key list in order to locally perform the --include pattern matching.

Enter bulk deletion.

I found I was able to get the most speed by deleting 1000 keys at a time using aws s3api delete-objects.

Here's an example:

cat file-of-keys | xargs -P8 -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "{Key=%s}," "$@")],Quiet=true"' _
  • The -P8 option on xargs controls the parallelism. It's eight in this case, meaning 8 instances of 1000 deletions at a time.
  • The -n1000 option tells xargs to bundle 1000 keys for each aws s3api delete-objects call.
  • Removing ,Quiet=true or changing it to false will spew out server responses.
  • Note: There's an easily missed _ at the end of that command line. @VladNikiforov posted an excellent commentary on what it's for in the comments, so I'm just going to link to that.

But how do you get file-of-keys?

If you already have your list of keys, good for you. Job complete.

If not, here's one way I guess:

aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | sed -nre "s|[0-9-]+ [0-9:]+ +[0-9]+ |SOME_SUB_DIR|p" >file-of-keys

Solution 3

A neat trick is using lifecycle rules to handle the delete for you. You can create a rule to expire the prefix or objects that you want, and Amazon will just take care of the deletion.

https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
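
As a sketch, such a rule can also be applied from the CLI (bucket name, rule ID, and prefix below are placeholders; note that this call replaces any existing lifecycle configuration on the bucket):

# Expire everything under the given prefix one day after creation;
# S3 then performs the deletes on its own schedule.
aws s3api put-bucket-lifecycle-configuration --bucket MY_BUCKET_NAME --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "purge-old-uploads",
      "Filter": { "Prefix": "path-to-clear/" },
      "Status": "Enabled",
      "Expiration": { "Days": 1 }
    }
  ]
}'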

Solution 4

I was frustrated by the performance of the web console for this task. I found that the AWS CLI command does this well. For example:

aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files

For a large file hierarchy, this may take a considerable amount of time. You can set this running in a tmux or screen session and check back later.
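
For example, one way to detach it (the session name is arbitrary):

# Run the recursive delete in a detached tmux session; reattach later with: tmux attach -t s3-purge
tmux new-session -d -s s3-purge \
    'aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files'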

Solution 5

The aws s3 sync command has already been mentioned above, but without an example or a word about the --delete option.

I found it to be the fastest way to delete the contents of a folder in the S3 bucket my_bucket:

aws s3 sync --delete "local-empty-dir/" "s3://my_bucket/path-to-clear"
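
Note that the local directory has to exist and be empty before the sync, so a full run looks something like this (the directory name is just a placeholder):

# Create an empty local directory, then sync against it with --delete to wipe the remote prefix.
mkdir -p local-empty-dir
aws s3 sync --delete "local-empty-dir/" "s3://my_bucket/path-to-clear"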

Comments

  • SudoKill
    SudoKill almost 2 years

    I'd like to be able to batch delete thousands or tens of thousands of files at a time on S3. Each file would be anywhere from 1MB to 50MB. Naturally, I don't want the user (or my server) to be waiting while the files are in the process of being deleted. Hence, the questions:

    1. How does S3 handle file deletion, especially when deleting large numbers of files?
    2. Is there an efficient way to do this and make AWS do most of the work? By efficient, I mean by making the least number of requests to S3 and taking the least amount of time using the least amount of resources on my servers.
  • Brandon
    Brandon over 6 years
    It looks like the aws s3 rm --recursive command deletes files individually. Although it's faster than the web console, when deleting lots of files it could be much faster if it deleted in bulk.
  • Vlad Nikiforov
    Vlad Nikiforov almost 6 years
    You probably should also have stressed the importance of the _ at the end :) I missed it, and then it took me quite a while to understand why the first element gets skipped. The point is that bash -c passes all arguments as positional parameters, starting with $0, while "$@" only processes parameters starting with $1. So the underscore dummy is needed to fill the position of $0.
  • antak
    antak almost 6 years
    @VladNikiforov Cheers, edited.
  • joelittlejohn
    joelittlejohn over 5 years
    One problem I've found with this approach (either from antak or Vlad) is that it's not easily resumable if there's an error. If you are deleting a lot of keys (10M in my case) you may have a network error, or throttling error, that breaks this. So to improve this, I've used split -l 1000 to split my keys file into 1000-key batches. Now for each file I can issue the delete command and then delete the file. If anything goes wrong, I can continue. (A sketch of this split-based workflow appears after the comments below.)
  • Hayden
    Hayden over 4 years
    If you just want a list of the keys, I would think aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | awk '{print $4}' would be simpler, and you can add a | grep to filter that down from there.
  • Will
    Will over 4 years
    Be careful, though, as this can be very expensive if you have a lot of objects, stackoverflow.com/questions/54255990/…
  • Nathan Loyer
    Nathan Loyer over 3 years
    Man, doing this with millions of objects really sucks, but thanks to all of you for the pointers. I used the split command to split into 10k-key files, then used SEK's command to run them with some parallelism. I'm also deleting the split files when completed to offer some checkpointing. I found that 10 threads got me warnings from AWS to slow down. I'm going with 4 at the moment and it's going well.
  • imdibiji
    imdibiji about 3 years
    For aws cli v2, disabling the pager helps when running the s3api delete-objects command: export AWS_PAGER=""
  • sonlexqt
    sonlexqt over 2 years
    I've tried different ways, and it looks like this option works best for me! Emptied my bucket with ~100k files and ~50GB in size in minutes.
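
Following up on joelittlejohn's resumability tip above, here's a rough sketch of that split-based workflow (bucket name, file names, and batch size are placeholders, and it assumes the keys contain no whitespace):

# Split the key list into 1000-key batches, delete one batch per request, and
# remove each batch file on success so an interrupted run can simply be restarted.
split -l 1000 file-of-keys batch-
for f in batch-*; do
    aws s3api delete-objects --bucket MY_BUCKET_NAME \
        --delete "Objects=[$(printf '{Key=%s},' $(cat "$f"))],Quiet=true" \
        && rm "$f"
done

If you add parallelism on top of this, keep Nathan Loyer's throttling note above in mind.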