How to move files between two S3 buckets with minimum cost?


Solution 1

Millions is a big number - I'll get back to that later.

Regardless of your approach, the underlying mechanism needs to be a direct copy from one bucket to the other - that way (since your buckets are in the same region) you incur no bandwidth charge. Any other approach (e.g. downloading and re-uploading the files) is simply inefficient.

Copying between buckets is accomplished by using 'PUT copy' - that is, a PUT request that includes the 'x-amz-copy-source' header - I believe this is classed as a COPY request. This will copy the file and, by default, the associated metadata. You must include an 'x-amz-acl' header with the correct value if you want to set the ACL at the same time (otherwise, it will default to private). You will be charged for your COPY requests ($0.01/1,000 requests). You can delete the unneeded files after they have been copied (DELETE requests are not charged). (One point I am not quite clear on is whether or not a COPY request also incurs the charge of a GET request, as the object must first be fetched from the source bucket - if it does, the charge will be an additional $0.01/10,000 requests.)
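
For illustration, a single server-side copy plus the follow-up delete can be issued with today's aws s3api commands, which set the 'x-amz-copy-source' and 'x-amz-acl' headers described above under the hood (the bucket names and key here are placeholders, and --acl is only needed if you don't want the default private ACL):

$ aws s3api copy-object \
    --copy-source source-bucket/path/to/file.jpg \
    --bucket destination-bucket \
    --key path/to/file.jpg \
    --acl public-read
$ aws s3api delete-object --bucket source-bucket --key path/to/file.jpg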

The above charges are seemingly unavoidable - for a million objects, the COPY requests alone come to 1,000,000 × $0.01/1,000 = $10 (or about $11 if GETs are billed as well). Since in the end you must actually create the files in the destination bucket, other approaches (e.g. tar-gzipping the files, Amazon Import/Export, etc.) will not get around this cost. Nonetheless, it might be worth your while contacting Amazon if you have more than a couple of million objects to transfer.

Given the above (unavoidable) cost, the next thing to look into is time, which will be a big factor when copying 'millions of files'. All tools that can perform the direct copy between buckets will incur the same charge. Unfortunately, you require one request per file to copy it, one request to delete it, and possibly one request to read its ACL data (if your files have varied ACLs). The best speed will come from whatever can run the most parallel operations.

There are some command line approaches that might be quite viable:

  • s3cmd-modification (that specific pull request) includes parallel cp and mv commands and should be a good option for you.
  • The AWS console can perform the copy directly - I can't speak for how parallel it is though.
  • Tim Kay's aws script can do the copy - but it is not parallel - you will need to script it to run the full copy you want (probably not the best option in this case - although, it is a great script).
  • CloudBerry S3 Explorer, Bucket Explorer, and CloudBuddy should all be able to perform the task, although I don't know how the efficiency of each stacks up. I believe though that the multi-threaded features of most of these require the purchase of the software.
  • Script your own using one of the available SDKs (a minimal sketch of the parallel-copy idea follows this list).
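
As a rough sketch of the 'script your own' option, here is one way to drive parallel server-side copies from the shell with the AWS CLI (bucket names are placeholders; -P controls how many copies run at once; keys containing whitespace would need extra quoting, and you would add an --acl flag and a delete-object step for a true move):

$ aws s3api list-objects-v2 --bucket source-bucket \
    --query 'Contents[].Key' --output text | tr '\t' '\n' | \
  xargs -P 32 -I {} aws s3api copy-object \
    --copy-source "source-bucket/{}" \
    --bucket destination-bucket --key "{}"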

There is some possibility that s3fs might work - it is quite parallel and does support copies within the same bucket - it does NOT support copies between different buckets, but might support moves between different buckets.

I'd start with s3cmd-modification and see if you have any success with it or contact Amazon for a better solution.

Solution 2

Old topic, but this is for anyone investigating the same scenario, along with the time it took me for 20,000+ objects. Running on AWS Linux/CentOS, with each object being an image for the most part, along with some video and various media files.

Using the AWS CLI Tools to Copy the files from Bucket A to Bucket B.

A. Create the new bucket

$ aws s3 mb s3://new-bucket-name

B. Sync the old bucket with new bucket

$ aws s3 sync s3://old-bucket-name s3://new-bucket-name
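
If you want to preview what will be transferred before running the real copy, the same command accepts a dry-run flag (same placeholder bucket names):

$ aws s3 sync s3://old-bucket-name s3://new-bucket-name --dryrun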

Copying 20,000+ objects...

Started 17:03

Ended 17:06

Total time for 20,000+ objects = roughly 3 minutes

Once the new bucket is correctly configured (i.e. permissions, policy, etc.), you may wish to remove the old bucket.
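
Before doing so, a quick sanity check on the new bucket can confirm the sync is complete (same placeholder bucket name as above; the last two lines of output show the total object count and size):

$ aws s3 ls s3://new-bucket-name --recursive --summarize | tail -n 2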

C. Remove/delete the old bucket

$ aws s3 rb --force s3://old-bucket-name

Solution 3

The AWS CLI provides a way to copy one bucket to another in parallel processes. Taken from https://stackoverflow.com/a/40270349/371699:

The following commands will tell the AWS CLI to use 1,000 threads to execute jobs (each a small file or one part of a multipart copy) and look ahead 100,000 jobs:

aws configure set default.s3.max_concurrent_requests 1000
aws configure set default.s3.max_queue_size 100000

After running these, you can use the simple sync command as follows:

aws s3 sync s3://source-bucket/source-path s3://destination-bucket/destination-path

On an m4.xlarge machine in AWS (4 cores, 16 GB RAM), for my case (3-50 GB files) the sync/copy speed went from about 9.5 MiB/s to 700+ MiB/s, a speed increase of 70x over the default configuration.
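
Note that the two 'aws configure set' commands above simply persist these values in ~/.aws/config; the equivalent entries under the default profile look like this:

[default]
s3 =
  max_concurrent_requests = 1000
  max_queue_size = 100000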

Solution 4

I am not sure it is the best approach, but the AWS Management Console has a cut/copy/paste feature. It is very easy to use and efficient.

Solution 5

I'd imagine you've probably found a good solution by now, but for others who are encountering this problem (as I was just recently), I've crafted a simple utility specifically for the purpose of mirroring one S3 bucket to another in a highly concurrent, yet CPU and memory efficient manner.

It's on GitHub under an Apache License here: https://github.com/cobbzilla/s3s3mirror

If you decide to give it a try please let me know if you have any feedback.




Updated on September 18, 2022

Comments

  • Daniel Cukier
    Daniel Cukier over 1 year

    I have millions of files in an Amazon S3 bucket and I'd like to move these files to other buckets and folders with minimum cost, or no cost if possible. All buckets are in the same zone.

    How could I do it?

  • Noodles
    Noodles over 11 years
    Bucket Explorer seems to be working well for me (moving files between two buckets at the moment)
  • Micah
    Micah almost 11 years
    I had a great experience with s3s3mirror. I was able to set it up on an m1.small EC2 node and copy 1.5 million objects in about 2 hours. Setup was a little tough, due to my unfamiliarity with Maven and Java, but it only took a few apt-get commands on Ubuntu to get everything installed. One last note: if (like me) you're worried about running an unknown script on a big, important S3 bucket, create a special user with read-only access on the copy-from bucket and use those credentials. Zero chance of accidental deletion.
  • BenMorel
    BenMorel over 10 years
    Why repeat a solution that others have mentioned one year before?
  • Oliver Burdekin
    Oliver Burdekin almost 8 years
    Can this be applied to buckets between different accounts?
  • cobbzilla
    cobbzilla almost 8 years
    @OliverBurdekin yes there is a --cross-account-copy option (-C for short) to do this. Note that when copying across accounts, the ACLs are not copied; the owner of the destination bucket will have full permissions to the copied data.
  • Oliver Burdekin
    Oliver Burdekin almost 8 years
    Thanks @rfcreader. How can I estimate the cost of this? I'm aware of the AWS cost calculator but have no idea what this process will involve in terms of the number of GET, PUT, LS requests, etc. I imagine it's pretty easy to count up these metrics using the CLI, but if you know more please get in touch. AWS support suggested "requester pays". ha!
  • cobbzilla
    cobbzilla almost 8 years
    @OliverBurdekin s3s3mirror does keep track of the number of AWS requests by type (GET, COPY, DELETE, etc). These stats are printed out periodically when running, and one last time at the end. You could do a limited/test run to copy a small subset of the objects, this should give you a general feel for how many total requests will be required to copy the entire data set.
  • James
    James over 7 years
    That's not likely to work well with a million files.
  • Olivier Lalonde
    Olivier Lalonde about 7 years
    Where does aws s3 sync s3://source s3://destination fit in?
  • rob
    rob almost 6 years
    @James can painfully confirm that ;)
  • Marcelo Agimóvel
    Marcelo Agimóvel over 5 years
    Life saver. I'm copying 300+ GB. A tip: copying between buckets in the same region is way faster than between regions (and I read it's less expensive).
  • Marcelo Agimóvel
    Marcelo Agimóvel over 5 years
    I had a problem with your method: file privacy was all set to PRIVATE, even though most of the objects were public. What happened?