Backup: Amazon S3 or Glacier - lots of little files?

19,662

Solution 1

Detailed pricing information for S3 is available here. Specifics of the API functions available are here.

For S3, you are mostly charged for download bandwidth (bytes received FROM S3) and storage (bytes IN S3); upload bandwidth (bytes sent TO S3) is free. You are also charged for the number and type of API calls.

So, if you upload your 10GB of data to S3 in 10,000 1MB files, store it for a month, and then download each of the files once, you'll be charged:

  • $0.00 for upload bandwidth (this is free)
  • $0.10 for the 10,000 PUT requests to upload the files
  • $0.95 for storing the 10GB for a month
  • $1.08 for 10GB download bandwidth (the first GB is free, then $0.12/GB)
  • $0.01 for the 10,000 GET requests to download the files

That's $2.14. If you uploaded and downloaded once each but kept the data for a year, only the storage cost would go up, to 12 * $0.95, or $11.40. If your files averaged only 100KB, so you had 100,000 of them, you'd pay 10 times as much for the PUT and GET requests, or $1.10 instead of $0.11.
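The arithmetic above can be sketched in a few lines of Python. The unit rates are not quoted from a price list; they are the ones implied by the figures in this answer ($0.01 per 1,000 PUTs, $0.01 per 10,000 GETs, $0.095 per GB-month, $0.12 per GB downloaded after the first free GB), so treat them as historical assumptions. The function name `s3_cost` is just an illustrative helper.

```python
# Rates implied by the figures in this answer (historical assumptions,
# not current AWS pricing).
PUT_PER_1000 = 0.01           # $ per 1,000 PUT requests
GET_PER_10000 = 0.01          # $ per 10,000 GET requests
STORAGE_PER_GB_MONTH = 0.095  # $ per GB stored per month
DOWNLOAD_PER_GB = 0.12        # $ per GB downloaded beyond the free GB
FREE_DOWNLOAD_GB = 1          # first GB of download is free


def s3_cost(total_gb, n_files, months=1):
    """Cost of uploading once, storing for `months`, and downloading once."""
    put = n_files / 1000 * PUT_PER_1000
    storage = total_gb * STORAGE_PER_GB_MONTH * months
    download = max(total_gb - FREE_DOWNLOAD_GB, 0) * DOWNLOAD_PER_GB
    get = n_files / 10000 * GET_PER_10000
    return {
        "put": put,
        "storage": storage,
        "download": download,
        "get": get,
        "total": put + storage + download + get,
    }


if __name__ == "__main__":
    for label, cost in [
        ("10,000 x 1MB, one month", s3_cost(10, 10_000)),
        ("10,000 x 1MB, one year", s3_cost(10, 10_000, months=12)),
        ("100,000 x 100KB, one month", s3_cost(10, 100_000)),
    ]:
        print(f"{label}: ${cost['total']:.2f}")
```

Only the request terms depend on the number of files, which is why splitting the same 10GB into ten times as many files changes the bill by about a dollar.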

You can only upload and download a single file per operation. If you combined your files into one using Zip, you'd only save by using fewer operations, which, as you can see, are pretty cheap to start with.

There is one quirk here, though. I'm pretty sure you are charged for all bandwidth usage when uploading and downloading, including request headers, not just the bodies containing your data. So if your files were really tiny the request headers might become significant, perhaps as much as the files themselves. In that case your bandwidth costs would double.

Glacier pricing is more complicated, and I've never used it myself. Basically, it reduces storage cost by almost ten-fold, leaving the other costs the same, and adding costs to archive and restore per object. Those costs seem to be significant if you have a lot of small objects, need to get a lot of your files at a time, or get files frequently. Glacier seems to be best when you have a lot of data (terabytes or more, not just gigabytes), but few operations. Given that you only have 10GB of data, S3 is so inexpensive it doesn't seem worth it to consider Glacier.

Finally, AWS has a free usage tier for the first year, which looks like it would cover all your costs except for half the storage charges.

Solution 2

I know this is a bit old, but you may still find my answer helpful (I hope). The other answer is based on S3 which wasn't your question I believe.

Glacier is intended for rare file access. With that in mind, they sort of punish you if you need to retrieve many files at once. In your particular case I would suggest uploading 10,000 separate files instead of, let's say, 100 ZIP files with 100 files each. The reason is very simple. Glacier lets you retrieve for free only 5% of your total stored data per month, and that allowance is prorated daily. So if, for example, you need to download 10 photos you took on a weekend, you would be able to get those 10 photos for free if they are stored as separate files in the vault. On the other hand, if you have a ZIP file that has 100 photos inside, you'll be forced to download the whole ZIP, which will probably exceed the free allowance, meaning you'll be paying some fees for the retrieval.
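To make the daily proration concrete, here is a rough sketch of the comparison. It assumes a 30-day month, 1GB = 1,000MB, and photos of about 1MB each (10GB spread over 10,000 files); none of these numbers come from Amazon, and the helper name is made up for illustration.

```python
def daily_free_allowance_mb(stored_gb, free_fraction=0.05, days_in_month=30):
    """Free retrieval per day under the '5% per month, prorated daily' rule.

    Assumes 1GB = 1,000MB and a 30-day month for simplicity.
    """
    return stored_gb * 1000 * free_fraction / days_in_month


PHOTO_MB = 1  # assumed average photo size

allowance = daily_free_allowance_mb(10)   # 10GB stored in total
ten_photos = 10 * PHOTO_MB                # 10 separately stored photos
one_zip = 100 * PHOTO_MB                  # one ZIP holding 100 photos

print(f"free per day: {allowance:.1f} MB")
print("10 single photos fit in the free allowance:", ten_photos <= allowance)
print("one 100-photo ZIP fits in the free allowance:", one_zip <= allowance)
```

Under these assumptions the ten individual photos fit within one day's free allowance, while a single 100-photo ZIP does not.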

The only reason it makes sense to upload fewer files is to avoid a high number of upload requests (10,000 files usually mean 10,000 requests). Requests are charged at $0.05 per 1,000. These fees are much lower than retrieval fees (taking into account the limits imposed), which is why I would always recommend uploading separate files. Of course, you may zip files that make sense to keep together.

Retrieval costs are very complex in Amazon Glacier. They have a good explanation here: http://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free But even there you'll need to pay close attention to the calculations to get a clear idea of how costs are billed.

Regarding this question: Am I able to request the download of a whole Archive/Bucket or is it file-by-file?

Requests are file-by-file, although you can select many files at once and download them all together.

Deciding whether to use S3 or Glacier really depends on how often you need to access your files. If you will rarely need access to them, then Glacier is your answer. Otherwise, for 10GB, S3 can still be cheap and more flexible than Glacier. In my case, I find family photos to be a very precious thing. That's why I have a 100GB backup on Glacier with all my family photos. I don't intend to access it unless there is some kind of disaster at home. In that case, I think I would not mind the retrieval cost if that saved something I really care about. But that's just me.

Solution 3

Better to use a few larger files than lots of small ones

There are two approaches to putting files into Amazon Glacier. You either interact with vaults directly, or use S3 as a frontend.

I am using S3 (and the Amazon Management Console) so that I can see the content of the archive and at the same time have it stored cheaply in Glacier.

This approach has one drawback: storing any piece of information in Glacier carries some data overhead (which you pay for too), so there is logically a break-even point. Before the 2014-04 price reduction I did a calculation, and the critical size was about 16 kB; storing smaller files in Glacier (using AWS S3 as a frontend) was more expensive than keeping them only on S3. With the price reduction for S3 storage (Glacier did not change), the break-even point went even higher.

I guess that even without S3 as a frontend the situation will be similar, though a bit more friendly to smaller files.

Solution 4

On November 21, 2016, Amazon changed the free tier policy for Glacier retrievals, replacing the "5% of your average monthly storage" policy with a flat 10GB free per month. However, if your retrieval policy was set prior to that day, then you're still on the "5%" policy and the other answers here still apply to you.

If your retrieval policy was set after Nov 21, 2016, and you're in the OP's shoes:

You're only storing 10GB, so you could retrieve all of your data for free once per month using Standard retrievals. It would make no difference if all 10,000 photos are zipped into one zip file or not (for retrievals).

The only variable in this scenario is number of upload requests. 10,000 requests at a price of $0.05 per 1,000 is only $0.50 and that's a one time fee for your specific case.
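As a quick sanity check, the request fee can be compared for separate files versus a handful of ZIPs. This uses only the $0.05 per 1,000 requests figure quoted above and assumes one upload request per file; the helper name is illustrative.

```python
REQUEST_PRICE_PER_1000 = 0.05  # $ per 1,000 upload requests, as quoted above


def upload_request_cost(n_requests):
    """One-time request fee, assuming one upload request per file."""
    return n_requests / 1000 * REQUEST_PRICE_PER_1000


separate = upload_request_cost(10_000)  # every photo uploaded on its own
zipped = upload_request_cost(100)       # 100 ZIPs of 100 photos each

print(f"10,000 separate files: ${separate:.2f}")
print(f"100 ZIP files:         ${zipped:.3f}")
print(f"difference:            ${separate - zipped:.3f}")
```

Either way the one-time fee stays under a dollar for this amount of data.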

More pricing info at AWS Glacier FAQ

UPDATE:

Glacier docs recommend using multipart upload for files larger than 100MB.

I came to this conclusion independently after a couple timeouts when trying to upload an 8GB file.
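For a sense of what a multipart upload involves, the sketch below only computes the byte ranges of the parts; it does not call AWS. It assumes, as I understand the Glacier documentation, that the part size must be 1 MiB times a power of two, up to 4 GiB. The function name and the 128 MiB default are my own choices for illustration.

```python
MIB = 1024 * 1024


def part_ranges(archive_size, part_size=128 * MIB):
    """Yield inclusive (start, end) byte ranges that cover the archive.

    Assumption: part_size must be 1 MiB times a power of two, at most 4 GiB.
    The last part may be shorter than part_size.
    """
    is_power_of_two = part_size > 0 and part_size & (part_size - 1) == 0
    if not is_power_of_two or not MIB <= part_size <= 4096 * MIB:
        raise ValueError("part_size must be 1 MiB * 2**n, up to 4 GiB")
    start = 0
    while start < archive_size:
        end = min(start + part_size, archive_size) - 1
        yield start, end
        start = end + 1


if __name__ == "__main__":
    size = 8 * 1024 * MIB  # an 8 GiB archive
    ranges = list(part_ranges(size))
    print(len(ranges), "parts")
    print("first part: bytes %d-%d" % ranges[0])
    print("last part:  bytes %d-%d" % ranges[-1])
```

Each part can then be uploaded, and retried on a timeout, independently of the others, which is the practical benefit for a file of that size.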

Author by

Markive

Updated on August 23, 2022

Comments

  • Markive
    Markive over 1 year

    I'm trying to understand the complicated Amazon Glacier pricing model. I don't want to store a huge amount of data, just a few GB, say 10. I hope never to download the files, and if I did need to, I don't care how long it takes.

    Is there a cost per file I upload? Is it cheaper to zip lots of tiny files and upload them in a few chunks, or does having, say, 10,000 images not matter? (I cannot get a straight answer to this by searching.)

    Am I able to request the download of a whole Archive/Bucket or is it file-by-file?

  • Markive
    Markive about 11 years
    At < $15/yr it is a bit miserly even considering Glacier. So with Glacier can you only request one file at a time, and get charged that way.. Surely you can do a whole archive?
  • Marc Rochkind
    Marc Rochkind about 9 years
    I agree with the large files idea, but not for the reason you give. Glacier is best considered the backup of last resort. You would retrieve a file only if all your other backups failed, and those should be more accessible. (Hard drives stored offsite, DVDs, etc.) So, cost to retrieve is irrelevant, as paying for the retrieval when you've had a total disaster is not a problem. The issue is reliability of the upload. With thousands of files, it's impractical to manage the checksums to verify that the uploads went OK.
  • Jan Vlcinsky
    Jan Vlcinsky about 9 years
    @MarcRochkind The OP was about costs, so ignoring the cost aspect would be ignoring the question. Also note that I do not talk about retrieval costs, but about storage costs, which have a break-even point. If one does not care about costs, then I would recommend using AWS S3, as it is much easier to handle than Amazon Glacier.
  • Erik Sjölund
    Erik Sjölund about 8 years
    Maybe things have changed since 2013, but looking at aws.amazon.com/s3/pricing it seems there is no transfer fee for uploading. Maybe this sentence in the answer could be rephrased: "For S3, you are mostly charged for upload bandwidth (bytes sent TO S3) ...".