Need help deciding between EBS vs S3 on Amazon Web Services


Solution 1

If your service is going to be used by an undetermined number of users, bear in mind that scalability will always be a concern regardless of the option you adopt: you will need to scale the service to meet demand. It is therefore convenient to assume that your service will run in an Auto Scaling Group with a pool of EC2 instances rather than on a single instance.

Regarding protecting the URLs so that only authorized users can download the files, there are many ways to do this without requiring your service to act as an intermediary, but you will need to deal with at least two issues:

  1. File name predictability: to avoid predictable URLs, name each uploaded file with a hash and store the original filenames and ownership in a database such as SimpleDB. Optionally, set an HTTP header such as "Content-Disposition: filename=original_file_name.ext" to advise the user's browser to name the downloaded file accordingly.

  2. Authorization: when the user asks your service to download a given file, issue a temporary authorization using Query String Authentication or Temporary Security Credentials for that specific user, granting read access to the file for a limited period of time, then redirect to the S3 bucket URL for direct download (a sketch follows this list). This greatly offloads your EC2 pool, making the instances available to process other requests more quickly.
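
A minimal sketch of points 1 and 2 together, assuming Python with boto3 for illustration (the original question does not specify a language); the bucket, hashed key, and filename below are placeholders:

```python
# Issue a short-lived presigned (query-string-authenticated) GET URL for a
# hashed S3 key, overriding Content-Disposition so the browser saves the file
# under its original name. All identifiers below are hypothetical.
import boto3

s3 = boto3.client("s3")

def presigned_download_url(bucket, hashed_key, original_name, expires_seconds=300):
    """Return a temporary URL granting read access to a single object."""
    return s3.generate_presigned_url(
        "get_object",
        Params={
            "Bucket": bucket,
            "Key": hashed_key,
            # Advise the browser to save the file under its original name.
            "ResponseContentDisposition": f'attachment; filename="{original_name}"',
        },
        ExpiresIn=expires_seconds,  # the URL stops working after this many seconds
    )

# The download endpoint would authenticate the user, look up the hashed key and
# original filename in the database, then redirect to this URL.
url = presigned_download_url("my-bucket", "3f7a9c0de2", "original_file_name.ext")
```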

To reduce the space used and the traffic to your S3 bucket (remember you pay per GB stored and transferred), I would also recommend compressing each individual file with a standard algorithm like gzip before uploading it to S3, and setting the header "Content-Encoding: gzip" so that automatic decompression works in the user's browser. If your programming language of choice is Java, I suggest taking a look at the plugin webcache-s3-maven-plugin that I created to upload static resources from web projects.
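
As a hedged illustration of the gzip/Content-Encoding idea, again assuming Python with boto3, with placeholder bucket and key names:

```python
# Compress a file with gzip and upload it with Content-Encoding: gzip so that
# browsers decompress it transparently on download. Names are placeholders.
import gzip
import boto3

s3 = boto3.client("s3")

def upload_gzipped(bucket, key, local_path, content_type="application/octet-stream"):
    with open(local_path, "rb") as f:
        compressed = gzip.compress(f.read())
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentType=content_type,
        ContentEncoding="gzip",  # tells the browser to un-gzip automatically
    )

upload_gzipped("my-bucket", "a1b2c3d4", "/tmp/original_file_name.ext")
```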

Regarding the processing time for compressing a folder, you will frequently be unable to guarantee that a folder is compressed quickly enough for the user to download it immediately, since there will occasionally be huge folders that take minutes or even hours to compress. For this I suggest using the SQS and SNS services to process compression asynchronously (a sketch of the backend worker follows the list below). It would work as follows:

  1. user requests folder compression
  2. the frontend EC2 instance creates a compression request in an SQS queue
  3. a backend EC2 instance consumes the compression request from the SQS queue
  4. the backend instance downloads the files from S3 to a local drive; since the generated files are temporary, I would suggest choosing at least m1.small instances with ephemeral (instance-store) disks, which are local to the virtual machine, in order to reduce I/O latency and processing time.
  5. after the compressed file is generated, the service uploads it to the S3 bucket, optionally setting the Object Expiration properties so that S3 deletes the file automatically after a certain period of time (again, to reduce your storage costs), and publishes a notification that the file is ready for download to an SNS topic.
  6. if the user is still online, read the notification from the topic and notify the user that the zip file is ready to be downloaded; if the notification does not arrive after a while, tell the user that compression is taking longer than expected and that the service will notify them by e-mail as soon as the file is ready.
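
A rough sketch of the backend worker (steps 3-5), assuming Python with boto3; the queue URL, topic ARN, bucket, message format, and scratch directory are all hypothetical, and error handling plus the S3 expiration/lifecycle setup are omitted:

```python
# Poll SQS for compression requests, build the zip on local scratch disk,
# upload the result to S3, and announce completion on an SNS topic.
import json
import os
import zipfile
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
sns = boto3.client("sns")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/compression-requests"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:compression-done"
BUCKET = "my-bucket"
SCRATCH = "/mnt/ephemeral"  # instance-store (ephemeral) disk for temp files
os.makedirs(SCRATCH, exist_ok=True)

while True:
    # Long-poll the queue; each message body looks like
    # {"keys": ["a1b2c3d4", "e5f6a7b8"], "zip_key": "zips/9c0d1e2f.zip"}
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        req = json.loads(msg["Body"])
        zip_path = os.path.join(SCRATCH, "bundle.zip")

        # Download each requested object and add it to the zip archive.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for key in req["keys"]:
                local = os.path.join(SCRATCH, os.path.basename(key))
                s3.download_file(BUCKET, key, local)
                zf.write(local, arcname=os.path.basename(key))

        # Upload the archive and announce that it is ready.
        s3.upload_file(zip_path, BUCKET, req["zip_key"])
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"zip_key": req["zip_key"]}))

        # Remove the message so it is not processed again.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```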

In this scenario you could have two Auto Scaling Groups, frontend and backend respectively, which may have different scalability constraints.

Solution 2

If you insist on serving the zip files directly from your EC2 instance, using S3 will just be more complicated than storing them locally. But S3 is much more durable than any EC2 storage volume, so I'd recommend using it anyway if the files need to be kept for a long time.

You say you don't want to expose the file URLs directly. If that's just because you don't want people to be able to bookmark them and bypass your service's authentication in the future, S3 has a great solution:

1 - Store the files you want to serve (zipped up if you want it that way) in a private S3 bucket.

2 - When a user requests a file, authenticate the request and then redirect valid requests to a signed, temporary S3 URL for the file (a sketch appears below). There are plenty of libraries in a variety of languages that can create those URLs.

3 - The user downloads the file directly from S3, without it having to pass through your EC2 instance. That saves you bandwidth and time, and probably gives the fastest download possible to the user.

This does expose a URL, but that's probably okay. There's no problem if the user saves the URL, because it will not work after the expiration time you set on it. For my service I set that time to 5 minutes. Since it is digitally signed, the user can't change the expiration time in the URL without invalidating the signature.
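
An illustrative sketch of steps 2 and 3, assuming a Python/Flask frontend with boto3 (neither is specified in the question); the auth check, key lookup, and bucket name are placeholder stubs:

```python
import boto3
from flask import Flask, abort, redirect

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-private-bucket"  # placeholder

def user_is_authorized(file_id):
    # Placeholder: check the session/token against your own user database.
    return True

def lookup_s3_key(file_id):
    # Placeholder: map the public file id to the private S3 key.
    return f"files/{file_id}"

@app.route("/download/<file_id>")
def download(file_id):
    if not user_is_authorized(file_id):
        abort(403)
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": lookup_s3_key(file_id)},
        ExpiresIn=300,  # signed URL expires after 5 minutes
    )
    # Redirect so the browser downloads straight from S3, not through EC2.
    return redirect(url)
```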

Solution 3

Using S3 is the better option for this use case. It scales better and will be simpler. Why are you concerned about it being slow? Transfers between EC2 and S3 are pretty snappy.

Solution 4

Some considerations:

  1. EBS volume cost is several times that of S3.
  2. The EBS volume size limit is 16 TB, so that should not be an issue. However, volumes of that size are very expensive.
  3. Make sure that your bucket is located in the same region as your EC2 instances.
  4. Use VPC endpoints to communicate with S3. This keeps S3 traffic on the AWS network and is much faster (see the sketch after this list).
  5. Make sure that your EC2 instance type has the network bandwidth that you need. CPU and network bandwidth go up with instance size.
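
For point 4, creating an S3 gateway endpoint can be scripted; a hedged sketch assuming Python with boto3, with placeholder VPC, route table, and region values:

```python
# Create a gateway VPC endpoint for S3 so EC2 <-> S3 traffic stays on the AWS
# network instead of going over the public internet. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```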

I would keep everything on S3 and download the files as required to zip them into a package. Then upload the zip to S3 and deliver to the user an S3 signed URL to download it directly from S3.

You could allow the user to download from your EC2 instance, but lots of users run into download errors, retries, slow bandwidth, etc. If the zip files are small (less than 100 MB), deliver them locally; otherwise, upload them to S3 and let S3 deal with the user download issues.

Another option would be to create a Lambda function that builds the zip file and stores it on S3. Then you don't have to worry about network bandwidth or scaling. The Lambda function could either return the S3 URL to you, which you deliver to the browser, or e-mail the customer a link (look into SES for this). Note: the Lambda file system only has 512 MB of space, and memory can be allocated up to 1.5 GB. If you are generating zip files larger than this, Lambda won't work (at this time). However, you could create multiple zip files (part1, part2, ...).
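
A sketch of what such a Lambda handler could look like, assuming Python with boto3; the event shape, bucket, and keys are hypothetical, and the archive must fit in Lambda's /tmp scratch space:

```python
# Lambda handler: download the requested objects to /tmp, zip them, upload the
# archive back to S3, and return a short-lived signed URL for the browser.
import os
import zipfile
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["bucket"]
    keys = event["keys"]        # list of S3 keys to bundle (hypothetical event shape)
    zip_key = event["zip_key"]  # where to store the finished archive

    zip_path = "/tmp/bundle.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for key in keys:
            local = os.path.join("/tmp", os.path.basename(key))
            s3.download_file(bucket, key, local)
            zf.write(local, arcname=os.path.basename(key))

    s3.upload_file(zip_path, bucket, zip_key)

    # Hand back a signed URL that the frontend can pass to the browser.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": zip_key},
        ExpiresIn=300,
    )
```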


Comments

  • andrewvnice

    I'm working on a project that incorporates file storage and sharing features, and after months of researching the best method to leverage AWS, I'm still a little concerned.

    Basically my decision is between using EBS storage to house user files or S3. The system will incorporate on-the-fly zip archiving when the user wants to download a handful of files. Also, when users download any files I don't want the URL to the files exposed.

    The two best options I've come up with are:

    1. Have an EC2 instance which has a number of EBS volumes mounted to store user files.

      • pros: It seems much faster than S3, and zipping files from the EBS volume is straightforward.
      • cons: I believe Amazon caps how much EBS storage you can use, and it is not as redundant as S3.
    2. After files are uploaded and processed, the system pushes those files to an S3 bucket for long-term storage. When files are requested I will retrieve them from S3 and output them back to the client.

      • pros: Redundancy, no file storage limits
      • cons: It seems very SLOW, there is no way to mount an S3 bucket as a volume in the filesystem, and serving zipped files would mean transferring each file to the EC2 instance, zipping them, and then finally sending the output (again, slow!)

    Are any of my assumptions flawed? Can anyone think of a better way of managing massive amounts of file storage?