Can I stream a file upload to S3 without a content-length header?
Solution 1
You have to upload your file in 5MiB+ chunks via S3's multipart API. Each of those chunks requires a Content-Length but you can avoid loading huge amounts of data (100MiB+) into memory.
- Initiate S3 Multipart Upload.
- Gather data into a buffer until that buffer reaches S3's lower chunk-size limit (5MiB). Generate MD5 checksum while building up the buffer.
- Upload that buffer as a Part, store the ETag (read the docs on that one).
- Once you reach EOF of your data, upload the last chunk (which can be smaller than 5MiB).
- Finalize the Multipart Upload.
S3 allows up to 10,000 parts. So by choosing a part-size of 5MiB you will be able to upload dynamic files of up to 50GiB. Should be enough for most use-cases.
However, if you need more, you have to increase your part-size: either use a larger fixed size (10MiB, for example) or grow it during the upload, for instance:
- First 25 parts: 5MiB (total: 125MiB)
- Next 25 parts: 10MiB (total: 375MiB)
- Next 25 parts: 25MiB (total: 1GiB)
- Next 25 parts: 50MiB (total: 2.25GiB)
- After that: 100MiB
This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.
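For reference, here is a minimal Python sketch of the fixed-5MiB variant using boto3. The bucket name, object key, and the stream_to_s3 helper are made-up placeholders (not part of the original answer), and it assumes a non-empty stream of byte chunks:

import base64
import hashlib

import boto3

BUCKET = "my-bucket"          # placeholder bucket name
KEY = "streamed-object"       # placeholder object key
PART_SIZE = 5 * 1024 * 1024   # S3's minimum part size; only the last part may be smaller

s3 = boto3.client("s3")

def stream_to_s3(chunks):
    """Upload an iterable of byte chunks whose total length is unknown up front."""
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]
    parts = []
    buffer = b""
    part_number = 1

    def send(body):
        nonlocal part_number
        # Each part has its own known length and MD5 checksum; the total size is never needed.
        resp = s3.upload_part(
            Bucket=BUCKET,
            Key=KEY,
            UploadId=upload_id,
            PartNumber=part_number,
            Body=body,
            ContentMD5=base64.b64encode(hashlib.md5(body).digest()).decode(),
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

    try:
        for chunk in chunks:
            buffer += chunk
            while len(buffer) >= PART_SIZE:   # flush full 5MiB parts as they fill up
                send(buffer[:PART_SIZE])
                buffer = buffer[PART_SIZE:]
        if buffer:                            # final part may be smaller than 5MiB
            send(buffer)
        s3.complete_multipart_upload(
            Bucket=BUCKET,
            Key=KEY,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the incomplete upload does not keep accruing storage charges.
        s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
        raise

Calling stream_to_s3 with any iterable of byte strings drains it piece by piece, so only about one part's worth of data is ever held in memory.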
A note on your link to Sean O'Donnell's blog:
His problem is different from yours - he knows and uses the Content-Length before the upload. He wants to improve on this situation: Many libraries handle uploads by loading all data from a file into memory. In pseudo-code that would be something like this:
data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()
His solution does it by getting the Content-Length via the filesystem API. He then streams the data from disk into the request-stream. In pseudo-code:
upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()
input = File.open(file_name, File::READONLY_FLAG)
while (data = input.read())
upload.write(data)
end
upload.flush()
upload.close()
Solution 2
Putting this answer here for others in case it helps:
If you don't know the length of the data you are streaming up to S3, you can use S3FileInfo and its OpenWrite() method (from the AWS SDK for .NET's Amazon.S3.IO namespace) to write arbitrary data into S3.
var fileInfo = new S3FileInfo(amazonS3Client, "MyBucket", "streamed-file.txt");
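// OpenWrite returns a writable stream; the bytes written to it become the contents of streamed-file.txt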
using (var outputStream = fileInfo.OpenWrite())
{
using (var streamWriter = new StreamWriter(outputStream))
{
streamWriter.WriteLine("Hello world");
// You can do as many writes as you want here
}
}
Solution 3
You can use the gof3r command-line tool to stream data to S3 straight from a Linux pipe:
$ tar -czf - <my_dir/> | gof3r put --bucket <s3_bucket> --key <s3_object>
Solution 4
If you are using Node.js you can use a plugin like s3-streaming-upload to accomplish this quite easily.
Solution 5
Read up on HTTP multipart entity requests. You can send a file as chunks of data to the target.
Updated on September 24, 2021

Comments
-
Tyler over 2 years
I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file in a streaming manner to S3. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using multipart or chunked content-type.
S3 can support streaming uploads. For example, see here:
http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/
My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?
-
Radim over 9 years
The smart_open Python library does that for you (streamed read and write).
-
Ermiya Eskandary about 2 years
10 years later & the AWS S3 SDKs still don't have a managed way to do this - as someone who is hugely invested in the AWS ecosystem, it's very disappointing to see this in comparison to object management SDKs provided by other cloud providers. This is a core feature missing.
-
Steve K almost 10 years
Is there a Java equivalent of these classes?
-
sigget over 9 years
A Java implementation of this in the form of an OutputStream exists in s3distcp: github.com/libin/s3distcp/blob/master/src/main/java/com/amazon/…
-
Alex Hall over 8 years
I've created an open source library dedicated to this at github.com/alexmojaki/s3-stream-upload
-
at0mzk over 7 years
Isn't the length of "Hello world" known? Does it work if the input is a stream?
-
Landon Kuhn over 7 years
Where did you find the 5MiB limit?
-
chrismarx over 5 years
Looks like you can also use the cli now with pipe - github.com/aws/aws-cli/pull/903
-
Admin almost 5 years
Is there a way to just do tar -czf - <my_dir/> | aws s3 --something-or-other ?
-
xiaochuanQ over 4 years
Not supported in .NET Core, due to the synchronous nature of the Amazon.S3.IO APIs, per Microsoft.
-
Tushar Kolhe about 4 years
@AlexHall any Python implementation?
-
Alex Hall about 4 years
@TusharKolhe Googling "python stream multipart upload s3" I found stackoverflow.com/questions/31031463/… and stackoverflow.com/questions/52825430/…, and it looks like there were more results.
-
Tushar Kolhe about 4 years
@AlexHall Thanks, I figured out the way; this is the actual problem that I'm trying to solve: stackoverflow.com/questions/61696155/…. In the case of a file already on disk I'm able to do this, but I want to upload streaming frames.