EntityTooLarge error when uploading a 5G file to Amazon S3
Solution 1
The object size is limited to 5 TB. The upload size is still 5 GB, as explained in the manual:
Depending on the size of the data you are uploading, Amazon S3 offers the following options:
Upload objects in a single operation—With a single
PUT
operation you can upload objects up to 5 GB in size.Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.
http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
Once you do a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5TB in size, that can be downloaded as a single entitity, with a single HTTP GET
request... but uploading is potentially much faster, even on files smaller than 5GB, since you can upload the parts in parallel and even retry the uploads of any parts that didn't succeed on first attempt.
Solution 2
If you are using aws cli for the upload, you can use 'aws s3 cp' command so it does not require splitting and multi part upload
aws s3 cp masive-file.ova s3://<your-bucket>/<prefix>/masive-file.ova
Solution 3
The trick usually seems to be figuring out how to tell S3 to do a multipart upload. For copying data from HDFS to S3, this can be done by using the s3n filesystem and specifically enabling multipart uploads with fs.s3n.multipart.uploads.enabled=true
This can be done like:
hdfs dfs -Dfs.s3n.awsAccessKeyId=ACCESS_KEY -Dfs.s3n.awsSecretAccessKey=SUPER_SECRET_KEY -Dfs.s3n.multipart.uploads.enabled=true -cp hdfs:///path/to/source/data s3n://bucket/folder/
And further configuration can be found here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
Daniel Mahler
Updated on June 11, 2022Comments
-
Daniel Mahler almost 2 years
Amazon S3 file size limit is supposed to be 5T according to this announcement, but I am getting the following error when uploading a 5G file
'/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?> <Error> <Code>EntityTooLarge</Code> <Message>Your proposed upload exceeds the maximum allowed size</Message> <ProposedSize>5374138340</ProposedSize> ... <MaxSizeAllowed>5368709120</MaxSizeAllowed> </Error>
This makes it seem like S3 is only accepting 5G uploads. I am using Apache Spark SQL to write out a Parquet data set using
SchemRDD.saveAsParquetFile
method. The full stack trace isorg.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source) org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174) org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321) parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111) parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305) org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745)
Is the upload limit still 5T? If it is why am I getting this error and how do I fix it?
-
Martin Thoma over 4 yearsFor Python users: Complete a multipart_upload with boto3?
-
-
Sean about 5 yearsGlad to hear it!
-
Raj about 5 yearsstackoverflow.com/questions/55427694/… Can anybody please help me regarding this ?
-
VIPIN KUMAR over 3 yearsSometimes all you need is a simple command.