EntityTooLarge error when uploading a 5G file to Amazon S3

Solution 1

The 5 TB limit applies to the size of a stored object. A single upload request (one PUT) is still limited to 5 GB, as the manual explains:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

  • Upload objects in a single operation—With a single PUT operation you can upload objects up to 5 GB in size.

  • Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
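
As a quick sanity check of the numbers in the question's error message, here is a minimal sketch. The constant and function names are illustrative only and not part of any AWS SDK; the sizes are copied from the `<ProposedSize>` and `<MaxSizeAllowed>` fields of the error.

```python
# Single-PUT limit reported by S3 in <MaxSizeAllowed>: 5 GiB.
SINGLE_PUT_LIMIT = 5 * 1024**3


def needs_multipart(size_bytes: int) -> bool:
    """Return True when an object is too large for one PUT request."""
    return size_bytes > SINGLE_PUT_LIMIT


proposed_size = 5374138340  # <ProposedSize> from the error message

print(SINGLE_PUT_LIMIT)                  # 5368709120
print(proposed_size - SINGLE_PUT_LIMIT)  # 5429220 bytes over the limit
print(needs_multipart(proposed_size))    # True
```

So the file in the question is only about 5 MB over the single-PUT limit, which is why it is rejected even though it is far below the 5 TB object limit.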

Once a multipart upload completes, S3 validates and recombines the parts into a single object of up to 5 TB, which can be downloaded as a single entity with one HTTP GET request. Uploading this way is also potentially much faster, even for files smaller than 5 GB, because you can upload the parts in parallel and retry any part that did not succeed on the first attempt.
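
To illustrate how a file gets divided for a multipart upload, here is a rough sketch that only computes the part boundaries and does not talk to S3. The function name `plan_parts` and the 64 MiB default part size are assumptions made for this example. The limits (5 MiB minimum part size except for the last part, 5 GiB maximum part size, at most 10,000 parts) are taken from the S3 multipart upload documentation as I understand it and should be checked against the current docs.

```python
import math

MIN_PART_SIZE = 5 * 1024**2   # 5 MiB; every part except the last must be at least this big
MAX_PART_SIZE = 5 * 1024**3   # 5 GiB; upper bound for a single part
MAX_PARTS = 10_000            # maximum number of parts in one multipart upload


def plan_parts(total_size, part_size=64 * 1024**2):
    """Return a list of (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size if needed so the upload fits within MAX_PARTS parts.
    part_size = max(part_size, MIN_PART_SIZE, math.ceil(total_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("object is too large for a multipart upload")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


# The 5374138340-byte file from the question, split into 64 MiB parts.
parts = plan_parts(5374138340)
print(len(parts))   # 81
print(parts[-1])    # (81, 5368709120, 5429220)
```

Each tuple describes an independent byte range, so the parts can be uploaded in parallel and a failed part can be retried without resending the rest of the file.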

Solution 2

If you are using the AWS CLI for the upload, you can use the 'aws s3 cp' command. It switches to multipart upload automatically for large files, so you do not have to split the file yourself:

aws s3 cp masive-file.ova s3://<your-bucket>/<prefix>/masive-file.ova

Solution 3

The trick usually seems to be figuring out how to tell S3 to do a multipart upload. For copying data from HDFS to S3, this can be done by using the s3n filesystem and specifically enabling multipart uploads with fs.s3n.multipart.uploads.enabled=true

This can be done like:

hdfs dfs -Dfs.s3n.awsAccessKeyId=ACCESS_KEY -Dfs.s3n.awsSecretAccessKey=SUPER_SECRET_KEY -Dfs.s3n.multipart.uploads.enabled=true -cp hdfs:///path/to/source/data s3n://bucket/folder/

And further configuration can be found here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

Author: Daniel Mahler

Updated on June 11, 2022

Comments

  • Daniel Mahler
    Daniel Mahler almost 2 years

    The Amazon S3 file size limit is supposed to be 5 TB according to this announcement, but I am getting the following error when uploading a 5 GB file:

    '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: 
      <?xml version="1.0" encoding="UTF-8"?>
      <Error>
        <Code>EntityTooLarge</Code>
        <Message>Your proposed upload exceeds the maximum allowed size</Message>
        <ProposedSize>5374138340</ProposedSize>
        ...
        <MaxSizeAllowed>5368709120</MaxSizeAllowed>
      </Error>
    

    This makes it seem like S3 is only accepting uploads of up to 5 GB. I am using Apache Spark SQL to write out a Parquet data set using the SchemaRDD.saveAsParquetFile method. The full stack trace is:

    org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
            org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
            sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            java.lang.reflect.Method.invoke(Method.java:606)
            org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
            org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
            org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
            org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
            org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
            org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
            parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
            parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
            parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
            org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
            org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
            org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
            org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
            org.apache.spark.scheduler.Task.run(Task.scala:54)
            org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            java.lang.Thread.run(Thread.java:745)
    

    Is the upload limit still 5 TB? If it is, why am I getting this error, and how do I fix it?

  • Sean
    Sean about 5 years
    Glad to hear it!
  • Raj
    Raj about 5 years
    stackoverflow.com/questions/55427694/… Can anybody please help me regarding this ?
  • VIPIN KUMAR
    VIPIN KUMAR over 3 years
    Sometimes all you need is a simple command.