What is the difference between S3.Client.upload_file() and S3.Client.upload_fileobj()?


Solution 1

The main point with upload_fileobj is that the file object doesn't have to be stored on local disk in the first place; it can be held in RAM as a file-like object.

Python has a standard library module, io, for exactly that purpose.

The code will look like this:

import io

import boto3

s3 = boto3.client('s3')

# the payload exists only in RAM; nothing is read from disk
fo = io.BytesIO(b'my data stored as file object in RAM')
s3.upload_fileobj(fo, 'mybucket', 'hello.txt')

In that case it can perform faster, since you don't have to read the data from local disk first.

Solution 2

TL;DR

In terms of speed, both methods will perform roughly the same. Both are written in Python, and the bottleneck will be either disk I/O (reading the file from disk) or network I/O (writing to S3).

  • Use upload_file() when writing code that only handles uploading files from disk.
  • Use upload_fileobj() when writing generic code to handle S3 uploads that may be reused in the future for more than just files on disk.


What is fileobj anyway?

There is a convention in multiple places, including the Python standard library, that when one uses the term fileobj they mean a file-like object. Some libraries even expose functions that can take a file path (str) or a fileobj (file-like object) as the same parameter.
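As an illustration of that convention, here is a small sketch of a helper that accepts either a path or a file-like object through the same parameter (the function name read_payload is made up for this example):

```python
import io


def read_payload(file):
    """Accept either a file path (str) or a file-like object.

    Mirrors the path-or-fileobj convention described above;
    the name read_payload is purely illustrative.
    """
    if isinstance(file, str):
        with open(file, 'rb') as fp:
            return fp.read()
    return file.read()


# the same call works for an in-memory buffer as for a path on disk
print(read_payload(io.BytesIO(b'hello')))  # b'hello'
```

Code written against this convention never needs to care whether its input ever existed on disk.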

When using a file object, your code is not limited to disk. For example:

  1. You can copy data from one S3 object into another in streaming fashion, without using disk space or slowing down the process with read/write I/O to disk.

  2. You can compress or encrypt data on the fly when writing objects to S3 (or decompress/decrypt when reading them).
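Point 2 can be demonstrated without S3 at all: gzip.GzipFile wraps any file-like object, so the same decompressing wrapper works whether the underlying stream is a local buffer or, for example, the StreamingBody returned by an S3 get_object call. A sketch with an in-memory buffer standing in for the network stream:

```python
import gzip
import io

# compress some data into an in-memory stream, standing in
# for a gzipped object fetched over the network
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode='wb') as gz:
    gz.write(b'payload')
compressed.seek(0)

# decompress on the fly while reading -- the wrapper only needs read()
with gzip.GzipFile(fileobj=compressed, mode='rb') as reader:
    print(reader.read())  # b'payload'
```

No temporary file is created at any point; the compressed bytes are consumed straight from the stream.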

An example using the Python gzip module with a file-like object in a generic way:

import gzip, io

def gzip_greet_file(fileobj):
    """write gzipped hello message to a file"""
    with gzip.open(filename=fileobj, mode='wb') as fp:
        fp.write(b'hello!')

# using an already-opened file
with open('/tmp/a.gz', 'wb') as f:
    gzip_greet_file(f)

# using filename from disk
gzip_greet_file('/tmp/b.gz')

# using io buffer
file = io.BytesIO()
gzip_greet_file(file)
file.seek(0)
print(file.getvalue())

tarfile, on the other hand, has two separate parameters, name & fileobj:

tarfile.open(name=None, mode='r', fileobj=None, bufsize=10240, **kwargs)
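To see tarfile's fileobj parameter in action, here is a sketch that builds a tar archive entirely in memory and reads it back, never touching disk:

```python
import io
import tarfile

buf = io.BytesIO()

# write a one-member archive into the in-memory buffer
with tarfile.open(fileobj=buf, mode='w') as tar:
    data = b'hello from a tar member'
    info = tarfile.TarInfo(name='greeting.txt')
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# read it back from the same buffer -- no disk involved
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r') as tar:
    print(tar.getnames())  # ['greeting.txt']
```

Such a buffer could then be handed straight to upload_fileobj after the seek(0).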


Example compression on-the-fly with s3.upload_fileobj()

import gzip
import io
import shutil

import boto3

s3 = boto3.client('s3')


def upload_file(fileobj, bucket, key, compress=False):
    if compress:
        # gzip the incoming stream into an in-memory buffer first;
        # note GzipFile(mode='rb') would *decompress*, not compress
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            shutil.copyfileobj(fileobj, gz)
        buf.seek(0)
        fileobj = buf
        key = key + '.gz'
    s3.upload_fileobj(fileobj, bucket, key)

Solution 3

Neither is better, because they're not comparable. While the end result is the same (an object is uploaded to S3), they source that object quite differently. One expects you to supply the path on disk of the file to upload while the other expects you to provide a file-like object.

If you have a file on disk and want to upload it, then use upload_file. If you have a file-like object (which could ultimately be many things including an open file, a stream, a socket, a buffer, a string) then use upload_fileobj.

A 'file-like object' in this context is anything that implements the read method, and returns bytes.
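That contract is easy to satisfy yourself. A minimal sketch of an object implementing it (the class name ByteSource is made up for this example):

```python
class ByteSource:
    """A minimal file-like object for this purpose: the only method it
    needs is read(), and read() must return bytes. Purely illustrative."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size=-1):
        # size < 0 (or None) means "read everything remaining",
        # matching the convention of io stream objects
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += size
        return chunk


src = ByteSource(b'hello world')
print(src.read(5))  # b'hello'
print(src.read())   # b' world'
```

upload_fileobj would accept such an object just as readily as an open file or an io.BytesIO.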



Author: Flair

Updated on July 09, 2022

Comments

  • Flair, almost 2 years ago:

    According to S3.Client.upload_file and S3.Client.upload_fileobj, upload_fileobj may sound faster. But does anyone know the specifics? Should I just upload the file, or should I open the file in binary mode and use upload_fileobj? In other words,

    import boto3
    
    s3 = boto3.resource('s3')
    
    ### Version 1
    s3.meta.client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')
    
    ### Version 2
    with open('/tmp/hello.txt', 'rb') as data:
        s3.meta.client.upload_fileobj(data, 'mybucket', 'hello.txt')
    

    Is version 1 or version 2 better? Is there a difference?

  • Laurens Koppenol, over 4 years ago:
    The same source states: "The upload_file method accepts a file name, a bucket name, and an object name. The method handles large files by splitting them into smaller chunks and uploading each chunk in parallel." and "The upload_fileobj method accepts a readable file-like object. The file object must be opened in binary mode, not text mode." The answers above state just that.