Download a folder from S3 using Boto3


Solution 1

quick and dirty but it works:

import boto3
import os 

def downloadDirectoryFroms3(bucketName, remoteDirectoryName):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        # recreate the directory structure locally before downloading
        if os.path.dirname(obj.key):
            os.makedirs(os.path.dirname(obj.key), exist_ok=True)
        if obj.key.endswith('/'):  # skip "folder" placeholder objects
            continue
        bucket.download_file(obj.key, obj.key)  # save to same path

Assuming you want to download the directory foo/bar from S3, the for-loop iterates over every object whose key starts with the prefix foo/bar and downloads each file to the same relative path locally.
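
A minimal usage sketch, assuming a hypothetical bucket name my-bucket and that your AWS credentials are already configured:

# hypothetical bucket and prefix, shown only to illustrate the call
downloadDirectoryFroms3('my-bucket', 'foo/bar')
# afterwards, ./foo/bar/... mirrors s3://my-bucket/foo/bar/...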

Solution 2

A slightly less dirty modification of the accepted answer by Konstantinos Katsantonis:

import boto3
import os

s3 = boto3.resource('s3') # assumes credentials & configuration are handled outside python in .aws directory or environment variables

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)

This downloads nested subdirectories, too. I was able to download a directory with over 3000 files in it. You'll find other solutions at Boto3 to download all files from a S3 Bucket, but I don't know if they're any better.
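
For example, assuming a hypothetical bucket my-bucket, this call would mirror s3://my-bucket/foo/bar/ into ./data/bar locally:

# hypothetical bucket and prefix, for illustration only
download_s3_folder('my-bucket', 'foo/bar', local_dir='data/bar')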

Solution 3

You could also use cloudpathlib, which wraps boto3 for S3. For your use case, it's pretty simple:

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/folder/folder2/")
cp.download_to("local_folder")

Solution 4

Using boto3 you can set AWS credentials explicitly and download a dataset from S3:

import boto3
import os

# set aws credentials
s3r = boto3.resource('s3', aws_access_key_id='xxxxxxxxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
bucket = s3r.Bucket('bucket_name')

# downloading folder
prefix = 'dirname'
for obj in bucket.objects.filter(Prefix=prefix):
    # make sure the local directory exists before downloading into it
    if os.path.dirname(obj.key):
        os.makedirs(os.path.dirname(obj.key), exist_ok=True)
    if obj.key.endswith('/'):  # skip "folder" placeholder objects
        continue
    bucket.download_file(obj.key, obj.key)

If you cannot find your access_key and secret_access_key, refer to this page.
I hope this helps.
Thank you.
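
As the comments below point out, it's better not to hard-code keys in your source. A minimal alternative sketch, assuming a profile is configured in ~/.aws/credentials (the profile name my-profile is hypothetical):

import boto3

# credentials come from ~/.aws/credentials (or environment variables),
# so no secrets live in the source file
session = boto3.Session(profile_name='my-profile')  # hypothetical profile name
s3r = session.resource('s3')
bucket = s3r.Bucket('bucket_name')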

Solution 5

Another approach, building on the answer from @bjc, that leverages the built-in pathlib library and parses the S3 URI for you:

import boto3
from pathlib import Path
from urllib.parse import urlparse

def download_s3_folder(s3_uri, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        s3_uri: the s3 uri to the top level of the files you wish to download
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(urlparse(s3_uri).hostname)
    s3_path = urlparse(s3_uri).path.lstrip('/')
    if local_dir is not None:
        local_dir = Path(local_dir)
    for obj in bucket.objects.filter(Prefix=s3_path):
        target = Path(obj.key) if local_dir is None else local_dir / Path(obj.key).relative_to(s3_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, str(target))
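
For instance, assuming a hypothetical bucket and prefix, a call might look like this:

# mirrors everything under the prefix into ./local_data
download_s3_folder("s3://my-bucket/path/to/folder", local_dir="local_data")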
Comments

  • El Fadel Anas
    El Fadel Anas almost 2 years

    Using Boto3 Python SDK, I was able to download files using the method bucket.download_file()

    Is there a way to download an entire folder?

  • Arkady
    Arkady almost 5 years
    But you didn't set credentials!
  • Konstantinos Katsantonis
    Konstantinos Katsantonis almost 5 years
    @Arkady The credentials are set under ~/.aws/credentials or as environment variables. You can find more info here.
  • Patrick Pötz
    Patrick Pötz over 4 years
    Credentials can be set in different ways. See boto3.amazonaws.com/v1/documentation/api/latest/guide/…
  • Vidura Dantanarayana
    Vidura Dantanarayana over 4 years
    You can declare AWS credentials when creating the S3 resource, as follows: s3_resource = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
  • Zach Rieck
    Zach Rieck almost 4 years
    Better to avoid putting your keys in your code file. At worst, you can put your keys in a separate protected file and import them. It's also possible to use boto3 without any credentials cached and instead use either s3fs or just rely on the config file (reddit.com/r/aws/comments/73212m/…)
  • Dinero
    Dinero over 3 years
    I posted a similar question at stackoverflow.com/questions/64226700/… . Is your suggested answer the best way I can solve my problem?
  • Konstantinos Katsantonis
    Konstantinos Katsantonis over 3 years
    Well, there is almost never an absolute best :).
  • Ioannis Tsiokos
    Ioannis Tsiokos over 3 years
    To make this recursive (for directories inside directories), only download the file if not obj.key.endswith('/').
  • Alex
    Alex almost 3 years
    Does somebody know if AWS counts this as one request for billing?
  • hume
    hume almost 3 years
    Probably not. It should work out to about the same as looping over each key with boto3 (maybe with an added call to list objects, but you need that in both cases).
  • Luiz Tauffer
    Luiz Tauffer over 2 years
    For me it only worked without the trailing /... in the example above it would be: cp = CloudPath("s3://bucket/folder/folder2")
  • trungducng
    trungducng over 2 years
    @hume Can I pass a relative path to CloudPath? For example: "s3://bucket/*/*/device/"?
  • hume
    hume over 2 years
    @trungducng There is a glob method like with a normal Path that you can use to loop over those files and call download_to on each one individually. cloudpathlib.drivendata.org/stable/api-reference/s3path/…
  • user2755526
    user2755526 about 2 years
    Why isn't this the top solution!! This tool is awesome.