Iterate over files in an S3 bucket with folder structure

When using boto3 you can only list 1,000 objects per request. So to obtain all the objects in a bucket, use S3's paginator.

client.get_paginator('list_objects_v2') is what you need here.

Something like this:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# The paginator transparently issues as many list_objects_v2 calls as needed
result = paginator.paginate(Bucket='bucketname', StartAfter='2018')

for page in result:
    # 'Contents' is absent from a page that contains no keys
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)
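
Since the question is about classifying each file by the date encoded in its folder path, you can also split each key on '/' while you iterate. This is only a sketch and assumes keys laid out like 2018/3/24/somefile; the bucket name and prefix are placeholders:

import boto3
from datetime import date

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket='bucketname', Prefix='2018/'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Assumed layout: year/month/day/filename
        parts = key.split('/')
        if len(parts) >= 4:
            year, month, day = (int(p) for p in parts[:3])
            print(key, date(year, month, day))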

From the boto3 documentation:

list_objects:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

list_objects_v2:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. Note: ListObjectsV2 is the revised List Objects API and we recommend you use this revised API for new application development.

From this answer:

list_objects_v2 has added features. Due to the 1,000-keys-per-page listing limit, using a marker to list multiple pages can be a headache: logically, you need to keep track of the last key you successfully processed. With ContinuationToken, you don't need to know the last key; you just check for the existence of NextContinuationToken in the response. You can spawn parallel processes to deal with multiple batches of 1,000 keys without tracking the last key in order to fetch the next page.
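
To make that concrete, here is a rough sketch of the loop the paginator runs for you, calling list_objects_v2 with ContinuationToken directly (the bucket name is a placeholder):

import boto3

client = boto3.client('s3')
kwargs = {'Bucket': 'bucketname'}

while True:
    response = client.list_objects_v2(**kwargs)
    for obj in response.get('Contents', []):
        print(obj['Key'])
    # NextContinuationToken only appears when more pages remain
    token = response.get('NextContinuationToken')
    if token is None:
        break
    kwargs['ContinuationToken'] = token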

Comments

  • DataDog almost 2 years

    I have an S3 bucket. Inside the bucket, we have a folder for the year, 2018, and some files we have collected for each month and day. So, as an example, 2018\3\24, 2018\3\25, and so on.

    We didn't put the dates in the files inside each day's folder.

    Basically, I want to iterate through the bucket and use the folder structure to classify each file by its 'date', since we need to load it into a different database and will need a way to identify each file.

    I've read a ton of posts on using boto3 and iterating through buckets; however, there seem to be conflicting details on whether what I need can be done.

    If there's an easier way of doing this, please suggest it.

    I got it close:

    import boto3

    s3client = boto3.client('s3')
    bucket = 'bucketname'
    startAfter = '2018'

    # Note: a single list_objects_v2 call returns at most 1,000 keys
    s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
    for obj in s3objects['Contents']:
        print(obj['Key'])
    
  • DataDog about 6 years
    Thanks for the info. This is a help. I've been playing with both the original boto and boto3, and I think where I want to end up is basically a dict with file names as the keys and the creation dates as the values, e.g. for key in mybucket.list(): print "{name}\t{created}".format(name=key.name, created=key.creation_date). It's throwing an error which I think is unrelated, but if I can arrive at something like that dictionary, it should work. (A sketch of this appears after these comments.)
  • Venkatesh Wadawadagi about 6 years
    Glad to know that it was of some help. Please 'mark answer as accepted' if it served the purpose.
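
Following up on the comment above about ending up with a dict of file names to creation dates, here is a minimal boto3 sketch. The bucket name and the 2018/ prefix are placeholders, and note that list_objects_v2 exposes each object's LastModified timestamp rather than a separate creation date:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# Build a {key: LastModified} dict for every object under the assumed 2018/ prefix
created_by_key = {}
for page in paginator.paginate(Bucket='bucketname', Prefix='2018/'):
    for obj in page.get('Contents', []):
        created_by_key[obj['Key']] = obj['LastModified']

for name, created in created_by_key.items():
    print("{name}\t{created}".format(name=name, created=created))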