How to query items from AWS S3 by date created


Solution 1

Update 3/19/2019

Apparently s3api allows you to do this quite easily.

One solution is to use s3api. It works easily if you have fewer than 1000 objects; otherwise you need to work with pagination.

s3api can list all objects and exposes a LastModified attribute for every key stored in S3. The listing can then be sorted, or filtered for files modified after, before, or on a given date.

Examples of running such queries:

  1. all files for a given date

    # Double quotes let the shell expand $DATE; the JMESPath string literal uses single quotes
    DATE=$(date +%Y-%m-%d)
    aws s3api list-objects-v2 --bucket test-bucket-fh \
      --query "Contents[?contains(LastModified, '$DATE')]"
    
  2. all files after a certain date

    # BSD/macOS date syntax (one week ago); on GNU/Linux use: date -d '1 week ago' +%F
    export LASTWEEK=$(date -v-1w +%F)
    aws s3api list-objects-v2 --bucket test-bucket-fh \
      --query "Contents[?LastModified > '$LASTWEEK']"
    

s3api returns several metadata fields per object, so you can project just the element you need, for example only the key names:

DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?contains(LastModified, '$DATE')].Key"
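JMESPath's built-in sort_by can also answer the "most recently created file" case directly, by ordering the listing on LastModified (the bucket and prefix names here are illustrative):

```shell
# sort_by orders ascending by LastModified, so [-1] selects the newest key
aws s3api list-objects-v2 --bucket images --prefix "user1/" \
  --query 'sort_by(Contents, &LastModified)[-1].Key' --output text
```

Note this still lists every object under the prefix server-side; the sorting happens client-side in the CLI.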

OLD ANSWER

AWS-SDK/CLI really should implement some sort of retrieve-by-date flag; it would make life easier and cheaper.

If you have not prefixed/labelled your files with the dates, you may also want to try using the flag

--start-after (string)

If you know the latest file you want to start listing from, you can use the list-objects-v2 command with the --start-after flag.

"StartAfter is where you want Amazon S3 to start listing from. Amazon S3 starts listing after this specified key. StartAfter can be any key in the bucket"

Note that --start-after will keep listing objects after the given key, so if you would like to limit the number of items, also specify the --max-items flag.

https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects-v2.html
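A sketch combining the two flags (the key name is hypothetical; this works for date queries only because S3 lists keys in lexicographic order, which lines up with dates when key names embed them in year-month-day form):

```shell
# List at most 100 keys that sort lexicographically after the given key
aws s3api list-objects-v2 --bucket test-bucket-fh \
  --start-after "images/user1/2019-03-18_photo.jpg" \
  --max-items 100 \
  --query 'Contents[].Key'
```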

Solution 2

S3 can list all objects in a bucket, or all objects with a prefix (such as a "directory"). However, this isn't a cheap operation, and it's certainly not designed to be done on every request.

Generally speaking, you are best served by a database layer for this. It can be something light and fast (like Redis), but you should know which objects you have and which one you need for a given request.

You can somewhat cheat by copying objects twice: for instance, to images/latest.jpg or images/user1/latest.jpg. But for the "date query" example, you should certainly do this externally to S3.
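A sketch of that double-copy trick, with made-up file and bucket names: each upload writes a dated key and then overwrites a stable "latest" key.

```shell
# Build the dated key from today's date, then keep a well-known "latest" alias
DATE=$(date +%Y-%m-%d)
aws s3 cp photo.jpg "s3://images/user1/${DATE}_photo.jpg"
aws s3 cp photo.jpg s3://images/user1/latest.jpg
```

The cost is a second PUT per upload, but reading "the newest image" becomes a single GET with no listing at all.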

Solution 3

You could store the files prefixed by date in the final directory eg:

images/user1/2016-01-12_{actual file name}

Then, in the script doing the querying, you can generate the list of dates in the time period, construct the prefixes accordingly, query S3 for each date separately, and merge the results. This should be much faster than getting the full list and filtering on the LastModified field (though that depends on how many files you have in the given directory; anything under 1000 is probably not worth the effort).
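A minimal sketch of that approach, with hypothetical bucket and path names (the date list is hard-coded here, since generating a date range portably differs between GNU and BSD date):

```shell
# Query each date prefix separately and merge the results
BUCKET=images
for d in 2016-01-10 2016-01-11 2016-01-12; do
  aws s3api list-objects-v2 --bucket "$BUCKET" \
    --prefix "user1/${d}_" --query 'Contents[].Key' --output text
done
```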

There is actually a better method, using the Marker parameter of the listObjects call: set the marker to a key, and listObjects will return only keys which sort after that one in the directory. This works because we have dates and times in the key names.
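For example, with the v1 list-objects call and its --marker parameter (the key names here are hypothetical, embedding date and time as described):

```shell
# Returns only keys that sort lexicographically after the marker key
aws s3api list-objects --bucket images \
  --marker "user1/2016-01-12_000000" \
  --query 'Contents[].Key' --output text
```

In list-objects-v2 the equivalent parameter is --start-after, as mentioned in Solution 1.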

Author by zebra

Updated on August 21, 2022

Comments

  • zebra, over 1 year ago:

    I want to query items from S3 within a specific subdirectory in a bucket by the date/time that they were added to S3. I haven't been able to find any explicit documentation around this, so I'm wondering how it can be accomplished?

    The types of queries I want to perform look like this...

    1. Return the URL of the most recently created file in S3 bucket images under the directory images/user1/
    2. Return the URLs of all items created between datetime X and datetime Y in the S3 bucket images under the directory images/user1