How to query items from AWS S3 by date created
Solution 1
Update 3/19/2019

Apparently the s3api lets you do this quite easily. It works well if you have fewer than 1000 objects; otherwise you need to handle pagination. s3api can list all objects and exposes the LastModified attribute of every key stored in S3, so the results can be sorted, or filtered for files created after, before, or on a given date.
Examples of running such an option (note the double quotes around the query, so the shell actually expands the variable):

- all files for a given date

```
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh \
    --query "Contents[?contains(LastModified, '$DATE')]"
```

- all files after a certain date

```
# BSD/macOS date; on GNU/Linux use: WEEK_AGO=$(date -d '1 week ago' +%F)
WEEK_AGO=$(date -v-1w +%F)
aws s3api list-objects-v2 --bucket test-bucket-fh \
    --query "Contents[?LastModified > '$WEEK_AGO']"
```
s3api returns several metadata fields for each object, so you can project out specific elements, for example just the key names:

```
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh \
    --query "Contents[?contains(LastModified, '$DATE')].Key"
```
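The JMESPath filtering above happens client-side on the listing output, so the same selection can be sketched in plain Python over the parsed Contents array. The sample entries below are hypothetical; real ones come from the CLI's JSON output or boto3's list_objects_v2:

```python
# Hypothetical sample of the "Contents" array returned by list-objects-v2.
contents = [
    {"Key": "images/user1/a.jpg", "LastModified": "2019-03-18T09:15:00.000Z"},
    {"Key": "images/user1/b.jpg", "LastModified": "2019-03-19T10:30:00.000Z"},
    {"Key": "images/user1/c.jpg", "LastModified": "2019-03-19T14:45:00.000Z"},
]

def on_date(objects, date_str):
    """Keys whose LastModified falls on the given YYYY-MM-DD date
    (the equivalent of the contains(LastModified, ...) query)."""
    return [o["Key"] for o in objects if o["LastModified"].startswith(date_str)]

def newest(objects):
    """Key of the most recently modified object; ISO-8601 timestamps
    sort correctly as plain strings."""
    return max(objects, key=lambda o: o["LastModified"])["Key"]

print(on_date(contents, "2019-03-19"))  # ['images/user1/b.jpg', 'images/user1/c.jpg']
print(newest(contents))                 # images/user1/c.jpg
```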
OLD ANSWER
AWS-SDK/CLI really should implement some sort of retrieve-by-date flag, it would make life easier and cheaper.
If you have not prefixed/labelled your files with dates, you may also want to try the --start-after (string) flag. If you know the latest file you want to start listing from, you can use the list-objects-v2 command with --start-after:

"StartAfter is where you want Amazon S3 to start listing from. Amazon S3 starts listing after this specified key. StartAfter can be any key in the bucket"

By itself --start-after will keep returning objects until the end of the bucket, so if you would like to limit the number of items, also specify the --max-items flag.
https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects-v2.html
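StartAfter works on simple lexicographic key ordering, which is why date-prefixed key names pair so well with it. A small Python sketch of what --start-after plus --max-items effectively returns (the keys below are made up for illustration):

```python
# Hypothetical keys; S3 lists keys in lexicographic (UTF-8 binary) order.
keys = sorted([
    "images/user1/2016-01-10_a.jpg",
    "images/user1/2016-01-11_b.jpg",
    "images/user1/2016-01-12_c.jpg",
    "images/user1/2016-01-13_d.jpg",
])

def list_objects(keys, start_after="", max_items=1000):
    """Mimic list-objects-v2 semantics: keys strictly after start_after,
    capped at max_items."""
    return [k for k in keys if k > start_after][:max_items]

print(list_objects(keys, start_after="images/user1/2016-01-11_b.jpg", max_items=2))
# ['images/user1/2016-01-12_c.jpg', 'images/user1/2016-01-13_d.jpg']
```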
Solution 2
S3 can list all objects in a bucket, or all objects with a common prefix (such as a "directory"). However, this isn't a cheap operation, and it's certainly not designed to be done on every request.
Generally speaking, you are best served by a database layer for this. It can be something light and fast (like redis), but you should know what objects you have and which one you need for a given request.
You can somewhat cheat by copying objects twice: for instance, to both a unique key and to images/latest.jpg or images/user1/latest.jpg. But for the "date query" example, you should certainly do this externally to S3.
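One way to sketch that external index: keep a sorted mapping from creation time to key. In production this might be a redis sorted set (ZADD on upload, ZRANGEBYSCORE to query); the in-memory stand-in below just illustrates the idea, with hypothetical timestamps and keys:

```python
import bisect

class DateIndex:
    """Toy stand-in for an external index such as a redis sorted set:
    (timestamp, key) pairs kept sorted by timestamp, queried by range."""

    def __init__(self):
        self._entries = []  # sorted list of (timestamp, key)

    def add(self, timestamp, key):
        # Record the object at upload time, keeping the list sorted.
        bisect.insort(self._entries, (timestamp, key))

    def latest(self):
        # Most recently created key.
        return self._entries[-1][1]

    def between(self, start, end):
        # All keys created in [start, end], inclusive.
        lo = bisect.bisect_left(self._entries, (start, ""))
        hi = bisect.bisect_right(self._entries, (end, "\uffff"))
        return [key for _, key in self._entries[lo:hi]]

idx = DateIndex()
idx.add("2016-01-10T08:00:00Z", "images/user1/a.jpg")
idx.add("2016-01-12T09:30:00Z", "images/user1/c.jpg")
idx.add("2016-01-11T12:00:00Z", "images/user1/b.jpg")

print(idx.latest())  # images/user1/c.jpg
print(idx.between("2016-01-11T00:00:00Z", "2016-01-12T23:59:59Z"))
# ['images/user1/b.jpg', 'images/user1/c.jpg']
```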
Solution 3
You could store the files prefixed by date in the final directory, e.g.:
images/user1/2016-01-12_{actual file name}
Then, in the script that does the querying, you can generate the list of dates in the time period, construct the prefixes accordingly, query S3 for each date separately, and merge the results. This should be much faster than fetching the full list and filtering on the LastModified field (though it depends how many files you have in the given directory; anything less than about 1000 is probably not worth the effort).
There is actually a better method using the 'Marker' parameter of the listObjects call: set the marker to a key, and listObjects will return only keys which come after that one in the directory. We do have dates and times in our key names.
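A sketch of the per-prefix approach, using the images/user1/YYYY-MM-DD_{filename} naming scheme above (bucket and directory names are hypothetical). Generating the per-day prefixes is the pure-logic part; each prefix would then go into its own listing call:

```python
from datetime import date, timedelta

def date_prefixes(directory, start, end):
    """Yield one S3 prefix per day in [start, end], matching keys named
    like images/user1/2016-01-12_{actual file name}."""
    d = start
    while d <= end:
        yield f"{directory}{d.isoformat()}_"
        d += timedelta(days=1)

prefixes = list(date_prefixes("images/user1/", date(2016, 1, 10), date(2016, 1, 12)))
print(prefixes)
# ['images/user1/2016-01-10_', 'images/user1/2016-01-11_', 'images/user1/2016-01-12_']

# Each prefix would then feed a separate listing request, e.g. with boto3:
#   s3.list_objects_v2(Bucket="my-bucket", Prefix=prefix)
# and the per-day results concatenated.
```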
zebra

Updated on August 21, 2022

Comments

- zebra over 1 year
I want to query items from S3 within a specific subdirectory in a bucket by the date/time that they were added to S3. I haven't been able to find any explicit documentation around this, so I'm wondering how it can be accomplished?
The types of queries I want to perform look like this:

- Return the URL of the most recently created file in S3 bucket images under the directory images/user1/
- Return the URLs of all items created between datetime X and datetime Y in the S3 bucket images under the directory images/user1