Complete scan of DynamoDB with boto3
Solution 1
I think the Amazon DynamoDB documentation on table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response and, while it is present, pass it as ExclusiveStartKey to the next scan. Here is an example using your code:
import boto3

# Credentials and region are assumed to be defined elsewhere, as in the question.
dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

# Keep scanning until the response no longer carries a LastEvaluatedKey.
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
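The same loop can be wrapped in a generator so that callers see a flat stream of items. This is a minimal sketch and not part of the original answer: `scan_all_items` is a hypothetical helper name, and it accepts any scan-like callable (for example `table.scan`) so that it can be tried without AWS access. `fake_scan` is a made-up stand-in used only for illustration.

```python
def scan_all_items(scan, **scan_kwargs):
    """Yield every item from a paginated scan-like callable.

    `scan` is assumed to behave like boto3's Table.scan: it returns a dict
    with an 'Items' list and, while more pages remain, a 'LastEvaluatedKey'.
    """
    while True:
        response = scan(**scan_kwargs)
        yield from response.get('Items', [])
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            break
        # Resume the next request where the previous page stopped.
        scan_kwargs['ExclusiveStartKey'] = last_key


def fake_scan(ExclusiveStartKey=None, page_size=2):
    """Hypothetical stand-in for table.scan over five in-memory rows."""
    rows = [{'id': n} for n in range(5)]
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey['id'] + 1
    page = rows[start:start + page_size]
    response = {'Items': page}
    if start + page_size < len(rows):
        response['LastEvaluatedKey'] = {'id': page[-1]['id']}
    return response


items = list(scan_all_items(fake_scan))
print(len(items))  # 5
```

With a real table, the call would presumably look like `list(scan_all_items(table.scan))`.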
Solution 2
boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:
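import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

# TableName is required by scan; 'your-table-name' is a placeholder.
for page in paginator.paginate(TableName='your-table-name'):
    items = page['Items']
    # do something with items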
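Each page yielded by the paginator is a full response dict, so the items still have to be pulled out of `page['Items']`. A small sketch of that step, assuming a hypothetical helper named `iter_items` and using plain dicts in place of real pages:

```python
def iter_items(pages):
    """Yield the items from each page of a paginated response."""
    for page in pages:
        yield from page.get('Items', [])


# Stand-in pages shaped like low-level client scan responses (illustrative data).
sample_pages = [
    {'Items': [{'id': {'N': '1'}}, {'id': {'N': '2'}}], 'Count': 2},
    {'Items': [{'id': {'N': '3'}}], 'Count': 1},
]

ids = [item['id']['N'] for item in iter_items(sample_pages)]
print(ids)  # ['1', '2', '3']
```

With a real client, the argument would be the page iterator itself, e.g. `iter_items(paginator.paginate(TableName='your-table-name'))`.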
Solution 3
DynamoDB limits each call to the scan method to 1 MB of data.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop that gets all the data from a DynamoDB table using LastEvaluatedKey:
import boto3

client = boto3.client('dynamodb')


def dump_table(table_name):
    results = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = client.scan(
                TableName=table_name,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = client.scan(TableName=table_name)
        last_evaluated_key = response.get('LastEvaluatedKey')

        results.extend(response['Items'])

        if not last_evaluated_key:
            break
    return results


# Usage
data = dump_table('your-table-name')

# do something with data
Solution 4
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

# With the low-level client, expression values use the typed format ({'S': ...}).
operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'bar > :x AND bar < :y',
    'ExpressionAttributeValues': {
        ':x': {'S': '2017-01-31T01:35'},
        ':y': {'S': '2017-01-31T02:08'},
    }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    items = page['Items']
    # do something with items
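One consequence worth keeping in mind: DynamoDB applies the FilterExpression after a page has been read, so a page can come back with few or no items while LastEvaluatedKey is still set. The simulation below only illustrates that behavior; `fake_filtered_scan`, its parameters, and its data are made up for the example and are not a boto3 API.

```python
ROWS = [{'id': n, 'bar': n * 10} for n in range(6)]  # illustrative data


def fake_filtered_scan(low, high, ExclusiveStartKey=None, page_size=2):
    """Hypothetical stand-in for a filtered scan: page first, filter second."""
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey['id'] + 1
    page = ROWS[start:start + page_size]
    response = {
        # The filter only sees the rows already read for this page.
        'Items': [row for row in page if low < row['bar'] < high],
        'ScannedCount': len(page),
    }
    if start + page_size < len(ROWS):
        response['LastEvaluatedKey'] = {'id': page[-1]['id']}
    return response


matches, page_sizes, start_key = [], [], None
while True:
    response = fake_filtered_scan(35, 60, ExclusiveStartKey=start_key)
    matches.extend(response['Items'])
    page_sizes.append(len(response['Items']))
    start_key = response.get('LastEvaluatedKey')
    if not start_key:
        break

print(page_sizes)  # [0, 0, 2]
```

The first two pages are empty after filtering, yet the loop has to keep going to reach the matching rows. This is why the stopping condition should be the absence of LastEvaluatedKey, not an empty Items list.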
Solution 5
Code for stripping the DynamoDB type descriptors from the results, as @kungphu mentioned:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer=TypeDeserializer())

# Pass your usual Query parameters (TableName, KeyConditionExpression, ...) to paginate().
for page in paginator.paginate():
    # Converts the typed attribute values in the page to Python types, in place.
    trans.inject_attribute_value_output(page, service_model)
CJ_Spaz
Updated on July 08, 2022
Comments
-
CJ_Spaz almost 2 years
My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process that loops through the table, but I'm not sure how to make each batch start where the previous one left off.
Is there some way to filter my scan? From what I read, filtering occurs after loading, and the loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')
data = table.scan()
-
kungphu almost 8 years
Note that the items in page['Items'] may not be what you're expecting: since this paginator is painfully generic, what you'll get back for each DynamoDB item is a dictionary of format type: value, e.g. {'myAttribute': {'M': {}}, 'yourAttribute': {'N': u'132457'}} for a row with an empty map and a numeric type (which is returned as a string that needs to be cast; I suggest decimal.Decimal for this since it already takes a string and will handle non-integer numbers). Other types, e.g. strings, maps, and booleans, are converted to their Python types by boto.
-
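To make the shape described in kungphu's comment concrete, here is a small sketch using the example item from that comment. The helper `unwrap_number` is hypothetical and handles only the 'N' type; boto3's TypeDeserializer (used in Solution 5) is the general-purpose tool for this.

```python
from decimal import Decimal

# Item in the low-level "type: value" format from the comment above.
item = {'myAttribute': {'M': {}}, 'yourAttribute': {'N': '132457'}}


def unwrap_number(attribute):
    """Convert a {'N': '<number as string>'} attribute to Decimal (numbers only)."""
    return Decimal(attribute['N'])


value = unwrap_number(item['yourAttribute'])
print(value + 1)  # 132458
```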
MuntingInsekto almost 8 years
Is it possible to have a scan filter or FilterExpression with pagination?
-
kungphu over 7 years
While this may work, note that the boto3 documentation states: If LastEvaluatedKey is empty, then the "last page" of results has been processed and there is no more data to be retrieved. So the test I'm using is while response.get('LastEvaluatedKey') rather than while 'LastEvaluatedKey' in response, just because "is empty" doesn't necessarily mean "isn't present," and this works in either case.
-
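The difference between the two tests can be seen on plain dicts. This is only an illustration with made-up responses; whether DynamoDB ever returns an empty LastEvaluatedKey instead of omitting it is not established in this thread.

```python
# Three response shapes: key missing, key present but empty, key with a value.
responses = [
    {'Items': []},
    {'Items': [], 'LastEvaluatedKey': {}},
    {'Items': [], 'LastEvaluatedKey': {'id': 'abc'}},
]

by_membership = ['LastEvaluatedKey' in r for r in responses]
by_truthiness = [bool(r.get('LastEvaluatedKey')) for r in responses]

print(by_membership)  # [False, True, True]
print(by_truthiness)  # [False, False, True]
```

Only the truthiness test treats "present but empty" as the end of the scan.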
Bruce Edge about 7 years
Paginators would be great, if it weren't for the issue @kungphu raised. I don't see the use for something that does one useful thing, but negates it by polluting the response data with irrelevant metadata.
-
Bruce Edge about 7 years
Bravo! This negates my earlier comment above about the lack of usefulness of paginators. Thanks! Why is this not the default behavior?
-
iuriisusuk about 6 years
A paginator is a more convenient way to iterate through queried/scanned items.
-
John Jang about 5 years
@kungphu With response.get('LastEvaluatedKey') I got None, so the condition can't be applied to the while loop.
-
kungphu about 5 years
@John_J You could use while True: and then if not response.get('LastEvaluatedKey'): break or something similar. You could also put your processing in a function, call it, and then use the while response.get(...): above to call it again to process subsequent pages. You basically just need to emulate do... while, which does not explicitly exist in Python.
-
Hephaestus almost 5 years
Why not use: while response.get('LastEvaluatedKey', False)?
-
Dan Hook almost 5 years
I had issues with LastEvaluatedKey being transformed and that messed up the paginator.
-
bruce szalwinski over 4 years
Small change to the example to show table arg: all_items = list(iterate_paged_results(table.scan, ProjectionExpression = 'my_field'))
-
D.Tate about 4 years
@kungphu @Bruce, curious if y'all are aware of any recent improvements for this "polluted" dictionary approach? I'm thinking of switching back to resource instead of client, and just using the LastEvaluatedKey approach. It just feels like too much to have to paginate and then have to parse out the response.
-
D.Tate about 4 years
Ah, never mind... I think I found my answer in the form of TypeDeserializer (stackoverflow.com/a/46738251/923817). Sweet!
-
kungphu about 4 years
@D.Tate Glad you found your solution. My work lately is all in Clojure, and the libraries are much less obtuse (though it only gets so good, working with Amazon's APIs). :) And thank you for linking that here, for others who might find this question later!
-
DenCowboy about 4 years
What does a contains filter look like?
-
Abe Voelker about 4 years
@DenCowboy I think the FilterExpression would just look like 'FilterExpression': 'contains(Color, :x)'. See the CLI example here: docs.aws.amazon.com/amazondynamodb/latest/developerguide/…
-
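Put into the same parameter structure as Solution 4, the contains filter from Abe Voelker's comment might look like the sketch below. The table name 'foo', the attribute 'Color', and the value 'Red' are placeholders, not taken from a real table.

```python
# Parameters for a scan paginator using a contains() filter.
# 'foo', 'Color', and 'Red' are placeholder values for illustration.
operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'contains(Color, :x)',
    'ExpressionAttributeValues': {
        ':x': {'S': 'Red'},
    },
}

# With a real client this dict would be passed as:
#   paginator.paginate(**operation_parameters)
```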
demosito almost 4 years
I actually like this solution the most. It combines the simplicity of items access and abstracts the pagination away. The only complaint I have is that it's a bit overengineered: the same could be done with a single function and without functools, by just yielding each item from response["Items"] both times.
-
Jon H over 3 years
@Hephaestus That would work as well, but it's not necessary. .get returns None by default if the requested key is not there, and None evaluates to False. This can be confirmed by running bool({}.get('test')).
-
captainblack over 3 years
This was perfect!