Complete scan of DynamoDB with boto3

106,929

Solution 1

I think the Amazon DynamoDB documentation regarding table scanning answers your question.

In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:

import boto3
dynamodb = boto3.resource('dynamodb',
                          aws_session_token=aws_session_token,
                          aws_access_key_id=aws_access_key_id,
                          aws_secret_access_key=aws_secret_access_key,
                          region_name=region
)

table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
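
The same loop can be wrapped in a generator so that items are handled as they arrive instead of being collected into one list first, which may matter for a table with hundreds of thousands of records. What follows is a minimal sketch, not part of the original answer: iter_scan_items is a hypothetical helper name, and it accepts any scan-style callable (for example table.scan from the code above) plus keyword arguments such as FilterExpression. It treats an empty or missing LastEvaluatedKey as the end of the table, as suggested in the comments. A page that comes back empty because a filter removed every item is simply skipped, and the loop moves on to the next page.

```python
from typing import Any, Callable, Dict, Iterator


def iter_scan_items(
    scan: Callable[..., Dict[str, Any]], **scan_kwargs: Any
) -> Iterator[Any]:
    """Yield every item from a paginated scan, fetching one page at a time.

    `scan` is assumed to behave like boto3's Table.scan: it returns a dict
    with an 'Items' list and, when more pages remain, a 'LastEvaluatedKey'.
    """
    kwargs = dict(scan_kwargs)  # copy so the caller's arguments are not mutated
    while True:
        response = scan(**kwargs)
        yield from response.get('Items', [])
        last_key = response.get('LastEvaluatedKey')
        if not last_key:  # absent or empty: no more pages
            break
        kwargs['ExclusiveStartKey'] = last_key


# Hypothetical usage with the resource-level table from the example above:
#     for item in iter_scan_items(table.scan):
#         print(item)
```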

Solution 2

boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

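# The low-level client needs TableName on every scan call.
for page in paginator.paginate(TableName='widgetsTableName'):
    for item in page['Items']:
        print(item)  # do something with each item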

Solution 3

DynamoDB limits each scan call to 1 MB of data, so a single call returns only part of a larger table.

Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan

Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:

import boto3
client = boto3.client('dynamodb')

def dump_table(table_name):
    """Return every item in the table by following LastEvaluatedKey."""
    results = []
    scan_kwargs = {'TableName': table_name}
    while True:
        response = client.scan(**scan_kwargs)
        results.extend(response['Items'])
        # An absent or empty LastEvaluatedKey means this was the last page.
        last_evaluated_key = response.get('LastEvaluatedKey')
        if not last_evaluated_key:
            break
        scan_kwargs['ExclusiveStartKey'] = last_evaluated_key
    return results

# Usage
data = dump_table('your-table-name')

# do something with data

Solution 4

Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_parameters = {
  'TableName': 'foo',
  'FilterExpression': 'bar > :x AND bar < :y',
  'ExpressionAttributeValues': {
    ':x': {'S': '2017-01-31T01:35'},
    ':y': {'S': '2017-01-31T02:08'},
  }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    for item in page['Items']:
        print(item)  # do something with each matching item

Solution 5

Here is code for stripping the DynamoDB type descriptors (the {'N': ...} style wrappers) that @kungphu mentioned.

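import copy

import boto3
from boto3.dynamodb.transform import TransformationInjector
from boto3.dynamodb.types import TypeDeserializer

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client.meta.service_model.operation_model('Scan')
trans = TransformationInjector(deserializer=TypeDeserializer())
for page in paginator.paginate(TableName='foo'):
    # Transform a copy: rewriting the page in place also rewrites
    # LastEvaluatedKey, which can confuse the paginator.
    plain = copy.deepcopy(page)
    trans.inject_attribute_value_output(plain, operation_model)
    # plain['Items'] now holds ordinary Python values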

Author by

CJ_Spaz

Updated on July 08, 2022

Comments

  • CJ_Spaz
    CJ_Spaz almost 2 years

    My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process and looped through, but I'm not sure how I can set each batch to start where the previous one left off.

    Is there some way to filter my scan? From what I read, filtering occurs after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.

    Any assistance would be appreciated.

    import boto3
    dynamodb = boto3.resource('dynamodb',
        aws_session_token = aws_session_token,
        aws_access_key_id = aws_access_key_id,
        aws_secret_access_key = aws_secret_access_key,
        region_name = region
        )
    
    table = dynamodb.Table('widgetsTableName')
    
    data = table.scan()
    
  • kungphu
    kungphu almost 8 years
    Note that the items in page['Items'] may not be what you're expecting: Since this paginator is painfully generic, what you'll get back for each DynamoDB item is a dictionary of format type: value, e.g. {'myAttribute': {'M': {}}, 'yourAttribute': {'N': u'132457'}} for a row with an empty map and a numeric type (which is returned as a string that needs to be cast; I suggest decimal.Decimal for this since it already takes a string and will handle non-integer numbers). Other types, e.g. strings, maps, and booleans, are converted to their Python types by boto.
  • MuntingInsekto
    MuntingInsekto almost 8 years
    Is it possible to have a ScanFilter or FilterExpression with pagination?
  • kungphu
    kungphu over 7 years
    While this may work, note that the boto3 documentation states If LastEvaluatedKey is empty, then the "last page" of results has been processed and there is no more data to be retrieved. So the test I'm using is while response.get('LastEvaluatedKey') rather than while 'LastEvaluatedKey' in response, just because "is empty" doesn't necessarily mean "isn't present," and this works in either case.
  • Bruce Edge
    Bruce Edge about 7 years
    paginators would be great, if it weren't for the issue @kungphu raised. I don't see the use for something that does one useful thing, but negates it by polluting the response data with irrelevant metadata
  • Bruce Edge
    Bruce Edge about 7 years
    Bravo! negates my earlier comment above about the lack of usefulness of paginators. thanks! Why is this not the default behavior?
  • iuriisusuk
    iuriisusuk about 6 years
    The paginator is a more convenient way to iterate through queried/scanned items.
  • John Jang
    John Jang about 5 years
    @kungphu With response.get('LastEvaluatedKey') I got None, so I can't use it as the condition of the while loop.
  • kungphu
    kungphu about 5 years
    @John_J You could use while True: and then if not response.get('LastEvaluatedKey'): break or something similar. You could also put your processing in a function, call it, and then use the while response.get(...): above to call it again to process subsequent pages. You basically just need to emulate do... while, which does not explicitly exist in Python.
  • Hephaestus
    Hephaestus almost 5 years
    Why not use: while response.get('LastEvaluatedKey', False)?
  • Dan Hook
    Dan Hook almost 5 years
    I had issues with LastEvaluatedKey being transformed and that messed up the paginator.
  • bruce szalwinski
    bruce szalwinski over 4 years
    Small change to the example to show table arg, all_items = list(iterate_paged_results(table.scan, ProjectionExpression = 'my_field'))
  • D.Tate
    D.Tate about 4 years
    @kungphu @Bruce, curious if yall are aware of any recent improvements for this "polluted" dictionary approach ? I'm thinking of switching back to resource instead of client, and just using LastEvaluatedKey approach .. it just feels like too much to have to paginate and then have to parse out the response
  • D.Tate
    D.Tate about 4 years
    Ah nevermind... I think I found my answer in the form of TypeDeserializer (stackoverflow.com/a/46738251/923817). Sweet!
  • kungphu
    kungphu about 4 years
    @D.Tate Glad you found your solution. My work lately is all in Clojure, and the libraries are much less obtuse (though it only gets so good, working with Amazon's APIs). :) And thank you for linking that here, for others who might find this question later!
  • DenCowboy
    DenCowboy about 4 years
    what represents a contain filter?
  • Abe Voelker
    Abe Voelker about 4 years
    @DenCowboy I think the FilterExpression would just look like 'FilterExpression': 'contains(Color, :x)'. See the CLI example here: docs.aws.amazon.com/amazondynamodb/latest/developerguide/…
  • demosito
    demosito almost 4 years
    I actually like this solution the most. It combines the simplicity of items access and abstracts the pagination away. The only complaint I have is that it's overengineered a bit, the same could be done with a single function and without functools - just yield each item from response["Items"] both times.
  • Jon H
    Jon H over 3 years
    @Hephaestus That would work as well, but it's not necessary. .get returns None by default if the requested key is not there, and None evaluates to False. This can be confirmed by running bool({}.get('test'))
  • captainblack
    captainblack over 3 years
    This was perfect!