Iterating through all items in a DynamoDB table
Short answer
You are not doing anything wrong.
Long answer
This is closely related to the way Amazon computes the capacity unit. First, it is extremely important to understand that:
capacity units == reserved computational units
capacity units != reserved network transit
Well, even that is not strictly speaking exact, but it is quite close, especially when it comes to Scan.
During a Scan operation, there is a fundamental distinction between:

- scanned items: their cumulated size is at most 1MB, and may be below that size if a limit is reached first
- returned items: all the matching items among the scanned items

As the capacity unit is a compute unit, you pay for the scanned items. Well, actually, you pay for the cumulated size of the scanned items. Beware that this size includes all the storage and index overhead: 0.5 capacity units per cumulated KB.
The scanned size does not depend on any filter, be it a field selector or a result filter.
From your results, I would guess that your items require ~10KB each, which your comment on their actual payload size tends to confirm.
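As a back-of-envelope check, the ~10KB estimate can be read straight off the reported numbers. This is a sketch assuming the 0.5 capacity units per cumulated KB figure above; `implied_item_kb` is a hypothetical helper, not part of any API:

```python
CAPACITY_PER_KB = 0.5  # capacity units charged per cumulated KB scanned (figure from above)

def implied_item_kb(consumed_capacity, scanned_count):
    """Average item size (KB, including overhead) implied by one Scan page."""
    return consumed_capacity / CAPACITY_PER_KB / scanned_count

# First Scan page in the question's output: 96 items scanned for 517.5 capacity units
print(implied_item_kb(517.5, 96))  # ~10.8 KB per item, in line with the ~10KB guess
```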
Another example
I have a test table which contains only very small items. A Scan consumes only 1.0 capacity unit to retrieve 100 items because their cumulated size is below 2KB.
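For completeness, the LastEvaluatedKey pagination pattern itself can be sketched independently of any particular client library. This is illustrative code, not boto's API; `scan_page` is a hypothetical callable standing in for one Scan request that returns a response-shaped dict:

```python
def scan_all(scan_page):
    """Yield every item, following LastEvaluatedKey across Scan pages.

    `scan_page` takes an optional exclusive start key and returns a dict
    with 'Items' and, while pages remain, 'LastEvaluatedKey' (the shape
    of a DynamoDB Scan response).
    """
    start_key = None
    while True:
        page = scan_page(start_key)
        for item in page.get('Items', []):
            yield item
        start_key = page.get('LastEvaluatedKey')
        if start_key is None:  # no more pages
            break
```

With boto3, for instance, `scan_page` could wrap `table.scan(...)`, passing the key through as `ExclusiveStartKey` when it is not None.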
ensnare (updated on September 14, 2022):
I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process but am doing this one-time to build an index table.)
I understand that DynamoDB's scan() function returns the lesser of 1MB or a supplied limit. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" result and re-queries starting from the LastEvaluatedKey to get all the results.
Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating up my allocated read units. It's extremely slow.
Here is my code:
```python
def search(self, table, scan_filter=None, range_key=None,
           attributes_to_get=None, limit=None):
    """Scan a table for values and return a list of items."""
    start_key = None
    num_results = 0
    total_results = []
    loop_iterations = 0
    request_limit = limit
    while num_results < limit:
        results = self.conn.layer1.scan(table_name=table,
                                        attributes_to_get=attributes_to_get,
                                        exclusive_start_key=start_key,
                                        limit=request_limit)
        num_results = num_results + len(results['Items'])
        start_key = results['LastEvaluatedKey']
        total_results = total_results + results['Items']
        loop_iterations = loop_iterations + 1
        request_limit = request_limit - results['Count']
        print "Count: " + str(results['Count'])
        print "Scanned Count: " + str(results['ScannedCount'])
        print "Last Evaluated Key: " + str(results['LastEvaluatedKey']['HashKeyElement']['S'])
        print "Capacity: " + str(results['ConsumedCapacityUnits'])
        print "Loop Iterations: " + str(loop_iterations)
    return total_results
```
Calling the function:
```python
db = DB()
results = db.search(table='media', limit=500, attributes_to_get=['id'])
```
And my output:
```
Count: 96
Scanned Count: 96
Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
Capacity: 517.5
Loop Iterations: 1
Count: 109
Scanned Count: 109
Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
Capacity: 516.5
Loop Iterations: 2
Count: 104
Scanned Count: 104
Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
Capacity: 516.0
Loop Iterations: 3
Count: 104
Scanned Count: 104
Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
Capacity: 513.0
Loop Iterations: 4
Count: 100
Scanned Count: 100
Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
Capacity: 516.5
Loop Iterations: 5
```
Is this expected behavior? Or, what am I doing wrong?