Iterating through all items in a DynamoDB table
Short answer
You are not doing anything wrong.
Long answer
This is closely related to the way Amazon computes the capacity unit. First, it is extremely important to understand that:
capacity units == reserved computational units
capacity units != reserved network transit
Well, even that is not strictly speaking exact, but it is quite close, especially when it comes to Scan.
During a Scan operation, there is a fundamental distinction between:

- scanned items: their cumulated size is at most 1MB, and may be below that size if a limit is reached first
- returned items: all the matching items among the scanned items

As the capacity unit is a compute unit, you pay for the scanned items. Well, actually, you pay for the cumulated size of the scanned items. Beware that this size includes all the storage and index overhead: 0.5 capacity units per cumulated KB.
The scanned size does not depend on any filter, be it a field selector or a result filter.
From your results, I would guess that your items require ~10KB each, which your comment on their actual payload size tends to confirm.
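As a back-of-envelope check, the ~10KB estimate can be read straight off the reported numbers. This is a sketch assuming the 0.5 capacity units per cumulated KB figure above; `implied_item_kb` is a hypothetical helper, not part of any API:

```python
CAPACITY_PER_KB = 0.5  # capacity units charged per cumulated KB scanned (figure from above)

def implied_item_kb(consumed_capacity, scanned_count):
    """Average item size (KB, including overhead) implied by one Scan page."""
    return consumed_capacity / CAPACITY_PER_KB / scanned_count

# First Scan page in the question's output: 96 items scanned for 517.5 capacity units
print(implied_item_kb(517.5, 96))  # ~10.8 KB per item, in line with the ~10KB guess
```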
Another example
I have a test table which contains only very small items. A Scan consumes only 1.0 capacity unit to retrieve 100 items because their cumulated size is below 2KB.
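For completeness, the LastEvaluatedKey pagination pattern itself can be sketched independently of any particular client library. This is illustrative code, not boto's API; `scan_page` is a hypothetical callable standing in for one Scan request that returns a response-shaped dict:

```python
def scan_all(scan_page):
    """Yield every item, following LastEvaluatedKey across Scan pages.

    `scan_page` takes an optional exclusive start key and returns a dict
    with 'Items' and, while pages remain, 'LastEvaluatedKey' (the shape
    of a DynamoDB Scan response).
    """
    start_key = None
    while True:
        page = scan_page(start_key)
        for item in page.get('Items', []):
            yield item
        start_key = page.get('LastEvaluatedKey')
        if start_key is None:  # no more pages
            break
```

With boto3, for instance, `scan_page` could wrap `table.scan(...)`, passing the key through as `ExclusiveStartKey` when it is not None.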
ensnare (updated on September 14, 2022):
I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process but am doing this one-time to build an index table.)
I understand that DynamoDB's scan() function returns the lesser of 1MB or a supplied limit. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" result and re-queries starting from the LastEvaluatedKey to get all the results.
Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating up my allocated read units. It's extremely slow.
Here is my code:
```python
def search(self, table, scan_filter=None, range_key=None,
           attributes_to_get=None, limit=None):
    """Scan a table for values and return a list of items."""
    start_key = None
    num_results = 0
    total_results = []
    loop_iterations = 0
    request_limit = limit
    while num_results < limit:
        results = self.conn.layer1.scan(table_name=table,
                                        attributes_to_get=attributes_to_get,
                                        exclusive_start_key=start_key,
                                        limit=request_limit)
        num_results = num_results + len(results['Items'])
        start_key = results['LastEvaluatedKey']
        total_results = total_results + results['Items']
        loop_iterations = loop_iterations + 1
        request_limit = request_limit - results['Count']
        print "Count: " + str(results['Count'])
        print "Scanned Count: " + str(results['ScannedCount'])
        print "Last Evaluated Key: " + str(results['LastEvaluatedKey']['HashKeyElement']['S'])
        print "Capacity: " + str(results['ConsumedCapacityUnits'])
        print "Loop Iterations: " + str(loop_iterations)
    return total_results
```
Calling the function:
```python
db = DB()
results = db.search(table='media', limit=500, attributes_to_get=['id'])
```
And my output:
```
Count: 96
Scanned Count: 96
Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
Capacity: 517.5
Loop Iterations: 1
Count: 109
Scanned Count: 109
Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
Capacity: 516.5
Loop Iterations: 2
Count: 104
Scanned Count: 104
Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
Capacity: 516.0
Loop Iterations: 3
Count: 104
Scanned Count: 104
Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
Capacity: 513.0
Loop Iterations: 4
Count: 100
Scanned Count: 100
Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
Capacity: 516.5
Loop Iterations: 5
```
Is this expected behavior? Or, what am I doing wrong?