Iterating through all items in a DynamoDB table


Short answer

You are not doing anything wrong.

Long answer

This is closely related to the way Amazon computes capacity units. First, it is extremely important to understand that:

  • capacity units == reserved computational units
  • capacity units != reserved network transit

Strictly speaking, even that is not exact, but it is quite close, especially when it comes to Scan.

During a Scan operation, there is a fundamental distinction between

  • scanned Items: their cumulative size is at most 1MB, and may be below that if the limit is reached first
  • returned Items: all the matching items among the scanned Items
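
The distinction can be illustrated with a small in-memory model. This is only a sketch of the behaviour described in this answer, not DynamoDB's actual implementation: `simulate_scan_page`, the `(name, size_bytes)` item tuples and the `matches` predicate are hypothetical names introduced for illustration.

```python
# Toy model of one Scan page; not the real DynamoDB implementation.
ONE_MB = 1024 * 1024  # page size cap described above


def simulate_scan_page(items, matches, limit=None, max_bytes=ONE_MB):
    """Return (scanned, returned) for one simulated Scan page.

    items   -- list of (name, size_bytes) tuples, in table order
    matches -- predicate standing in for a scan filter
    limit   -- optional cap on the number of scanned items
    """
    scanned = []
    scanned_bytes = 0
    for item in items:
        # Stop when the item limit is reached...
        if limit is not None and len(scanned) >= limit:
            break
        # ...or when adding the next item would exceed the size cap.
        if scanned and scanned_bytes + item[1] > max_bytes:
            break
        scanned.append(item)
        scanned_bytes += item[1]
    # The filter is applied only after the page has been read.
    returned = [item for item in scanned if matches(item)]
    return scanned, returned


# 300 hypothetical items of 10 KB each; the filter keeps names ending in "0".
table = [("item-%d" % i, 10 * 1024) for i in range(300)]
scanned, returned = simulate_scan_page(
    table, matches=lambda item: item[0].endswith("0"))
print(len(scanned), len(returned))
```

In this model the page stops after 102 scanned items (just under 1MB) while only 11 of them are returned, which is the gap between the two notions.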

Because the capacity unit is a compute unit, you pay for the scanned Items. More precisely, you pay for the cumulative size of the scanned items, at a rate of 0.5 capacity units per cumulative KB. Beware that this size includes all the storage and index overhead.

The scanned size does not depend on any filter, be it a field selector or a result filter.
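
As a rough sketch of this billing rule, assuming the flat rate of 0.5 capacity units per cumulative KB quoted in this answer and ignoring any rounding DynamoDB may apply (`scan_capacity` is a hypothetical helper, not part of any SDK):

```python
# Rate quoted in this answer; treated here as an assumption.
CAPACITY_PER_KB = 0.5


def scan_capacity(scanned_bytes):
    """Approximate capacity consumed by scanning `scanned_bytes` bytes."""
    return CAPACITY_PER_KB * scanned_bytes / 1024.0


# A full 1 MB page costs the same whether the filter keeps everything
# or nothing, because only the scanned size enters the formula.
full_page = scan_capacity(1024 * 1024)
print(full_page)
```

A full page comes out at 512 units under these assumptions, which is in the same range as the ~516 units per request reported in the question.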

From your results, I guess that your Items require ~10KB each, which your comment on their actual payload size tends to confirm.
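
The estimate can be reproduced from the first page of output in the question (Count 96, Capacity 517.5), again assuming the rate of 0.5 capacity units per KB and items of roughly equal size. The helper name `average_item_kb` is made up for this illustration.

```python
CAPACITY_PER_KB = 0.5  # rate quoted earlier in this answer (assumption)


def average_item_kb(consumed_capacity, scanned_count):
    """Estimate the average billed size of one item, in KB."""
    scanned_kb = consumed_capacity / CAPACITY_PER_KB
    return scanned_kb / scanned_count


# First page reported in the question: 96 items for 517.5 capacity units.
print(round(average_item_kb(517.5, 96), 1))
```

This gives about 10.8KB per item for the first page; the other pages in the question's output land in the same neighbourhood.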

Another example

I have a test table which contains only very small elements. A Scan consumes only 1.0 capacity unit to retrieve 100 Items because their cumulative size is below 2KB.

Author: ensnare

Updated on September 14, 2022

Comments

  • ensnare, over 1 year ago

    I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process but am doing this one-time to build an index table.)

    I understand that DynamoDB's scan() function returns the lesser of 1MB or a supplied limit. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" result and re-queries starting from the LastEvaluatedKey to get all the results.

    Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating up my allocated read units. It's extremely slow.

    Here is my code:

    def search(self, table, scan_filter=None, range_key=None,
               attributes_to_get=None,
               limit=None):
        """ Scan a table and return the matching
            items as a list.
        """

        start_key = None
        total_results = []
        loop_iterations = 0
        request_limit = limit

        while True:
            results = self.conn.layer1.scan(table_name=table,
                                      scan_filter=scan_filter,
                                      attributes_to_get=attributes_to_get,
                                      exclusive_start_key=start_key,
                                      limit=request_limit)
            total_results = total_results + results['Items']
            loop_iterations = loop_iterations + 1
            # LastEvaluatedKey is missing once the end of the table is reached.
            start_key = results.get('LastEvaluatedKey')

            print("Count: " + str(results['Count']))
            print("Scanned Count: " + str(results['ScannedCount']))
            print("Last Evaluated Key: " + str(start_key['HashKeyElement']['S'] if start_key else None))
            print("Capacity: " + str(results['ConsumedCapacityUnits']))
            print("Loop Iterations: " + str(loop_iterations))

            if start_key is None:
                break
            # With no limit, keep paging until the table is exhausted.
            if limit is not None:
                request_limit = limit - len(total_results)
                if request_limit <= 0:
                    break

        return total_results
    

    Calling the function:

    db = DB()
    results = db.search(table='media',limit=500,attributes_to_get=['id'])
    

    And my output:

    Count: 96
    Scanned Count: 96
    Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
    Capacity: 517.5
    Loop Iterations: 1
    Count: 109
    Scanned Count: 109
    Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
    Capacity: 516.5
    Loop Iterations: 2
    Count: 104
    Scanned Count: 104
    Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
    Capacity: 516.0
    Loop Iterations: 3
    Count: 104
    Scanned Count: 104
    Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
    Capacity: 513.0
    Loop Iterations: 4
    Count: 100
    Scanned Count: 100
    Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
    Capacity: 516.5
    Loop Iterations: 5
    

    Is this expected behavior? Or, what am I doing wrong?