Python memory consumption: dict VS list of tuples

Solution 1

Your list of tuples adds an extra layer. You have 3 layers of items:

  • The outer list of length 1 million, so 1 million pointers
    • 1 million 2-slot tuples, so 2 million pointers
      • 2 million references to 1 million integer values

while your dict only holds:

  • The dict (including 1 million cached hashes) with 2 million pointers + extra space to grow the table
    • 2 million references to 1 million integer values

It's those 1 million tuples plus the list to hold the references to them that take up more memory than the 1 million cached hashes. There are some 50% more pointers involved here, easily accounting for the 50% more memory use you see.
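
You can add up those layers yourself with sys.getsizeof to see where the extra memory goes. Both structures share the same 1 million integer objects, so the rough sketch below (not from the original answer; exact byte counts depend on your Python build) only counts the container layers:

import sys

n = 1000000
pairs = [(x, x) for x in xrange(n)]
d = dict(pairs)

# Layer 1: the outer list object (1 million pointers, plus over-allocation)
list_size = sys.getsizeof(pairs)
# Layer 2: one 2-slot tuple per entry (2 million pointers in total)
tuple_sizes = sum(sys.getsizeof(t) for t in pairs)
# The dict replaces both layers with a single hash table holding
# 2 million pointers plus the cached hashes and spare slots for growth.
dict_size = sys.getsizeof(d)

print "list object:      ", list_size
print "tuples:           ", tuple_sizes
print "list + tuples:    ", list_size + tuple_sizes
print "dict (table only):", dict_size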

There is another downside to your list of tuples: lookup time. To find a matching key in the dict, there is an O(1) complexity cost. To do the same in the list of tuples, you potentially have to scan the whole list, an O(n) cost. Don't use a list of tuples if you need to map keys to values.
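
For example (a small sketch added for illustration, assuming you actually did need key lookups; the names are made up):

pairs = [(x, x * 2) for x in xrange(1000000)]
mapping = dict(pairs)

# dict: one hash computation and (usually) one probe, O(1) on average
value = mapping[999999]

# list of tuples: may have to scan all 1 million entries, O(n)
value = next(v for k, v in pairs if k == 999999)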

Solution 2

You're actually getting an incomplete picture of memory use in this case. The total size of a dictionary more than doubles at irregular intervals, and if you compare the sizes of these two structures right after the dictionary has grown, the dictionary comes out bigger again. A simple script with a recursive size function (see the code below) shows a pretty clear pattern:

i:  2  list size:  296  dict size:  328  difference:  -32
i:  3  list size:  392  dict size:  352  difference:  40
i:  4  list size:  488  dict size:  376  difference:  112
i:  5  list size:  616  dict size:  400  difference:  216
i:  7  list size:  808  dict size:  1216  difference:  -408
i:  10  list size:  1160  dict size:  1288  difference:  -128
i:  13  list size:  1448  dict size:  1360  difference:  88
i:  17  list size:  1904  dict size:  1456  difference:  448
i:  23  list size:  2480  dict size:  3904  difference:  -1424
i:  31  list size:  3328  dict size:  4096  difference:  -768
i:  42  list size:  4472  dict size:  4360  difference:  112
i:  56  list size:  5912  dict size:  4696  difference:  1216
i:  74  list size:  7880  dict size:  5128  difference:  2752
i:  100  list size:  10520  dict size:  14968  difference:  -4448
i:  133  list size:  14024  dict size:  15760  difference:  -1736
i:  177  list size:  18672  dict size:  16816  difference:  1856

This pattern continues as i grows. (You can test this using your method -- try setting i near 2636744. The size of the dictionary is larger at that point, at least for me.) Martijn is right that the tuples in the list of tuples add to the memory overhead, canceling out the memory advantage of lists over dictionaries. But the result, on average, is not that the dictionary is better; it's that the dictionary is about the same. So in answer to your original question:

When you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?

It doesn't really matter if all you're concerned about is memory.

However, note that iterating over a dictionary is often a bit slower than iterating over a list, because there's no good way to avoid iterating over all the empty bins in the dictionary. So there's a bit of a tradeoff -- dictionaries are (much) faster at doing random key lookups, but lists are (a bit) faster at iteration. The dictionary will probably be better most of the time, but in some rare cases, the list may provide a micro-optimization.
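
A quick way to see that iteration tradeoff for yourself (a timeit sketch I'm adding for illustration; the exact numbers will vary by machine and Python version):

import timeit

setup = "pairs = [(x, x) for x in xrange(1000000)]; d = dict(pairs)"

# the list iterates over a densely packed array of tuple pointers
print timeit.timeit("for k, v in pairs: pass", setup=setup, number=10)
# the dict also has to skip over the empty bins in its hash table
print timeit.timeit("for k, v in d.iteritems(): pass", setup=setup, number=10)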


Here's the code that tests for size. It probably won't generate correct results for all corner cases, but it should handle simple structures like this without any problems. (But let me know if you see any issues.)

import sys, collections, itertools

def totalsize(x):
    # Total memory footprint of x, following references into containers.
    seen = set()
    return ts_rec(x, seen)

def ts_rec(x, seen):
    # Count each object only once, even if it is referenced multiple times.
    if id(x) in seen:
        return 0
    else:
        seen.add(id(x))

    x_size = sys.getsizeof(x)
    if isinstance(x, collections.Mapping):
        # Recurse into both keys and values.
        kv_chain = itertools.chain.from_iterable(x.iteritems())
        return x_size + sum(ts_rec(i, seen) for i in kv_chain)
    elif isinstance(x, collections.Sequence):
        # Lists and tuples: recurse into each item.
        return x_size + sum(ts_rec(i, seen) for i in x)
    else:
        return x_size

# Sample sizes at roughly logarithmically spaced values of i.
for i in (10 ** (e / 8.0) for e in range(3, 19)):
    i = int(i)
    lsize = totalsize([(x, x) for x in xrange(i)])
    dsize = totalsize(dict((x, x) for x in xrange(i)))

    print "i: ", i,
    print " list size: ", lsize, " dict size: ", dsize,
    print " difference: ", lsize - dsize

Comments

  • RayLuo about 4 years

    There are plenty of questions and discussions about the memory consumption of different Python data types. Yet few of them (if any) address a very specific scenario: when you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?

    At the beginning I thought a dict is more powerful than a list of tuples and that power must come at some price. And actually an empty dict DOES occupy more memory than an empty list or tuple (see In-memory size of a Python structure), so I thought using [(key1, value1), (key2, value2), ...] would be more memory-efficient than {key1: value1, key2: value2, ...}.
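
    (You can check that starting point yourself; the exact numbers depend on the build, but the empty dict is clearly the largest:)

    from sys import getsizeof
    # an empty dict is noticeably larger than an empty list or tuple
    print getsizeof({}), getsizeof([]), getsizeof(())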

    Looks like I was wrong. Just fire up the following code snippet and watch the memory consumption reported by your OS. I am using Windows XP, and its Task Manager tells me that a large dict eats up "only" 40 MB of RAM and 40 MB of virtual RAM, while a list of tuples eats up 60 MB of RAM and 60 MB of virtual RAM.

    How could that be?

    from sys import getsizeof as g
    raw_input('ready, press ENTER')
    i = 1000000
    #p = [(x, x) for x in xrange(i)] # Will print 4,348,736 40,348,736
    p = dict((x, x) for x in xrange(i)) # Will print 25,165,964 37,165,964
    print g(p), g(p) + sum(g(x) for x in p)
    raw_input("Check your process's memory consumption now, press ENTER to exit")
    

    Update:

    Thanks for some of the comments below. I want to clarify: I'm talking about memory efficiency. And no, in this case there is no need to worry about key-value lookup efficiency; let's just assume my algorithm will consume them one by one via an iterator.