Why are hash table expansions usually done by doubling the size?

Solution 1

Hash tables could not claim "amortized constant time insertion" if, for instance, resizing were done by a constant increment. In that case the cost of each resize (which grows with the size of the hash table) would make the amortized cost of one insertion linear in the total number of elements inserted. Because resizing becomes more and more expensive as the table grows, it has to happen "less and less often" to keep the amortized cost of insertion constant.
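
To make the difference concrete, here is a small sketch that only counts the element copies caused by resizing under the two policies; the initial capacity of 16 and the fixed increment of 16 are arbitrary illustrative choices:

```cpp
#include <cstddef>
#include <iostream>

// Count the element copies caused by resizing while inserting n elements,
// under two policies: doubling the capacity, or adding a fixed increment.
// Each resize copies (rehashes) every element currently stored.
std::size_t copies_when_growing(std::size_t n, bool doubling,
                                std::size_t increment = 16) {
    std::size_t capacity = 16, size = 0, copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {                       // table is full: resize
            copies += size;                           // every element is moved
            capacity = doubling ? capacity * 2 : capacity + increment;
        }
        ++size;
    }
    return copies;
}

int main() {
    const std::size_t n = 1000000;
    std::cout << "doubling:        " << copies_when_growing(n, true)  << " copies\n";
    std::cout << "fixed increment: " << copies_when_growing(n, false) << " copies\n";
    // Doubling does O(n) copies in total (amortized O(1) per insertion);
    // a fixed increment does roughly n*n / (2*increment) copies.
}
```

For a million insertions the doubling policy copies roughly a million elements in total, while the fixed-increment policy copies on the order of tens of billions.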

Most implementations allow the average bucket occupation to grow until it reaches a bound fixed in advance (anywhere between 0.5 and 3, which are all acceptable values). With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation within a factor-of-2 band.

Sub-note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one element (maximum lookup speed, ignoring the complex effects of cache size), or as high as 3 if you want a minimum number of empty buckets (which correspond to wasted space).
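
To illustrate the policy, here is a minimal chaining hash set that doubles its bucket count whenever the load factor would exceed a fixed bound; the bound of 1.0, the initial bucket count of 8, and the class name are arbitrary choices for this sketch, not anyone's recommended values:

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Minimal chaining hash set that resizes by doubling whenever the average
// bucket occupation (load factor) would exceed a fixed bound.
template <typename Key, typename Hash = std::hash<Key>>
class ChainedSet {
public:
    ChainedSet() : buckets_(8) {}

    void insert(const Key& key) {
        if (contains(key)) return;
        if (static_cast<double>(size_ + 1) / buckets_.size() > max_load_)
            rehash(buckets_.size() * 2);  // double: load factor stays in a *2 band
        buckets_[index(key)].push_back(key);
        ++size_;
    }

    bool contains(const Key& key) const {
        for (const Key& k : buckets_[index(key)])
            if (k == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }

private:
    std::size_t index(const Key& key) const {
        return Hash{}(key) % buckets_.size();
    }

    void rehash(std::size_t new_bucket_count) {
        std::vector<std::list<Key>> fresh(new_bucket_count);
        // Every stored element must be rehashed into the new bucket array,
        // which is why resizes have to become rarer as the table grows.
        for (auto& bucket : buckets_)
            for (auto& k : bucket)
                fresh[Hash{}(k) % new_bucket_count].push_back(std::move(k));
        buckets_.swap(fresh);
    }

    std::vector<std::list<Key>> buckets_;
    std::size_t size_ = 0;
    double max_load_ = 1.0;
};
```

Because each resize rehashes every stored element, doubling means only O(log n) resizes happen while inserting n elements, so the total rehashing work stays O(n).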

Solution 2

I had read a very interesting discussion on growth strategy on this very site... just cannot find it again.

While 2 is commonly used, it has been argued that it is not the best value. One often-cited problem is that it does not cope well with allocator schemes (which often allocate power-of-two-sized blocks), since it always requires a fresh allocation, while a smaller factor might in fact be reallocated in the same block (simulating in-place growth) and thus be faster.

Thus, for example, the VC++ Standard Library uses a growth factor of 1.5 (ideally it would be the golden ratio, if a first-fit memory allocation strategy is being used) after an extensive discussion on the mailing list. The reasoning is explained here:

I'd be interested if any other vector implementations use a growth factor other than 2, and I'd also like to know whether VC7 uses 1.5 or 2 (since I don't have that compiler here).

There is a technical reason to prefer 1.5 to 2 -- more specifically, to prefer values less than (1+sqrt(5))/2.

Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.

It turns out that if the growth factor is >= (1+sqrt(5))/2, the new memory will always be too big for the hole that has been left so far; if it is < (1+sqrt(5))/2, the new memory will eventually fit. So 1.5 is small enough to allow the memory to be recycled.

Surely, if the growth factor is >= 2 the new memory will always be too big for the hole that has been left so far; if it is < 2, the new memory will eventually fit. Presumably the reason for (1+sqrt(5))/2 is...

  • Initial allocation is s.
  • The first resize is k*s.
  • The second resize is k*k*s, which will fit the hole left by the first two allocations iff k*k*s <= k*s + s, i.e. iff k*k <= k + 1, i.e. iff k <= (1+sqrt(5))/2

...the hole can be recycled asap.

It could, by storing its previous size, grow in Fibonacci fashion.

Of course, it should be tailored to the memory allocation strategy.
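
The recycling argument can be checked with a toy simulation. The sketch below models an idealized first-fit arena in which blocks are placed at the lowest offset where they fit and only the growing buffer allocates from it; this is a simplifying assumption, not a real allocator. Under that model, a growth factor of 2 never reuses the accumulated hole, while 1.5 eventually does:

```cpp
#include <cstddef>
#include <iostream>
#include <map>

// Simplified first-fit arena: blocks are placed at the lowest free offset
// where they fit; only one growing buffer allocates from it.
class FirstFitArena {
public:
    // Returns the offset at which a block of `size` units is placed.
    std::size_t allocate(std::size_t size) {
        std::size_t offset = 0;
        for (const auto& [start, len] : blocks_) {  // blocks sorted by offset
            if (start - offset >= size) break;      // the gap before this block fits
            offset = start + len;                   // otherwise skip past it
        }
        blocks_[offset] = size;
        return offset;
    }
    void free(std::size_t offset) { blocks_.erase(offset); }

private:
    std::map<std::size_t, std::size_t> blocks_;     // offset -> length
};

// Grow a vector-like buffer `steps` times with the given factor and report
// whether any reallocation was able to reuse freed space below the old block.
void simulate(double factor, int steps) {
    FirstFitArena arena;
    std::size_t size = 64;
    std::size_t offset = arena.allocate(size);
    bool reused_hole = false;
    for (int i = 0; i < steps; ++i) {
        std::size_t new_size = static_cast<std::size_t>(size * factor);
        std::size_t new_offset = arena.allocate(new_size);  // copy happens here
        arena.free(offset);                                 // old block becomes a hole
        if (new_offset < offset) reused_hole = true;        // landed in the hole
        offset = new_offset;
        size = new_size;
    }
    std::cout << "factor " << factor << ": "
              << (reused_hole ? "eventually reused the hole\n"
                              : "never reused the hole\n");
}

int main() {
    simulate(2.0, 20);   // never reuses freed space
    simulate(1.5, 20);   // eventually fits into the accumulated hole
}
```

With a factor of 2 the sum of all previously freed blocks is always smaller than the next request, so the buffer keeps marching forward in memory; with 1.5 the accumulated hole eventually catches up, matching the (1+sqrt(5))/2 bound above.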

Solution 3

One reason for doubling the size that is specific to hash containers is that if the container capacity is always a power of two, then instead of using a general-purpose modulo for converting a hash to an offset, the same result can be achieved with a bitwise AND that masks off the high bits. Modulo is a slow operation for the same reasons that integer division is slow. (Whether integer division is "slow" in the context of whatever else is going on in a program is of course case dependent, but it's certainly slower than other basic integer arithmetic.)
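
For example (a minimal illustration, not taken from any particular library), with a power-of-two bucket count the modulo reduces to a mask:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// When the bucket count is a power of two, reducing a hash to a bucket index
// can use a bitwise AND with (capacity - 1) instead of the modulo operator;
// both produce the same result, but the AND avoids an integer division.
std::size_t bucket_index(std::uint64_t hash, std::size_t capacity_pow2) {
    return hash & (capacity_pow2 - 1);   // equivalent to hash % capacity_pow2
}

int main() {
    const std::size_t capacity = 1024;   // must be a power of two
    for (std::uint64_t h : {0ull, 1ull, 1023ull, 1024ull, 0x9E3779B97F4A7C15ull})
        assert(bucket_index(h, capacity) == h % capacity);
}
```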

Solution 4

The same reasoning for doubling the size applies as for vector/ArrayList implementations; see this answer.

Solution 5

Doubling the memory when expanding any type of collection is a commonly used strategy to prevent memory fragmentation and avoid reallocating too often. As you point out, there might be reasons to use a prime number of buckets. When you know your application and your data, you might also be able to predict the growth of the number of elements and thus choose a different (larger or smaller) growth factor than doubling.

The general implementations found in libraries are exactly that: general implementations. They have to be a reasonable choice in a variety of different situations. When the context is known, it is almost always possible to write a more specialized and more efficient implementation.
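
As a hedged illustration of that last point using the standard C++ containers: if the number of elements is roughly known in advance, the library's generic growth policy can be sidestepped entirely by pre-sizing, and the load-factor bound itself is exposed as a knob:

```cpp
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> counts;

    // Bound on the average bucket occupation before the table resizes;
    // 0.75 is an arbitrary illustrative choice.
    counts.max_load_factor(0.75f);

    // If we expect about 100000 elements, reserving up front means the
    // insertions below trigger no intermediate rehashing at all.
    counts.reserve(100000);

    counts["example"] = 1;  // ... further insertions go here
}
```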

Comments

  • Chirael
    Chirael almost 2 years

    I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either max or via a load factor like 75%) the hash table should be expanded.

    Almost always, the recommendation is to double (or double plus 1, i.e., 2n+1) the size of the hash table. However, I haven't been able to find a good reason for this.

    Why double the size, rather than, say, increasing it 25%, or increasing it to the size of the next prime number, or next k prime numbers (e.g., three)?

    I already know that it's often a good idea to choose an initial hash table size which is a prime number, at least if your hash function uses modulus such as universal hashing. And I know that's why it's usually recommended to do 2n+1 instead of 2n (e.g., http://www.concentric.net/~Ttwang/tech/hashsize.htm)

    However as I said, I haven't seen any real explanation for why doubling or doubling-plus-one is actually a good choice rather than some other method of choosing a size for the new hash table.

    (And yes I've read the Wikipedia article on hash tables :) http://en.wikipedia.org/wiki/Hash_table

  • Pascal Cuoq
    Pascal Cuoq about 14 years
    @andras Right. 1.5 and 3 are just examples of reasonable values between which the doubling strategy makes the load factor vary. I took the values I knew to be used in my favorite language, but there's nothing special about them.
  • Pascal Cuoq
    Pascal Cuoq about 14 years
    @andras I changed the formulation. If you looked it up, out of curiosity, what are the values for .Net and Java?
  • Pascal Cuoq
    Pascal Cuoq about 14 years
    @andras because then, the amortized insertion time will in practice be dominated by the cost of resizing (you have to recompute the hash of all stored values when you resize). Sure, you could use 1.5 or 3; a factor of 2 just happens to balance the various costs in an acceptable manner. This is purely heuristic, any factor >1 gives the same asymptotic complexity. And even if you know what kind of memory/speed trade-off you are aiming for, there is no optimal value because everything depends on the costs of the provided hash and equality functions anyway.
  • Pascal Cuoq
    Pascal Cuoq about 14 years
    @andras Well, if you get to the bottom of it, yes, it's the time taken by the numerous resize operations on the one hand, and the space wasted just after a resize on the other hand. Why did you delete the comment with the values in .Net and Java?
  • Pascal Cuoq
    Pascal Cuoq about 14 years
    @andras It's really too bad the new version is inferior, because I edited it (repeatedly) on your suggestion. Have you considered the idea of editing your own answer to give us your views on hash tables? PS: feel free to vote this answer as you see fit.
  • Andras Vass
    Andras Vass about 14 years
    @Pascal: I suggested you remove the parts about the load factors. For one, values outside 0.5 and 3 can just as well be acceptable depending on your needs. For two: a load factor of 3 is clearly not possible with open addressing hash tables, which are just as common as chaining ones. For three: no, you do not need to take an average bucket occupation as low as 0.5. In fact, you may be happy with a minimum load factor of 1 with your chaining hash table and a value of 0.25 for your open addressing one. Open addressing ones usually operate with a load factor at or below 3/4.
  • Andras Vass
    Andras Vass about 14 years
    @Pascal: I did not suggest that doubling will not keep the load factor between bounds. I suggested that any number would keep it between bounds. Certainly, a growth factor of 1.1 will keep it between even tighter bounds. So what you state is clearly not the reason for choosing 2.
  • Praxeolitic
    Praxeolitic about 8 years
    Which implementation is this for and can you link to the discussion?
  • Matthieu M.
    Matthieu M. about 8 years
    @Praxeolitic: I have the memory of a Golden Fish, so unfortunately no, I cannot remember exactly what I meant 6 years ago (already!). That being said, inquiries (Google) show that discussions occurred on comp.lang.c++.moderated in 2003 and that at the time Dinkumware (VC++) already used a growth factor of 1.5 for vector. There was a discussion on gcc in 2004 about switching from 2 to 1.5, but given this thread from 2013 it did not budge then.
  • Praxeolitic
    Praxeolitic about 8 years
    Neat, I hadn't seen that golden number argument before. For anyone who, like me, finds the comp.lang.c++.moderated thread confusing, this explanation is a bit fuller.
  • Matthieu M.
    Matthieu M. about 8 years
    @Praxeolitic: Neat link! I am personally (now) wondering whether 1.5 is still a better factor than 2 with modern allocators such as jemalloc/tcmalloc, which use size buckets (with a custom growth factor between buckets); it's a little less clear. It might work with C++11 and trivial move constructors, in which case you could use realloc, which would keep the item in the same size bucket... but really what we see here is a limitation in the std::allocator design => when you ask for N bytes of memory, it should not only give you a memory block of at least N bytes, but also tell you how many bytes the block contains.
  • Praxeolitic
    Praxeolitic about 8 years
    Let's see, some of the logic for this in Facebook's FBVector is here under function computePushBackCapacity. It first increases according to bucket sizes when it's small (in practice just a growth factor of 2), then grows by 1.5 when it's medium sized, and finally grows by 2 again (I guess to use whole memory pages).
  • Developer Sheldon
    Developer Sheldon over 5 years
    "That leaves a gap" - what is the gap you mean here? I am trying to use the links you guys gave, but most have expired.
  • Matthieu M.
    Matthieu M. over 5 years
    @DeveloperSheldon: This is the teaser; the explanation takes the rest of the quote. Essentially, the idea is that if malloc allocated 2 bytes, then 4 bytes, and you now ask for 8 bytes, it cannot combine 2+4 to satisfy the allocation, so it needs a brand new 8 bytes. And the next time, as you request 16 bytes, it cannot combine 2+4+8 to satisfy the allocation, etc... Now, at small sizes this is unlikely to happen, however at larger sizes (MB range and above) it does happen. Proponents of using 1.5 therefore make the case that with 1.5 recycling works: 2+4 >= 6 = 4*1.5.
  • Developer Sheldon
    Developer Sheldon over 5 years
    @MatthieuM. appreciate your detailed answer. That makes a whole lot of sense to me. :)