Difference between Internal Hash and External Hash

hash hashmap hashtable

15,264

The internal hashing is an array that contains the address of the hash key. Therefore every array index can only contain one address of the hash key so if another hash key assigns to the same index of the array this will cause a collision. On the other hand, external hashing is mainly buckets of M. And every bucket can occupy more than one hash key. Therefore collision will happen when the bucket gets full.

15,264

Author by

zer0uno

Updated on June 04, 2022

Comments

zer0uno almost 2 years

From the book 'Fundamentals of Database Systems':

Internal Hash

For internal files, hashing is typically implemented as a hash table through the use of an array of records. Suppose that the array index range is from 0 to M – 1, as shown in Figure 17.8(a); then we have M slots whose addresses correspond to the array indexes. We choose a hash function that transforms the hash field value into an integer between 0 and M − 1. One common hash function is the h(K) = K mod M function, which returns the remainder of an integer hash field value K after divi- sion by M; this value is then used for the record address. [...]

A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.

External Hash

Hashing for disk files is called external hashing. To suit the characteristics of disk storage, the target address space is made of buckets, each of which holds multiple records. A bucket is either one disk block or a cluster of contiguous disk blocks. The hashing function maps a key into a relative bucket number, rather than assigning an absolute block address to the bucket. A table maintained in the file header converts the bucket number into the corresponding disk block address, as illustrated in Figure 17.9. The collision problem is less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems.

I have the following question:
1) A record is always recorded inside a block, so does the internal hash return the block-address and the position of the record inside the block?
2) Why is the collision problem less severe with external hash? I mean, let's suppose that every block can store 10 records; I speculate that the file I will store contains 100 records, then, using external hash, I allocate maybe 11-12 buckets(I assume that a bucket=1 block), so the hash function will return 10-12 address to the buckets.
If I use an internal hash, because the hash function returns a direct address, I would use a functions that return to me about 100-120 addresses.
So what's the difference? In this way I think I have the same probability of collision using the two methods.