hash function for string


Solution 1

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++)) /* extra parens silence assignment-in-condition warnings */
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
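A note on usage, since it comes up in the comments below: the function returns a raw unsigned value, and the caller is responsible for reducing it to a table slot. A minimal self-contained sketch (the slot_of helper name is illustrative; TABLE_SIZE matches the asker's int table[10000]):

```c
#include <stddef.h>

#define TABLE_SIZE 10000  /* matches the asker's int table[10000] */

/* djb2, exactly as above */
static unsigned long djb2(const unsigned char *str)
{
    unsigned long h = 5381;
    int c;

    while ((c = *str++))
        h = ((h << 5) + h) + c; /* h * 33 + c */

    return h;
}

/* The caller, not the hash function, takes the modulo. */
static size_t slot_of(const char *s)
{
    return djb2((const unsigned char *)s) % TABLE_SIZE;
}
```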

Solution 2

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:

/* Portable rotate-left; most compilers turn this pattern into a
   single rotate instruction. */
static unsigned rol(unsigned r, int k) {
    return (r << k) | (r >> (32 - k));
}

unsigned hash(char const *input) { 
    unsigned result = 0x55555555;

    while (*input) { 
        result ^= (unsigned char)*input++;
        result = rol(result, 5);
    }
    return result;
}

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: either a prime number as the size (required to ensure correctness with some collision-resolution schemes, such as double hashing) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
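The difference the power-of-2 choice makes can be sketched as follows (TABLE_BITS and the index_* names are illustrative, not from any standard API):

```c
#include <stdint.h>

#define TABLE_BITS 13                 /* 2^13 = 8192 slots */
#define TABLE_SIZE (1u << TABLE_BITS)
#define TABLE_MASK (TABLE_SIZE - 1u)

/* Power-of-2 size: reduction is a single AND. */
static uint32_t index_pow2(uint32_t h)
{
    return h & TABLE_MASK;
}

/* Prime size: reduction needs a full modulo (8191 = 2^13 - 1 is prime). */
static uint32_t index_prime(uint32_t h)
{
    return h % 8191u;
}
```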

Solution 3

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

uint32_t jenkins_one_at_a_time_hash(const char *key, size_t len)
{
    uint32_t hash = 0;
    size_t i;
    for (i = 0; i < len; ++i)
    {
        hash += (unsigned char)key[i]; /* cast guards against negative chars */
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

Solution 4

There are a number of existing hash table implementations for C, from POSIX's hcreate/hdestroy/hsearch (declared in search.h), to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hash table or hash function; they've been optimized heavily for common use-cases.
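For reference, a minimal hcreate/hsearch sketch (these are POSIX functions declared in search.h; the lookup_demo wrapper is purely illustrative). Note that lookups match on the key's string contents, not on its pointer:

```c
#include <search.h>

/* Insert one entry, then look it up through a different ENTRY. */
static long lookup_demo(void)
{
    if (!hcreate(50))          /* 50 = expected number of entries */
        return -1;

    ENTRY e;
    e.key = "apple";
    e.data = (void *)42L;
    hsearch(e, ENTER);         /* insert */

    ENTRY probe;
    probe.key = "apple";       /* an equal string is enough to match */
    ENTRY *found = hsearch(probe, FIND);

    long result = found ? (long)found->data : -1;
    hdestroy();
    return result;
}
```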

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.

Solution 5

djb2 has 317 collisions for this 466k-word English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if available; it is very fast because it consumes several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:

static inline uint32_t MurmurOAAT32(const char *key)
{
  uint32_t h = 3323198485u;
  for (; *key; ++key) {
    h ^= (unsigned char)*key;
    h *= 0x5bd1e995u;
    h ^= h >> 15;
  }
  return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
  uint64_t h = 525201411107845655ull;
  for (; *key; ++key) {
    h ^= (unsigned char)*key;
    h *= 0x5bd1e9955bd1e995ull;
    h ^= h >> 47;
  }
  return h;
}

The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know, or want to look up, how much memory is available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making it smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of apparently 2 - sqrt(2) = 0.58...
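Following that rule of thumb, a sizing helper might look like this (table_capacity is a hypothetical name; it picks the smallest power of two at least twice the expected element count, so the index can then be computed with a mask):

```c
#include <stdint.h>

/* Smallest power of two >= 2 * expected, keeping the load factor
 * at or below 0.5 and allowing index = hash & (capacity - 1). */
static uint64_t table_capacity(uint64_t expected)
{
    uint64_t cap = 1;
    while (cap < 2 * expected)
        cap <<= 1;
    return cap;
}
```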

Author: lilawood

Updated on July 08, 2022

Comments

  • lilawood
    lilawood almost 2 years

    I'm working on a hash table in C and I'm testing a hash function for strings.

    The first function I tried adds up the ASCII codes and uses modulo (% 100), but I got poor results with the first test data: 40 collisions for 130 words.

    The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.

    • Which is the best algorithm for hashing strings?
    • And how do I determine the size of the hash table?
    • Carey Gregory
      Carey Gregory over 12 years
      If your hash table has 10K entries, why would you use modulo 100? Getting 40 collisions out of 130 words isn't surprising with such a small modulus.
    • Admin
      Admin over 12 years
      There are many string hash implementations available on both google and SO (read: more searching is in order). Many approaches use a "barrel shift" or "rolling" hash (possibly with "mixing" phases) -- but please pay heed to Gregory!
    • Admin
      Admin over 12 years
      See burtleburtle.net/bob/hash/evahash.html and partow.net/programming/hashfunctions for which are resources about various hashing (from general to string to crypto).
    • derobert
      derobert over 12 years
      To clarify @CareyGregory: You do realize that, as a basic mathematical truth, 130 items in 100 buckets (i.e., mod 100) must produce 30 collisions (where collision is counted as each time a second, third, etc. item is put in a bucket), correct? So you're only a little above that.
    • Carey Gregory
      Carey Gregory over 12 years
      @derobert: Yes, a minimum of 30 collisions. But that's not my question. The OP states the hash table has 10K entries. So why a modulus of 100? Unless I misunderstand what lilawood meant, that would leave 9900 entries in the hash table unusable.
    • derobert
      derobert over 12 years
      @CareyGregory: I think its the 8,000 word version that is using the 10,000-entry hash—hopefully with a higher modulus. I'm guessing the 130-word version (with mod 100) is a reduced-size problem for easier debugging, etc. That clarification is aimed at lilawood, to explain why (as you commented) 40 collisions isn't surprising.
    • lilawood
      lilawood over 12 years
      I've tried only a small part of input data to test the algorithm as @derobert said.
    • Carey Gregory
      Carey Gregory over 12 years
      @lilawood: OK, that's what I figured, but to be a better test you should use 80 words with a hash table of 100 entries. That would give you the same proportions as your live data and wouldn't force collisions.
    • M.J. Rayburn
      M.J. Rayburn over 6 years
      Possible duplicate of Good Hash Function for Strings
    • Yann Droneaud
      Yann Droneaud almost 6 years
      the best answers for this question can be found on another stackexchange site: softwareengineering.stackexchange.com/questions/49550/…
  • Adrien Plisson
    Adrien Plisson over 12 years
    the page linked in the answer is very interesting.
  • Daniel N.
    Daniel N. about 9 years
    how does the program get out of the while loop?? =S
  • rxantos
    rxantos almost 9 years
    @danfly09 When c is zero. The equivalent of while(c = *str++) would be (0 != (c = *str++))
  • Suragch
    Suragch almost 9 years
    This isn't c, but I would be interested in your thoughts to this related answer: stackoverflow.com/a/31440118/3681880
  • Jerry Coffin
    Jerry Coffin almost 9 years
    @Suragch: Since I wrote this, quite a few processors have started to include either special hardware to accelerate SHA computation, which has made it much more competitive. That said, I doubt your code is quite as safe as you think--for example, IEEE floating point numbers have two different bit patterns (0 and -0) that should produce the same hashes (they'll compare as equal to each other).
  • Josepas
    Josepas about 8 years
    I dont understand how this works, this is gonna give me a big number right? what do I have to do if I have a table of size 13, take the mod 13 of the number this function gives to me?
  • Sandeep
    Sandeep almost 8 years
    Does hsearch search by comparing the strings or the string pointer addresses? I think it is just checking the pointer address. I tried using different pointers with the same string value; hsearch fails, stating no elements found.
  • WhozCraig
    WhozCraig almost 8 years
    @Josepas the hash function should ideally return a size_t or other such unsigned value (such as the unsigned long in this code). The caller is responsible for taking modulo of the result to fit it to the hash table. The caller controls the table slot being hashed to; not the function. It just returns some unsigned number.
  • David Haim
    David Haim over 7 years
    amazing. this algorithm beat the hell out of Murmur hash, FNV variants hashes and many others! +1
  • thewaywewere
    thewaywewere almost 7 years
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  • Gabriel Staples
    Gabriel Staples almost 7 years
    I just posted the K&R 1st & 2nd edition hash algorithms here (stackoverflow.com/a/45641002/4561887) too. Although the 1st edition is purported by Dan Bernstein to be pretty terrible, I'm guessing the 2nd edition one is pretty decent. I think it's good to be aware of both of these for at least historical reasons and general awareness.
  • Gabriel Staples
    Gabriel Staples almost 7 years
    Upvoted for a good table, but posting the source code for each of those hashes in your answer is essential too. Otherwise, the links may break and we are out of luck.
  • CopperCash
    CopperCash over 6 years
    Why isn't c of unsigned char type or unsigned long type. Is there any special reason for it to be of int type?
  • user2262111
    user2262111 about 6 years
    @CopperCash The reason so is because an int == char, but save headache of worrying about <<'s and >>'s, esp in unoptimized code - remember a computer allocates char as a WORD and not a singular byte (depending on architecture). Fortunately, signedness of these characters do not matter, because how C handles it on + c; Check it out here ideone.com/4XasWc
  • user2262111
    user2262111 about 6 years
    ideone.com/AV35CR This link contains the same function using stdint.h, however, ideone which uses gcc 6.3.0 and had troubles using a char* for unsigned char* input. Use the latest compiler :)
  • CopperCash
    CopperCash about 6 years
    @user2262111 thx for the explanation. So, long or unsigned long should work as well, right? Because it is "wider" than a WORD.
  • user2262111
    user2262111 over 5 years
    @CopperCash a WORD is 16bits, a DWORD is your typical 32bits. longs are usually DWORD size. long is also analogous to long int. long long however have more bits to them. You probably want the standard wide variants of these. Like wchar_t for example.
  • AnotherDeveloper
    AnotherDeveloper about 5 years
    Hi, there is a slight improvement required in your code: after h=((h<<5)+h)+c; you should also do h=h%MAX, where MAX is the max range of the hash table, else there are chances of overflow which may not be detected.
  • Aviv Cohn
    Aviv Cohn about 5 years
    Any particular reason why this takes unsigned char* instead of char*? I mean, textual strings are usually represented simply as char*. Am I missing something? Does the algorithm assume unsigned char*?
  • thunderbird
    thunderbird almost 5 years
    The formula at the shared link says hash(i) = hash(i - 1) * 33 ^ str[i]; but it is actually implemented as hash(i) = (hash(i - 1) * 33) + str[i]; ...is the former just a typo?
  • Johann Oskarsson
    Johann Oskarsson over 4 years
    If I'm not mistaken this suffers from the same problem as K&R 1st in Gabriel's answer; i.e. "ab" and "ba" will hash to the same value.
  • Wolfgang Brehm
    Wolfgang Brehm over 4 years
    The expected number of collisions should be 9.112499989700318E+7 or 0.103 * 960³ if the hashes were truly random, so I would not have been surprised if they were all around that value, but 0.0616 * 960³ seems a bit off, almost as if the hashes are more evenly distributed than what would be expected by chance, and at 64 bytes length this limit should definitely be approached. Can you share the set of strings that you hashed so I can try to reproduce it?
  • Wolfgang Brehm
    Wolfgang Brehm over 4 years
    The expected number of hash collisions is n - m * (1 - ((m-1)/m)^n) = 57.075.... 40 collisions is better than what could be expected by chance (46 to 70 at a p-score of 0.999). The hash function in question is more uniform than if it were random or we are witnessing a very rare event.
  • Emile Cormier
    Emile Cormier over 4 years
    @AvivCohn, if it took char*, then c would be a negative number whenever *str++ < 0 on systems where char is signed.
  • Emile Cormier
    Emile Cormier over 4 years
    @thunderbird It says "another version of this algorithm (now favored by bernstein) uses xor" [emphasis mine]
  • thanos.a
    thanos.a about 4 years
    @Jerry Coffin which library do I need for the rol() function?
  • Jerry Coffin
    Jerry Coffin about 4 years
    @thanos.a: I'm not aware of it being in a library, but rolling your own only takes a line or two of code. Shift one chunk left, the other chunk right, and or them together.
  • Shipof123
    Shipof123 over 3 years
    @user2262111 For me (gcc on Linux x64), sizeof(long) is 8 bytes, also using word to measure size isn't very portable, since different processors have different word size. All of this confusion can be avoided by using stdint.h
  • codemonkey
    codemonkey over 3 years
    why does this use an unsigned char, and how can I use it if I have a char?
  • user2262111
    user2262111 over 3 years
    @codemonkey I prefer people to use stdint.h which I touched on at the end. I require in all my programs the size I require. However, when you malloc(1 * sizeof(char)) regardless of your system size of 1 or 2 bytes per char, you get a system WORD instead. My description of WORD sizes seemed to leave out the fact that DWORD and WORD are used interchangeably because people. The true canonical term for WORD is 4 bytes on a 32bit processor and 8 bytes on a 64bit processor. Therefore your char whether perceived as 1 or 2 bytes is still 8. This is due to register girth.
  • user2262111
    user2262111 over 3 years
    And don't get me started on what malloc does.
  • doug65536
    doug65536 almost 3 years
    @HumbleDeveloper It doesn't matter if it overflows.
  • doug65536
    doug65536 almost 3 years
    Why 33 instead of 31? 33 isn't prime, 31 is. Java uses this algorithm with 31 instead of 33 for String.hashCode.
  • vonbrand
    vonbrand almost 3 years
    @thanos.a, you can hand-roll it like static inline unsigned rol(unsigned r, int k) {return (r << k) | (r >> (32 - k));} (assuming 32-bit integers). At least GCC on x86-64 does compile this down to one instruction.
  • Gabriel Staples
    Gabriel Staples about 2 years
    Anyone know what 14s only finish 1/20 means? I don't understand that line.
  • bryc
    bryc almost 2 years
    Some critique: result need only be non-0; 1 is fine. Good, XORing input beats ADD. Using += for ROL is much better than =. ROL can be SHL, if needed. Changing 5 to 7 reduces collisions. Two line finalizer h += h<<17; h ^= h>>12; will avalanche a bit, improving uniformity. With that, potentially the best minimal 32-bit hash/checksum in 8 instructions, beating comparable functions like ELFHash/BKDRHash/DJB2. Though ultimately sub-optimal on 64-bit machines.
  • Jerry Coffin
    Jerry Coffin almost 2 years
    @bryc: Using a left shift instead of a rotate means that if you hash a long enough string, the characters you hash first will have no effect on the final result.
  • bryc
    bryc almost 2 years
    With your original code, or mine (code here)? Can't seem to reproduce that issue on mine; changing the first char of a long string changes hash as expected. I'm also noticing if I switch to ROL 7 instead of SHL 7 in my function, it actually appears to fail more collision tests, specifically a single rolling byte test (starting from index 0, a byte of 1 moves right across 16384 null bytes, individually hashing the full array each step).