hash function for string
Solution 1
I've had nice results with djb2
by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
Solution 2
First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
static unsigned rol(unsigned value, int bits) {
    return (value << bits) | (value >> (32 - bits));
}

unsigned hash(char const *input) {
    unsigned result = 0x55555555;
    while (*input) {
        result ^= (unsigned char)*input++;
        result = rol(result, 5);
    }
    return result;
}
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
Solution 3
Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
#include <stddef.h>
#include <stdint.h>

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}
Solution 4
There are a number of existing hashtable implementations for C, from POSIX's hcreate/hdestroy/hsearch to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
Solution 5
djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if available; it is very fast because it consumes several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:
static inline uint32_t MurmurOAAT32(const char *key)
{
    uint32_t h = 3323198485ul;
    for (; *key; ++key) {
        h ^= (unsigned char)*key;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
    uint64_t h = 525201411107845655ull;
    for (; *key; ++key) {
        h ^= (unsigned char)*key;
        h *= 0x5bd1e9955bd1e995ull;
        h ^= h >> 47;
    }
    return h;
}
The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we usually don't know, or don't want to look up, how much memory is available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making it smaller will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an apparent optimal load factor of 2 - sqrt(2) = 0.58...
lilawood
Updated on July 08, 2022

Comments
-
lilawood almost 2 years
I'm working on a hash table in C and I'm testing a hash function for strings. The first function I tried adds the ASCII codes and uses modulo (% 100), but I got poor results with the first test data: 40 collisions for 130 words. The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.
- Which is the best algorithm for hashing strings?
- And how do I determine the size of the hash table?
-
Carey Gregory over 12 years
If your hash table has 10K entries, why would you use modulo 100? Getting 40 collisions out of 130 words isn't surprising with such a small modulus.
-
Admin over 12 years
There are many string hash implementations available on both google and SO (read: more searching is in order). Many approaches use a "barrel shift" or "rolling" hash (possibly with "mixing" phases) -- but please pay heed to Gregory!
-
Admin over 12 years
See burtleburtle.net/bob/hash/evahash.html and partow.net/programming/hashfunctions, which are resources about various hashing (from general to string to crypto).
-
derobert over 12 years
To clarify @CareyGregory: You do realize that, as a basic mathematical truth, 130 items in 100 buckets (i.e., mod 100) must produce 30 collisions (where a collision is counted each time a second, third, etc. item is put in a bucket), correct? So you're only a little above that.
-
Carey Gregory over 12 years
@derobert: Yes, a minimum of 30 collisions. But that's not my question. The OP states the hash table has 10K entries. So why a modulus of 100? Unless I misunderstand what lilawood meant, that would leave 9900 entries in the hash table unusable.
-
derobert over 12 years
@CareyGregory: I think it's the 8,000-word version that is using the 10,000-entry hash table, hopefully with a higher modulus. I'm guessing the 130-word version (with mod 100) is a reduced-size problem for easier debugging, etc. That clarification is aimed at lilawood, to explain why (as you commented) 40 collisions isn't surprising.
-
lilawood over 12 years
I've tried only a small part of the input data to test the algorithm, as @derobert said.
-
Carey Gregory over 12 years
@lilawood: OK, that's what I figured, but to be a better test you should use 80 words with a hash table of 100 entries. That would give you the same proportions as your live data and wouldn't force collisions.
-
M.J. Rayburn over 6 years
Possible duplicate of Good Hash Function for Strings
-
Yann Droneaud almost 6 years
The best answers for this question can be found on another stackexchange site: softwareengineering.stackexchange.com/questions/49550/…
-
Gabriel Staples about 2 years
-
Adrien Plisson over 12 years
The page linked in the answer is very interesting.
-
Daniel N. about 9 years
How does the program get out of the while loop? =S
-
rxantos almost 9 years
@danfly09 When c is zero. The equivalent of while (c = *str++) would be while (0 != (c = *str++)).
-
Suragch almost 9 years
This isn't C, but I would be interested in your thoughts on this related answer: stackoverflow.com/a/31440118/3681880
-
Jerry Coffin almost 9 years
@Suragch: Since I wrote this, quite a few processors have started to include special hardware to accelerate SHA computation, which has made it much more competitive. That said, I doubt your code is quite as safe as you think--for example, IEEE floating point numbers have two different bit patterns (0 and -0) that should produce the same hashes (they'll compare as equal to each other).
-
Josepas about 8 years
I don't understand how this works. It's going to give me a big number, right? What do I have to do if I have a table of size 13, take mod 13 of the number this function gives me?
-
Sandeep almost 8 years
Does hsearch search by comparing the strings or the string pointer addresses? I think it is just checking the pointer address. I tried using different pointers with the same string value; hsearch fails stating no elements found.
-
WhozCraig almost 8 years
@Josepas the hash function should ideally return a size_t or other such unsigned value (such as the unsigned long in this code). The caller is responsible for taking modulo of the result to fit it to the hash table. The caller controls the table slot being hashed to; not the function. It just returns some unsigned number.
-
David Haim over 7 years
Amazing. This algorithm beat the hell out of Murmur hash, the FNV variant hashes and many others! +1
-
thewaywewere almost 7 years
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
-
Gabriel Staples almost 7 years
I just posted the K&R 1st & 2nd edition hash algorithms here (stackoverflow.com/a/45641002/4561887) too. Although the 1st edition one is purported by Dan Bernstein to be pretty terrible, I'm guessing the 2nd edition one is pretty decent. I think it's good to be aware of both of these for at least historical reasons and general awareness.
-
Gabriel Staples almost 7 years
Upvoted for a good table, but posting the source code for each of those hashes in your answer is essential too. Otherwise, the links may break and we are out of luck.
-
CopperCash over 6 years
Why isn't c of unsigned char type or unsigned long type? Is there any special reason for it to be of int type?
-
user2262111 about 6 years
@CopperCash The reason is that an int == char, but it saves the headache of worrying about <<'s and >>'s, especially in unoptimized code - remember a computer allocates a char as a WORD and not a singular byte (depending on architecture). Fortunately, the signedness of these characters does not matter, because of how C handles it on + c; check it out here: ideone.com/4XasWc
-
user2262111 about 6 years
ideone.com/AV35CR This link contains the same function using stdint.h; however, ideone, which uses gcc 6.3.0, had trouble using a char* for unsigned char* input. Use the latest compiler :)
-
CopperCash about 6 years
@user2262111 thx for the explanation. So, long or unsigned long should work as well, right? Because it is "wider" than a WORD.
-
user2262111 over 5 years
@CopperCash a WORD is 16 bits, a DWORD is your typical 32 bits. longs are usually DWORD size. long is also analogous to long int. long long, however, has more bits. You probably want the standard wide variants of these, like wchar_t for example.
-
AnotherDeveloper about 5 years
Hi, there is a slight improvement required in your code: after h = ((h << 5) + h) + c; you should also do h = h % MAX, where MAX is the max range of the hash table, else there is a chance of overflow, which may not be detected.
-
Aviv Cohn about 5 years
Any particular reason why this takes unsigned char* instead of char*? I mean, textual strings are usually represented simply as char*. Am I missing something? Does the algorithm assume unsigned char*?
-
thunderbird almost 5 years
The formula at the link shared says hash(i) = hash(i - 1) * 33 ^ str[i]; but it is actually implemented as hash(i) = (hash(i - 1) * 33) + str[i]; ...is the former just a typo?
-
Johann Oskarsson over 4 years
If I'm not mistaken this suffers from the same problem as K&R 1st in Gabriel's answer; i.e. "ab" and "ba" will hash to the same value.
-
Wolfgang Brehm over 4 years
The expected number of collisions should be 9.112499989700318E+7 or 0.103 * 960³ if the hashes were truly random, so I would not have been surprised if they were all around that value, but 0.0616 * 960³ seems a bit off, almost as if the hashes are more evenly distributed than what would be expected by chance, and at 64 bytes length this limit should definitely be approached. Can you share the set of strings that you hashed so I can try to reproduce it?
-
Wolfgang Brehm over 4 years
The expected number of hash collisions is n - m * (1 - ((m-1)/m)^n) = 57.075... 40 collisions is better than what could be expected by chance (46 to 70 at a p-score of 0.999). The hash function in question is more uniform than if it were random, or we are witnessing a very rare event.
-
Emile Cormier over 4 years
@AvivCohn, if it took char*, then c would be a negative number whenever *str++ < 0 on systems where char is signed.
-
Emile Cormier over 4 years
@thunderbird It says "another version of this algorithm (now favored by bernstein) uses xor" [emphasis mine]
-
thanos.a about 4 years
@Jerry Coffin which library do I need for the rol() function?
-
Jerry Coffin about 4 years
@thanos.a: I'm not aware of it being in a library, but rolling your own only takes a line or two of code. Shift one chunk left, the other chunk right, and or them together.
-
Shipof123 over 3 years
@user2262111 For me (gcc on Linux x64), sizeof(long) is 8 bytes. Also, using word to measure size isn't very portable, since different processors have different word sizes. All of this confusion can be avoided by using stdint.h
-
codemonkey over 3 years
Why does this use an unsigned char, and how can I use it if I have a char?
-
user2262111 over 3 years
@codemonkey I prefer people to use stdint.h, which I touched on at the end. I require in all my programs the size I require. However, when you malloc(1 * sizeof(char)), regardless of your system size of 1 or 2 bytes per char, you get a system WORD instead. My description of WORD sizes seemed to leave out the fact that DWORD and WORD are used interchangeably because people. The true canonical term for WORD is 4 bytes on a 32bit processor and 8 bytes on a 64bit processor. Therefore your char, whether perceived as 1 or 2 bytes, is still 8. This is due to register girth.
-
user2262111 over 3 years
And don't get me started on what malloc does.
-
doug65536 almost 3 years
@HumbleDeveloper It doesn't matter if it overflows.
-
doug65536 almost 3 years
Why 33 instead of 31? 33 isn't prime, 31 is. Java uses this algorithm with 31 instead of 33 for String.hashCode.
-
vonbrand almost 3 years
@thanos.a, you can hand-roll it like static inline unsigned rol(unsigned r, int k) { return (r << k) | (r >> (32 - k)); } (assuming 32-bit integers). At least GCC on x86-64 does compile this down to one instruction.
-
Gabriel Staples about 2 years
Anyone know what "14s only finish 1/20" means? I don't understand that line.
-
bryc almost 2 years
Some critique: result need only be non-0; 1 is fine. Good, XORing input beats ADD. Using += for ROL is much better than =. ROL can be SHL, if needed. Changing 5 to 7 reduces collisions. A two-line finalizer h += h << 17; h ^= h >> 12; will avalanche a bit, improving uniformity. With that, potentially the best minimal 32-bit hash/checksum in 8 instructions, beating comparable functions like ELFHash/BKDRHash/DJB2. Though ultimately sub-optimal on 64-bit machines.
-
Jerry Coffin almost 2 years
@bryc: Using a left shift instead of a rotate means that if you hash a long enough string, the characters you hash first will have no effect on the final result.
-
bryc almost 2 years
With your original code, or mine (code here)? Can't seem to reproduce that issue on mine; changing the first char of a long string changes the hash as expected. I'm also noticing that if I switch to ROL 7 instead of SHL 7 in my function, it actually appears to fail more collision tests, specifically a single rolling-byte test (starting from index 0, a byte of 1 moves right across 16384 null bytes, individually hashing the full array each step).