Generate integer based on any given string (without GetHashCode)

21,867

Solution 1

MD5 hashing returns a byte array which could be converted to an integer:

var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);

Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.

Solution 2

If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.

If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 70,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.

As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32 bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger--closer to 70,000--but the rule of thumb gets you in the ballpark.

Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.

That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.

Solution 3

This code map any string to int between 0-100

int x= "ali".ToCharArray().Sum(x => x)%100;
Share:
21,867
mrb398
Author by

mrb398

Updated on May 23, 2020

Comments

  • mrb398
    mrb398 almost 4 years

    I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.

    I tried using .GetHasCode() however this is very unreliable once I move the project to another machine, as GetHasCode() returns different values for the same string

    It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.

    The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.

    The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully, however it currently relies on .GetHasCode() and doesn't work when multiple machines get involved.

    Any insight at all would be greatly appreciated.

  • mrb398
    mrb398 over 9 years
    Thanks for the response. This is more efficient version of the same code that I figured out a few minutes ago. My method now produces the same result on my machine and the server. Even though MD5 is working in my case, would MD5 ever produce different results on another machine?
  • Dave Zych
    Dave Zych over 9 years
    @user1691808 md5 is platform agnostic. For the same input, it will produce the same value regardless of machine/language.
  • mrb398
    mrb398 over 9 years
    Another note of something I did learn, a mysql BIGINT is always 8 bytes and can store -9223372036854775808 to 9223372036854775807, which appears to be the same range of integers the Int64 version of your code would produce. Just a tidbit of knowledge I wasn't aware of before.
  • user1582024
    user1582024 over 6 years
    Taking the sum will give a normal distribution instead of something uniform, this will increase the likelyhood of collisions.