How do I generate a hashcode from a byte array in C#?

Solution 1

The hash code of an object does not need to be unique.

The checking rule is:

  • Are the hash codes equal? Then call the full (slow) Equals method.
  • Are the hash codes not equal? Then the two items are definitely not equal.

All you want is a GetHashCode algorithm that splits your collection into roughly even groups - the hash code doesn't replace the key; the Hashtable or Dictionary<> just uses it to optimise retrieval.

How long do you expect the data to be? How random? If lengths vary greatly (say, for files) then just return the length. If lengths are likely to be similar, look at a subset of the bytes that varies.

GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.

Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are far more possible byte arrays than 32-bit integers).
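
For example, a minimal sketch of that idea - mix the length with a small, evenly spaced sample of the bytes (the ByteData wrapper, the step size and the 17/31 constants here are illustrative choices, not part of this answer):

    struct ByteData
    {
        private readonly byte[] data;
        public ByteData(byte[] data) { this.data = data; }

        public override int GetHashCode()
        {
            unchecked
            {
                int hash = 17;
                hash = hash * 31 + data.Length;

                // Sample at most 16 evenly spaced bytes so large arrays stay cheap to hash.
                int step = Math.Max(1, data.Length / 16);
                for (int i = 0; i < data.Length; i += step)
                {
                    hash = hash * 31 + data[i];
                }
                return hash;
            }
        }
    }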

Solution 2

I found interesting results:

I have the class:

public class MyHash : IEquatable<MyHash>
{        
    public byte[] Val { get; private set; }

    public MyHash(byte[] val)
    {
        Val = val;
    }

    /// <summary>
    /// Test if this Class is equal to another class
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        // Arrays of different lengths (or a null argument) can never be equal.
        if (other == null || other.Val.Length != this.Val.Length)
        {
            return false;
        }

        // Compare the contents byte by byte.
        for (var i = 0; i < this.Val.Length; i++)
        {
            if (other.Val[i] != this.Val[i])
            {
                return false;
            }
        }

        return true;
    }

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }
}

Then I created a dictionary with keys of type MyHash in order to test how fast insertions are and how many collisions occur. I did the following:

        // dictionary we use to check for collisions
        Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();

        // used to generate random arrays
        Random rand = new Random();

        var now = DateTime.Now;

        for (var j = 0; j < 100; j++)
        {
            for (var i = 0; i < 5000; i++)
            {
                // create a new array and populate it with random bytes
                byte[] randBytes = new byte[byte.MaxValue];
                rand.NextBytes(randBytes);

                MyHash h = new MyHash(randBytes);

                if (checkForDuplicatesDic.ContainsKey(h))
                {
                    Console.WriteLine("Duplicate");
                }
                else
                {
                    checkForDuplicatesDic[h] = true;
                }
            }
            Console.WriteLine(j);
            checkForDuplicatesDic.Clear(); // clear the dictionary after every 5000 insertions
        }

        var elapsed = DateTime.Now - now;
        Console.WriteLine(elapsed); // print the total time taken

        Console.Read();

Every time I insert a new item into the dictionary, the dictionary calculates the hash of that object. So you can tell which method is most efficient by dropping the various answers found here into public override int GetHashCode(). The method that was by far the fastest and had the fewest collisions was:

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }

It took 2 seconds to execute. The method

    public override int GetHashCode()
    {
        // 7.1 seconds
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < Val.Length; i++)
                hash = (hash ^ Val[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }

also had no collisions, but it took 7 seconds to execute!

Solution 3

Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.
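
Mirroring the MD5-based GetHashCode in the question, a minimal sketch of the SHA1 approach might look like this (folding the digest down to 32 bits via BitConverter is an assumption on my part, as are the type and method names):

    using System;
    using System.Security.Cryptography;

    public static class Sha1HashCodeSketch
    {
        // Compute a SHA1 digest of the array and take its first four bytes
        // as a 32-bit hash code.
        public static int Compute(byte[] data)
        {
            using (var sha1 = new SHA1CryptoServiceProvider())
            {
                byte[] digest = sha1.ComputeHash(data);
                return BitConverter.ToInt32(digest, 0);
            }
        }
    }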

Solution 4

If you are looking for performance, I tested a few hash keys, and I recommend Bob Jenkins' hash function. It is both crazy fast to compute and will give as few collisions as the cryptographic hash you have been using until now.

I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.
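
For reference, here is a minimal C# sketch of Jenkins' simpler "one-at-a-time" hash - not the full C implementation linked above, just an illustration of the same family that avoids linking against C at all:

    public static int JenkinsOneAtATime(byte[] data)
    {
        // Mix each byte into the running hash, then apply the final avalanche steps.
        unchecked
        {
            uint hash = 0;
            foreach (byte b in data)
            {
                hash += b;
                hash += hash << 10;
                hash ^= hash >> 6;
            }
            hash += hash << 3;
            hash ^= hash >> 11;
            hash += hash << 15;
            return (int)hash;
        }
    }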

Solution 5

Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.


Comments

  • Andrew (almost 2 years ago)

    Say I have an object that stores a byte array and I want to be able to efficiently generate a hashcode for it. I've used cryptographic hash functions for this in the past because they are easy to implement, but they do a lot more work than necessary to be cryptographically one-way, and I don't care about that (I'm just using the hash code as a key into a hashtable).

    Here's what I have today:

    struct SomeData : IEquatable<SomeData>
    {
        private readonly byte[] data;
        public SomeData(byte[] data)
        {
            if (null == data || data.Length <= 0)
            {
                throw new ArgumentException("data");
            }
            this.data = new byte[data.Length];
            Array.Copy(data, this.data, data.Length);
        }
    
        public override bool Equals(object obj)
        {
            return obj is SomeData && Equals((SomeData)obj);
        }
    
        public bool Equals(SomeData other)
        {
            if (other.data.Length != data.Length)
            {
                return false;
            }
            for (int i = 0; i < data.Length; ++i)
            {
                if (data[i] != other.data[i])
                {
                    return false;
                }
            }
            return true;
        }
        public override int GetHashCode()
        {
            // Hash the contents with MD5 and take the first four bytes as the hash code.
            using (var md5 = new MD5CryptoServiceProvider())
            {
                return BitConverter.ToInt32(md5.ComputeHash(data), 0);
            }
        }
    }
    

    Any thoughts?


    dp: You are right that I missed a check in Equals; I have updated it. Using the existing hashcode from the byte array will result in reference equality (or at least that same concept translated to hash codes). For example:

    byte[] b1 = new byte[] { 1 };
    byte[] b2 = new byte[] { 1 };
    int h1 = b1.GetHashCode();
    int h2 = b2.GetHashCode();
    

    With that code, even though the two byte arrays contain the same values, they refer to different parts of memory and will (probably) produce different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.

  • Jonathan C Dickinson (over 15 years ago)
    SHA1 is slower than MD5. If you are not worried about security then use MD5.
  • Andrew Hare (about 15 years ago)
    +1 That was one of the clearest explanations I have ever heard for why it is beneficial to override Equals and GetHashCode.
  • Deepak (over 11 years ago)
    Thanks Jon, the SHA1CryptoServiceProvider.ComputeHash method worked for me!
  • nicolas2008 (over 6 years ago)
    Could you explain your hash algorithm?
  • Daniel Bişar (about 4 years ago)
    You can call C functions from C# via P/Invoke. It has some performance impact (such as pinning and marshalling the passed parameters, depending on the types involved), but that is negligible as long as you are not calling them extremely frequently (i.e. thousands of times in a loop). Even some graphics rendering frameworks (namely OpenTK and SkiaSharp) use a lot of P/Invoke calls and performance is still decent. (A sketch of the P/Invoke approach is shown below.)
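
As an illustration of that last comment, here is a minimal P/Invoke sketch (my own, not from the thread) that calls hashlittle() from Bob Jenkins' lookup3.c. It assumes the C file has been compiled into a native library named "lookup3"; the library name, class name and wrapper method are assumptions:

    using System;
    using System.Runtime.InteropServices;

    static class NativeJenkinsHash
    {
        // hashlittle(const void *key, size_t length, uint32_t initval) from lookup3.c.
        // "lookup3" is an assumed library name (lookup3.dll / liblookup3.so).
        [DllImport("lookup3", CallingConvention = CallingConvention.Cdecl)]
        private static extern uint hashlittle(byte[] key, UIntPtr length, uint initval);

        // Wrap the native call and fold the 32-bit result into an int hash code.
        public static int Compute(byte[] data)
        {
            return unchecked((int)hashlittle(data, (UIntPtr)(uint)data.Length, 0));
        }
    }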