How to generate a unique hash code for an object, based on its contents?

50,908

Solution 1

If you need to create a unique hash code, then you're basically talking about a number which can represent as many states as your type can have. For DateTime than means taking the Ticks value and the DateTimeKind, I believe.

You may be able to get away with assuming that the top two bits of the Ticks property are going to be zero, and using those to store the kind. That means you're okay up until the year 7307 as far as I can tell:

private static ulong Hash(DateTime when)
{
    ulong kind = (ulong) (int) when.Kind;
    return (kind << 62) | (ulong) when.Ticks;
}

Solution 2

From a comment:

I'd like something like a GUID based on the objects contents. I don't mind if there's the occasional duplicate every 10 trillion trillion trillion years or so

That seems like an unusual requirement but since that's your requirement, let's do the math.

Let's suppose you make a billion unique objects a year -- thirty per second -- for 10 trillion trillion trillion years. That's 1049 unique objects you're creating. Working out the math is quite easy; the probability of at least one hash collision in that time is above one in 1018 when the bit size of the hash is less than 384.

Therefore you'll need at least a 384 bit hash code to have the level of uniqueness that you require. That's a convenient size, being 12 int32s. If you're going to be making more than 30 objects a second or want the probability to be less than one in 1018 then more bits will be necessary.

Why do you have such stringent requirements?

Here's what I would do if I had your stated requirements. The first problem is to convert every possible datum into a self-describing sequence of bits. If you have a serialization format already, use that. If not, invent one that can serialize all possible objects that you are interested in hashing.

Then, to hash the object, serialize it into a byte array and then run the byte array through the SHA-384 or SHA-512 hashing algorithm. That will produce a professional-crypto-grade 384 or 512 bit hash that is believed to be unique even in the face of attackers trying to force collisions. That many bits should be more than enough to ensure low probability of collision in your ten trillion trillion trillion year timeframe.

Solution 3

You are not talking about a hash code here, you need a number representation of your state - for that to be unique it might have to be incredibly large depending on your object structure.

The reason I need to write this? I'm writing a caching layer using PostSharp.

Why don't you use a regular hashcode instead, and handle collisions by actually comparing the objects? That seems to be the most reasonable approach.

Solution 4

You can calculate ex md5 sum (or something like that) from object serialized to json. If you want only some properties to matter, you can create anonymous object on the way:

 public static string GetChecksum(this YourClass obj)
    {
        var copy = new
        {
           obj.Prop1,
           obj.Prop2
        };
        var json = JsonConvert.SerializeObject(ob);

        return json.CalculateMD5Hash();
    }

I use that for checking if someone messed with my database storing license based data. You can also append json variable with some seed to complicate stuff

Solution 5

An addition to BrokenGlass' answer, which I have voted up and consider to be correct:

Using the GetHashCode/Equals method means that if two objects hash to the same value you 'll be relying in their Equals implementation to tell you if they are equivalent.

Unless these objects override Equals (which would practically mean that they implement IEquatable<T> where T is their type), the default implementation of Equals is going to do a reference comparison. This in turn means that your cache would mistakenly yield a miss for objects which are "equal" in the business sense but have been constructed independently.

Consider the usage model for your cache carefully, because if you end up using it for classes that are not IEquatable and in a manner where you expect to be checking non-reference-equal objects for equality, the cache will turn out to be completely useless.

Share:
50,908
Contango
Author by

Contango

Have been programming for over 22 years (since I was at elementary school). Experienced in C/C++, C#, Python, SQL, etc on both Linux and Windows. Currently working in finance / financial services in the Front Office of one of the larger market makers in Europe. Experienced in full stack development involving WPF, MVVM, DI, OMS, EMS, FIX, FAST, WCF, Tibco, Bloomberg API, etc. Have built systems running at discretionary trading timescales right down to high frequency trading (HFT) timescales, working on everything from the DAL to the user interface. Passionate about great software architecture, writing bug free, maintainable and performant code, and designing great user interfaces.

Updated on January 13, 2022

Comments

  • Contango
    Contango over 2 years

    I need to generate a unique hash code for an object, based on its contents, e.g. DateTime(2011,06,04) should equal DateTime(2011,06,04).

    • I cannot use .GetHashCode() because it might generate the same hash code for objects with different contents.
    • I cannot use .GetID from ObjectIDGenerator as it generates a different hash code for objects with the same contents.
    • If the object contains other sub-objects, it needs to recursively check these.
    • It needs to work on collections.

    The reason I need to write this? I'm writing a caching layer using PostSharp.

    Update

    I think I may have been asking the wrong question. As Jon Skeet pointed out, to be on the safe side, I need as many unique combinations in the cache key as there are combinations of potential data in the object. Therefore, the best solution might be to build up a long string that encodes the public properties for the object, using reflection. The objects are not too large so this is very quick and efficient:

    • Its efficient to construct the cache key (just convert the public properties of the object into a big string).
    • Its efficient to check for a cache hit (compare two strings).
  • Contango
    Contango about 13 years
    Wow, incredibly fast answer. However, I need a unique value for any object, not just DateTime. And, on further thought, I don't mind if the odd object returns the same hash code, as I don't mind the occasional cache miss.
  • Jon Skeet
    Jon Skeet about 13 years
    @Gravitas: You haven't specified how big your hash code can be. I've given a sample for DateTime which will be unique and will work for a pretty wide range of values... but it gives a 64 bit value. If you need a 32 bit value, we'll need to take a different approach.
  • Tridus
    Tridus about 13 years
    Beat me to it. :) The problem isn't solvable with some super hash code. Use a more normal hash code and when it finds a match, use something Equals() to check if they're really the same thing.
  • Contango
    Contango about 13 years
    You're actually right: I could probably use .GetHashCode, as I don't really mind if there's the occasional cache miss.
  • Waihon Yew
    Waihon Yew about 13 years
    +1: this is of course the solution, and what the .NET hash-based containers do.
  • Waihon Yew
    Waihon Yew about 13 years
    @Gravitas: The definition of GetHashCode ensures that you will never have a problem with false negatives. It's false positives that you have to watch out for.
  • Contango
    Contango about 13 years
    @Jon - yes, you're right: I really don't want an incorrect answer to be returned for a different set of parameters. Back to the drawing board - I really need something like a GUID based on the objects contents.
  • Contango
    Contango about 13 years
    I'm not sure how I could handle collisions by comparing the objects - in PostSharp, you have to compress everything down to a string as the cache key. The objects themselves are not saved.
  • BrokenGlass
    BrokenGlass about 13 years
    @Gravitas - based on the cache key how do you retrieve the actual object then? Or would you retrieve the object if you didn't have the caching layer?
  • Waihon Yew
    Waihon Yew about 13 years
    This does not satisfy the OP's very first requirement, obviously.
  • LukeH
    LukeH about 13 years
    Off-topic, but for what it's worth you can use the built-in ToBinary and FromBinary methods to serialise/deserialise between DateTime and long (encapsulating both Kind and Ticks).
  • Eric Lippert
    Eric Lippert about 13 years
    Can you explain why this is unfeasible? The entire public key infrastructure that underlies the web security model depends on this problem having a feasible solution; since there is such an infrastructure, that's evidence that a feasible, low-cost solution exists. It is quite normal (as you correctly note) for a 32 bit hash code to have collisions, but extremely rare to the point of impossibility for a 128 bit hash code to have accidental collisions. The fact that obviously there are more than 2^128 possible strings is irrelevant; it's a big space compared to the number of documents hashed.
  • Eric Lippert
    Eric Lippert about 13 years
    Also, I think your math is a bit off there. I think you meant to say "in fact, for any hash code with finite length of n bits, there will be collisions for collections of more than 2^n elements". Right? (And more to the point, the probability of a collision happening becomes very large when there are more than 2^(n/2) elements)
  • smartcaveman
    smartcaveman almost 13 years
    how would you get a 32 bit value?
  • smartcaveman
    smartcaveman almost 13 years
    public class HaHa { public HaHa Recursive { get{ return this;} } }
  • fre0n
    fre0n almost 13 years
    @smartcaveman: good point. It's easy to fix the case where a class has a property that returns itself. But not so easy is where a pair of classes have properties that reference each other.
  • aliceraunsbaek
    aliceraunsbaek almost 10 years
    I get a TargetParameterCountException for Item in List (thrown from line with AppensFormat"{0}{1}_{2}|)
  • Lasse V. Karlsen
    Lasse V. Karlsen over 7 years
    Considering this is both an old question and a question that has multiple good answers as well as clarification from the OP about what the requirements was then your answers falls prey to the test of time. Your answer is only going to be good up to objects of a maximum of 128 bits, or 16 bytes, which is hardly a general solution. In lieu of the clarification in the bottom of the question, the json string would be more appropriate than its MD5 checksum.
  • Krzysztof Skowronek
    Krzysztof Skowronek over 7 years
    I mean, it does EXACTLY what OP meant: gets unique hash code of object, that changes when properties change, and is same for two different instances with same properties values. MD5 sum is only for constant length, even for collections. Could you elaborate thy json itself would be better? EDIT: ok, i get it :) but still, longer hash from json can be a valid way to go