How do I generate an (almost) unique hash ID for objects?

12,354

Solution 1

If you want to create a hash of all of your data, you'll need to make sure that you can get every value in it as bytes.

To do this, it's best if you have control of all the classes (except the Java built-in ones, perhaps), so that you can add a method to them to do this.

Given that your object is very large, it will probably not be a good idea to just collect it into one big byte array recursively and then calculate the digest. It's probably better to create the MessageDigest object, and add a method such as:

void updateDigest( MessageDigest md );

to each of them. You can declare an interface for this if you wish. Each such method will collect the class's own data that participates in the "big calculation" and update the md object with that data. After updating all its own data, it should recursively call the updateDigest method of any classes in it that have that method defined.

For example, if you have a class with fields:

int myNumber;
String myString;
MyClass myObj;  // MyClass has the updateDigest method
Set<MyClass> otherObjects;

Then its updateDigest method should be doing something like this:

// Update the "plain" values that are in the current object
byte[] myStringBytes = myString.getBytes(StandardCharsets.UTF_8);
ByteBuffer buff = ByteBuffer.allocate(
                        Integer.SIZE / 8    // For myNumber
                        + Integer.SIZE / 8  // For myString's length
                        + myStringBytes.length
                  );
buff.putInt( myNumber );
buff.putInt( myStringBytes.length );
buff.put( myStringBytes );
buff.flip();
md.update(buff);

// Recurse
myObj.updateDigest(md);

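// Note: a Set such as HashSet has no guaranteed iteration order, so iterate in a
// deterministic (e.g. sorted) order if equal objects must yield equal digests.
for ( MyClass obj : otherObjects ) {
    obj.updateDigest(md);
}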
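Put together, the approach above might look like the following minimal sketch. The names Digestible, DigestLeaf and DigestContainer are hypothetical stand-ins (the answer only specifies the updateDigest method), SHA-256 is an assumed choice of digest, and a List is used instead of a Set so that the iteration order is fixed.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical interface name; only the method comes from the answer.
interface Digestible {
    void updateDigest(MessageDigest md);
}

// Hypothetical stand-in for MyClass: a leaf holding a single long.
final class DigestLeaf implements Digestible {
    private final long value;

    DigestLeaf(long value) { this.value = value; }

    @Override
    public void updateDigest(MessageDigest md) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(value);
        buf.flip();
        md.update(buf);
    }
}

// Hypothetical container with the same kinds of fields as in the answer.
final class DigestContainer implements Digestible {
    private final int myNumber;
    private final String myString;
    private final DigestLeaf myObj;
    private final List<DigestLeaf> otherObjects; // List: deterministic order

    DigestContainer(int myNumber, String myString,
                    DigestLeaf myObj, List<DigestLeaf> otherObjects) {
        this.myNumber = myNumber;
        this.myString = myString;
        this.myObj = myObj;
        this.otherObjects = otherObjects;
    }

    @Override
    public void updateDigest(MessageDigest md) {
        // Plain values first, with length prefixes for variable-size data.
        byte[] stringBytes = myString.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * 3 + stringBytes.length);
        buf.putInt(myNumber);
        buf.putInt(stringBytes.length);
        buf.put(stringBytes);
        buf.putInt(otherObjects.size());
        buf.flip();
        md.update(buf);

        // Then recurse into nested objects.
        myObj.updateDigest(md);
        for (DigestLeaf leaf : otherObjects) {
            leaf.updateDigest(md);
        }
    }

    // Hex string of the SHA-256 digest over all participating fields.
    String hash() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            updateDigest(md);
            return String.format("%064x", new BigInteger(1, md.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Two DigestContainer instances built from equal values should then return the same hash() string, while changing any single field should change it.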

The reason I added the string's length (actually, its byte representation's length) to the digest is to avoid situations where you have two String fields:

String field1 = "ABCD";
String field2 = "EF";

If you just put their bytes directly into the digest one after the other, it will have the same effect on the digest as:

String field1 = "ABC";
String field2 = "DEF";

And this may cause an identical digest to be generated for two different sets of data. So adding the length will disambiguate it.
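The ambiguity can be reproduced with a small sketch. The class and method names are made up for illustration, and SHA-256 is an assumed choice of digest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class LengthPrefixDemo {
    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Feeds the raw bytes of each field into the digest, one after the other.
    static byte[] digestRaw(String... fields) {
        MessageDigest md = newSha256();
        for (String field : fields) {
            md.update(field.getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }

    // Feeds each field's byte length (4 bytes) before the field's bytes.
    static byte[] digestLengthPrefixed(String... fields) {
        MessageDigest md = newSha256();
        for (String field : fields) {
            byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
            md.update(ByteBuffer.allocate(Integer.BYTES).putInt(bytes.length).array());
            md.update(bytes);
        }
        return md.digest();
    }
}
```

Here digestRaw("ABCD", "EF") and digestRaw("ABC", "DEF") see the same byte stream and so return the same digest, whereas the length-prefixed variant sees two different streams.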

I used a ByteBuffer because it's relatively convenient to add things to it like int and double.

If you have classes that you don't control and cannot add a method to, you'll have to be creative. After all, you do get the values from every such class for the calculation, so you may call the same methods and digest their results. Or you could digest their serialized form if they are serializable.
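For the serialized-form option, one possible sketch streams the object through java.security.DigestOutputStream so that the serialized bytes never have to be held in memory at once. The class name SerializedDigest and its methods are hypothetical, and SHA-256 is an assumed digest. Be aware that default Java serialization may not produce identical bytes for equal objects in every case (hash-based collections are one example), so this only works for classes whose serialized form is deterministic.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SerializedDigest {
    // Sink that throws the bytes away; only the digest sees them.
    private static final class DiscardingStream extends OutputStream {
        @Override public void write(int b) { /* discard */ }
        @Override public void write(byte[] b, int off, int len) { /* discard */ }
    }

    // Updates md with the Java-serialized form of obj.
    static void updateWithSerializedForm(MessageDigest md, Serializable obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new DigestOutputStream(new DiscardingStream(), md))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: SHA-256 over the serialized form of one object.
    static byte[] sha256OfSerialized(Serializable obj) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            updateWithSerializedForm(md, obj);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```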

So in your top-level class you'll create the md object using MessageDigest.getInstance("SHA") (which the standard JDK provider resolves to SHA-1, a 160-bit digest) or whatever digest you wish to use.

MessageDigest md = null;
try {
    md = MessageDigest.getInstance("SHA");
} catch (NoSuchAlgorithmException e) {
    // Handle properly
}

// Call md.update with class's own data and recurse using
// updateDigest methods of internal objects

// Compute the digest
byte [] result = md.digest();

// Convert to string to be able to use in a hash map
BigInteger mediator = new BigInteger(1,result);
String key = String.format("%040x", mediator);

(You could actually use the BigInteger itself as the key).
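Using that key for the lookup the question describes could be sketched as follows. DigestKeyedCache is a hypothetical helper, not part of the answer; it pads the hex string to the digest's actual length, so it is not tied to a 160-bit digest.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical cache: maps the hex form of a digest to a computed result.
final class DigestKeyedCache<R> {
    private final Map<String, R> results = new HashMap<>();

    // Zero-padded lowercase hex, two characters per digest byte.
    // Assumes a non-empty digest.
    static String toHexKey(byte[] digest) {
        return String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
    }

    // Runs the expensive calculation only if no result is stored for this digest.
    R getOrCompute(byte[] digest, Supplier<R> expensiveCalculation) {
        return results.computeIfAbsent(toHexKey(digest), key -> expensiveCalculation.get());
    }

    int size() {
        return results.size();
    }
}
```

Only the short key and the result are kept in the map, so the large source object can be discarded once its digest has been computed.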

Solution 2

You are calling hash() on an object, and your objective is to remember the result because computation is expensive and the result is invariant unless some state changes?

So why not keep the result in an instance variable of the object? Have some logic like:

  Result calculate() {
      if ( m_cachedResult == null ) {
          m_cachedResult = originalCalculate(); // refactored original
      }
      return m_cachedResult;
  }

Then, if you can ensure that all relevant state is modified via setters on this class, clear the cache whenever recomputation is needed:

  setThing(newValues) {
      m_cachedResult = null; // invalidate the cached result
      // process new state values
  }
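A compilable version of this idea might look like the sketch below. CachedCalculation, its String field and the computations counter are all hypothetical; the string length stands in for the expensive calculation.

```java
import java.util.Objects;

final class CachedCalculation {
    private String thing;
    private Integer cachedResult; // null means "not computed yet"
    private int computations;     // counts real computations, for illustration

    CachedCalculation(String thing) {
        this.thing = Objects.requireNonNull(thing);
    }

    // Returns the cached result, computing it only when the cache is empty.
    int calculate() {
        if (cachedResult == null) {
            cachedResult = originalCalculate();
        }
        return cachedResult;
    }

    // Every state change goes through a setter that invalidates the cache.
    void setThing(String newThing) {
        cachedResult = null;
        thing = Objects.requireNonNull(newThing);
    }

    int computations() {
        return computations;
    }

    // Stand-in for the expensive work.
    private int originalCalculate() {
        computations++;
        return thing.length();
    }
}
```

Note that this caches per object instance only; a second object holding equal values would still recompute.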

Solution 3

Java actually already provides a class for this, called UUID:

A class that represents an immutable universally unique identifier (UUID). A UUID represents a 128-bit value.

For example:

import java.util.UUID;

public class GenerateUUID {
   public static UUID generate() {
        UUID idOne = UUID.randomUUID();
        return idOne;
   }
}

Then just check whether the ID already exists among the created objects (which is almost impossible) and generate a new one if necessary.
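The check-and-retry step could be sketched like this; UuidIds and both method names are hypothetical. The second method is an addition beyond what this answer proposes: a random UUID differs on every call, so if equal content must map to the same ID, the name-based UUID.nameUUIDFromBytes (which derives the ID from the bytes it is given) is one option to consider.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.UUID;

final class UuidIds {
    // Draws random UUIDs until one is not yet in 'used', then records it.
    static UUID newUnusedId(Set<UUID> used) {
        UUID id;
        do {
            id = UUID.randomUUID();
        } while (!used.add(id)); // add() returns false if the ID was already present
        return id;
    }

    // Deterministic alternative: the same content always yields the same UUID.
    static UUID idFromContent(String content) {
        return UUID.nameUUIDFromBytes(content.getBytes(StandardCharsets.UTF_8));
    }
}
```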

Solution 4

Computing some hash-like identifier is not a good way to do that in general. The chance of a collision is extremely low, but it can still happen. Keep in mind that a hash is not a 100% random number; in most cases it is linked to the input data in some way, so, depending on your hash method, some hash values could be unreachable, or, in the worst case for you, some of them could be shared by quite a large set of input objects. The probability could be computed precisely, but that's a matter for computer science and probability theory.

Using a digest function (MD5, SHA, etc.) could help a lot, but still won't resolve the problem completely.

The solution I'd prefer is similar to Jordi's: enhance your class with some identifier. Depending on your project, I'd use, for instance, the creation date and/or the name of the task. A String name or description of the task could also make debugging easier.

If these aren't unique enough, you can always add a unique numeric counter (or a UUID instance).
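As a rough sketch of such an identifier, the hypothetical class below combines a name, a creation timestamp and a process-wide counter. The class name and the "name@time#counter" format are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

final class IdentifiedTask {
    // Shared counter, so every instance gets a distinct sequence number.
    private static final AtomicLong COUNTER = new AtomicLong();

    private final String name;
    private final Instant createdAt = Instant.now();
    private final long sequence = COUNTER.incrementAndGet();

    IdentifiedTask(String name) {
        this.name = name;
    }

    // Human-readable identifier, which also helps when debugging.
    String id() {
        return name + "@" + createdAt + "#" + sequence;
    }
}
```

Unlike a content hash, this identifies the instance rather than its values: two tasks with identical data still get different IDs.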

Author by

phil294


Updated on August 06, 2022

Comments

  • phil294
    phil294 over 1 year

    How can I get an ID for my objects that makes it easy to distinguish it from others?

    class MyClass {
        private String s;
        private MySecondClass c;
        private Collection<someInterface> coll;
        // ..many more
    
        public Result calculate() {
            /* use all field values recursively to calculate the result */
            /* takes considerable amount of time. Implemented */
            return result;
        }
    
        public String hash() {
            /* use all field values recursively to generate a unique identifier */
            // ?????
        }
    }

    calculate() usually takes ~40 seconds to complete. Thus, I do not want to call it multiple times.

    MyClass objects are quite huge (~60 MB). The Result value of the calculation will only be ~100 KB.

    Whenever I am about to run the calculation on an object, my program should check whether it has already been done earlier with the exact same values, recursively. If so, it will look up the result in (e.g.) a HashMap instead. Basically, the MyClass objects themselves could be used as keys, but the HashMap will include 30-200 elements, and I obviously don't want to store all of that in full size. That's why I want to store 30-200 hash/result pairs instead.

    So, I thought I'd generate an ID (hash) over all values inside my MyClass object. How do I do that? This way, I can use that very hash to look up the result. I am aware that a hash code like MD5 will not guarantee 100% uniqueness, because multiple objects might have the same hash. However, if I store (at most) 200 elements via MD5, the chance of a hash being used twice will be negligible, I think. There are 16^32=3.4e38 different hash codes possible. I'll be happy to hear anybody's comments about it, or see other approaches.
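    As a rough check of that estimate, the standard birthday approximation puts the collision probability for n items at about n(n-1)/2 divided by the number of possible hash values. The sketch below assumes the hash behaves like a uniformly random 128-bit value, which is an idealization; the class and method names are hypothetical.

```java
final class CollisionEstimate {
    // Birthday approximation: P(collision) is roughly (n choose 2) / 2^bits.
    // Only meaningful while the result is much smaller than 1.
    static double approxCollisionProbability(long items, int hashBits) {
        double pairs = items * (items - 1) / 2.0;
        return pairs / Math.pow(2.0, hashBits);
    }
}
```

    For 200 items and 128 bits this gives 19900 / 3.4e38, i.e. roughly 6e-35.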

    Once the hash is generated, I don't need that object anymore, just its respective result value.

    Two separate objects with the exact same values have to return the same hash code. Much like the original hashCode(), except that I'm trying to maintain uniqueness. The probability of two different objects having the same hash code should be absolutely negligible.

    I don't know how to describe the problem in other words anymore. If further clarification is needed, please ask.

    So how can I generate my MyClass.hash()?

    The problem isn't really about how or where to store the hashes, because I don't even know how I can generate an (almost) unique hash for an entire object, that will always be the same for same values.


    Clarification:

    When talking of size, I mean the serialized size on the hard drive.

    I don't think putting the objects in a HashMap would decrease their size. That's why I want to store some hash String instead: HashMap<hashStringOfMyClassObject, resultValue>

    When you put an object in a HashMap (either as a key or as a value), you don't create a copy of it. So storing 200 large objects in a HashMap consumes little more memory than the 200 objects themselves.

    I do not store 200 large objects themselves. I only keep 200 different results (as values) which are small, and 200 respective hashCodes of MyClass objects which are also very small. The point of "hashing" the objects is to be able to work with the hash instead of with the object values themselves.