How to Create Deterministic Guids

Solution 1

As mentioned by @bacar, RFC 4122 §4.3 defines a way to create a name-based UUID. The advantage of doing this (over just using an MD5 hash) is that these are guaranteed not to collide with non-name-based UUIDs, and have a very (very) small possibility of collision with other name-based UUIDs.

There's no native support in the .NET Framework for creating these, but I posted code on GitHub that implements the algorithm. It can be used as follows:

Guid guid = GuidUtility.Create(GuidUtility.UrlNamespace, filePath);

To reduce the risk of collisions with other GUIDs even further, you could create a private GUID to use as the namespace ID (instead of using the URL namespace ID defined in the RFC).
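For readers who want to see what the name-based algorithm involves, here is a minimal sketch of a version 5 (SHA-1) UUID in C#. It is not the GitHub code referred to above; the class name `NameBasedGuid` and its members are hypothetical, and the sketch assumes the name is encoded as UTF-8. The two namespace IDs are the ones listed in RFC 4122 Appendix C.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: a sketch of RFC 4122 section 4.3, version 5 only.
public static class NameBasedGuid
{
    // Predefined namespace IDs from RFC 4122 Appendix C.
    public static readonly Guid DnsNamespace = new Guid("6ba7b810-9dad-11d1-80b4-00c04fd430c8");
    public static readonly Guid UrlNamespace = new Guid("6ba7b811-9dad-11d1-80b4-00c04fd430c8");

    public static Guid CreateV5(Guid namespaceId, string name)
    {
        // The RFC hashes the namespace ID in network (big-endian) byte order,
        // but Guid.ToByteArray() returns the first three fields little-endian.
        byte[] namespaceBytes = namespaceId.ToByteArray();
        SwapByteOrder(namespaceBytes);

        // Assumption: the name is encoded as UTF-8.
        byte[] nameBytes = Encoding.UTF8.GetBytes(name);

        // Hash the namespace ID followed by the name.
        byte[] data = new byte[namespaceBytes.Length + nameBytes.Length];
        Buffer.BlockCopy(namespaceBytes, 0, data, 0, namespaceBytes.Length);
        Buffer.BlockCopy(nameBytes, 0, data, namespaceBytes.Length, nameBytes.Length);

        byte[] hash;
        using (SHA1 sha1 = SHA1.Create())
        {
            hash = sha1.ComputeHash(data);
        }

        // Keep the first 16 of the 20 hash bytes.
        byte[] result = new byte[16];
        Array.Copy(hash, result, 16);

        // Set the version nibble to 5 and the variant bits to binary 10.
        result[6] = (byte)((result[6] & 0x0F) | 0x50);
        result[8] = (byte)((result[8] & 0x3F) | 0x80);

        // Convert back to the layout the Guid(byte[]) constructor expects.
        SwapByteOrder(result);
        return new Guid(result);
    }

    // Reverses the three little-endian fields (4, 2 and 2 bytes) in place.
    private static void SwapByteOrder(byte[] bytes)
    {
        Array.Reverse(bytes, 0, 4);
        Array.Reverse(bytes, 4, 2);
        Array.Reverse(bytes, 6, 2);
    }
}
```

Calling `NameBasedGuid.CreateV5(NameBasedGuid.UrlNamespace, filePath)` twice with the same path returns the same Guid, and the first character of the third group in its text form is `5`. A private namespace would simply be another fixed Guid passed as the first argument.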

Solution 2

This will convert any string into a Guid without having to import an outside assembly.

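// Requires: using System; using System.Security.Cryptography; using System.Text;
public static Guid ToGuid(string src)
{
    byte[] stringBytes = Encoding.UTF8.GetBytes(src);
    using (SHA1 sha1 = SHA1.Create())
    {
        // SHA-1 produces 20 bytes; keep the first 16 to fill a Guid.
        // Note: the version and variant bits are left as the hash produced them.
        byte[] hashedBytes = sha1.ComputeHash(stringBytes);
        Array.Resize(ref hashedBytes, 16);
        return new Guid(hashedBytes);
    }
}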

There are much better ways to generate a unique Guid, but this is a way to consistently upgrade a string data key to a Guid data key.
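One detail worth knowing when building a Guid from raw hash bytes: the `Guid(byte[])` constructor reads the first three fields as little-endian integers, so the text form of the Guid does not list the bytes in the order they appear in the hash. The small sketch below illustrates this with sixteen sample bytes standing in for a truncated hash; the class name `GuidByteOrderDemo` is made up for the example.

```csharp
using System;
using System.Linq;

// Hypothetical demo class showing how Guid(byte[]) lays out its input.
public static class GuidByteOrderDemo
{
    public static string Describe()
    {
        // Sixteen sample bytes 00 01 02 ... 0F, standing in for a truncated hash.
        byte[] bytes = Enumerable.Range(0, 16).Select(i => (byte)i).ToArray();
        Guid guid = new Guid(bytes);

        // The 4-byte, 2-byte and 2-byte fields are little-endian, so their
        // bytes appear reversed in the text form; the last 8 bytes keep their order.
        return guid.ToString(); // "03020100-0504-0706-0809-0a0b0c0d0e0f"
    }

    public static bool RoundTrips()
    {
        byte[] bytes = Enumerable.Range(0, 16).Select(i => (byte)i).ToArray();
        // ToByteArray() returns the bytes in the same order they were supplied.
        return new Guid(bytes).ToByteArray().SequenceEqual(bytes);
    }
}
```

This does not affect determinism, but it matters if the same key also has to be computed by code on another platform that formats the hash bytes in plain order.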

Solution 3

As Rob mentions, your method doesn't generate a UUID, it generates a hash that looks like a UUID.

RFC 4122 on UUIDs specifically allows for deterministic (name-based) UUIDs: versions 3 and 5 use MD5 and SHA-1 (respectively). Most people are probably familiar with version 4, which is random. Wikipedia gives a good overview of the versions. (Note that the use of the word 'version' here seems to describe a 'type' of UUID; version 5 doesn't supersede version 4.)

There seem to be a few libraries out there for generating version 3/5 UUIDs, including the Python uuid module, boost.uuid (C++) and OSSP UUID. (I haven't looked for any .NET ones.)

Solution 4

You need to make a distinction between instances of the class Guid, and identifiers that are globally unique. A "deterministic guid" is actually a hash (as evidenced by your call to provider.ComputeHash). Hashes have a much higher chance of collisions (two different strings happening to produce the same hash) than Guids created via Guid.NewGuid.

So the problem with your approach is that you will have to be ok with the possibility that two different paths will produce the same GUID. If you need an identifier that's unique for any given path string, then the easiest thing to do is just use the string. If you need the string to be obscured from your users, encrypt it - you can use ROT13 or something more powerful...

Attempting to shoehorn something that isn't a pure GUID into the GUID datatype could lead to maintenance problems in future...

Solution 5

MD5 is weak; I believe you can do the same thing with SHA-1 and get better results.

BTW, just a personal opinion: dressing an MD5 hash up as a GUID does not make it a good GUID. GUIDs by their very nature are non-deterministic; this feels like a cheat. Why not call a spade a spade and say it's a string-rendered hash of the input? You could do that by using this line, rather than the new Guid line:

string stringHash = BitConverter.ToString(hashBytes);
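
As a fuller illustration of the string-hash idea, the sketch below hashes the input with SHA-1 and renders the result as hex text. The class and method names (`HashText`, `ToHex`) are hypothetical, and UTF-8 is an assumed encoding for the input string.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: renders a SHA-1 hash of the input as a hex string.
public static class HashText
{
    public static string ToHex(string input)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            // Assumption: the input is encoded as UTF-8 before hashing.
            byte[] hashBytes = sha1.ComputeHash(Encoding.UTF8.GetBytes(input));

            // BitConverter.ToString yields "DA-39-A3-..."; drop the dashes for compact hex.
            return BitConverter.ToString(hashBytes).Replace("-", "");
        }
    }
}
```

The result is a 40-character uppercase hex string, which keeps all 160 bits of the hash rather than truncating it to fit a Guid.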

Author: Punit Vora

Updated on July 28, 2021

Comments

  • Punit Vora
    Punit Vora almost 3 years

    In our application we are creating Xml files with an attribute that has a Guid value. This value needed to be consistent between file upgrades. So even if everything else in the file changes, the guid value for the attribute should remain the same.

    One obvious solution was to create a static dictionary with the filenames and the Guids to be used for them. Then whenever we generate the file, we look up the dictionary for the filename and use the corresponding Guid. But this is not feasible because we might scale to hundreds of files and didn't want to maintain a big list of Guids.

    So another approach was to make the Guid the same based on the path of the file. Since our file paths and application directory structure are unique, the Guid should be unique for that path. So each time we run an upgrade, the file gets the same guid based on its path. I found one cool way to generate such 'Deterministic Guids' (Thanks Elton Stoneman). It basically does this:

    private Guid GetDeterministicGuid(string input)
    {
        // use MD5 hash to get a 16-byte hash of the string:
        MD5CryptoServiceProvider provider = new MD5CryptoServiceProvider();
        byte[] inputBytes = Encoding.Default.GetBytes(input);
        byte[] hashBytes = provider.ComputeHash(inputBytes);

        // generate a guid from the hash:
        Guid hashGuid = new Guid(hashBytes);
        return hashGuid;
    }
    

    So given a string, the Guid will always be the same.

    Are there any other approaches or recommended ways of doing this? What are the pros and cons of that method?

  • Punit Vora
    Punit Vora about 14 years
    Thanks for your input, but this still gives me a string, and I am looking for a GUID...
  • Sam
    Sam about 14 years
    Ok, call your hash a "GUID", problem solved. Or is the real problem that you need a Guid object?
  • Punit Vora
    Punit Vora about 14 years
    i wish it were that simple.. :) but yes, i need a 'GUID' object
  • mistertodd
    mistertodd about 13 years
    This is exactly what the original poster is after. UUID already has an algorithm for you to start with a string and convert it into a GUID. UUID version 3 hashes the string with MD5, while version 5 hashes it with SHA1. The important point in creating a "guid" is to make it "unique" against other GUIDs. The algorithm defines two bits that must be set, as well as a nibble that is set to either 3 or 5, depending on whether it's version 3 or 5.
  • Bradley Grainger
    Bradley Grainger about 13 years
    Regarding the use of the word "version", RFC 4122 §4.1.3 states: "The version is more accurately a sub-type; again, we retain the term for compatibility."
  • Bradley Grainger
    Bradley Grainger about 13 years
    I posted some C# code to create v3 and v5 GUIDs on GitHub: github.com/LogosBible/Logos.Utility/blob/master/src/…
  • bacar
    bacar about 13 years
    "GUIDs by their very nature are non Deterministic" - this is only true of certain types ('versions') of GUIDs. However I agree that "dressing an md5 hash up as a GUID does not make a good GUID" for other reasons as spelt out by @Bradley Grainger and @Rob Fonseca-Ensor, and my answer to this question.
  • Sebastian
    Sebastian over 11 years
    @BradleyGrainger, I get Warning Bitwise-or operator used on a sign-extended operand; consider casting to a smaller unsigned type first
  • Bradley Grainger
    Bradley Grainger over 11 years
    @SebastianGodelet: Can you be a bit more specific? Which file, which line number, which version of the C# compiler are you using, etc.? When I build the Logos.Utility project (which is at warning level 4) in VS2012 Express, I get 0 warnings, 0 errors.
  • Sebastian
    Sebastian over 11 years
    @BradleyGrainger, Logos.Utility / src / Logos.Utility / GuidUtility.cs Line 63 newGuid[6] = (byte) ((newGuid[6] & 0x0F) | (version << 4));, I think it's R# complaining here, I changed to: newGuid[6] = (byte) (newGuid[6] & 0x0F | (byte)version << 4); and now no warning
  • bacar
    bacar over 11 years
    This is getting off-topic! Suggest moving individual lib bug reports to GitHub.
  • Gleno
    Gleno over 11 years
    Found this snippet to be useful when using unique identifier in a database for federated distribution.
  • porges
    porges almost 11 years
    Note that while this is useful the implementation doesn't quite get RFC4122 correct, so if you're trying to be compatible with another implementation you'll have trouble (try the example in the C code in the RFC appendix).
  • Bradley Grainger
    Bradley Grainger almost 11 years
    @Porges: RFC4122 is incorrect and has errata that fixes the C code (rfc-editor.org/errata_search.php?rfc=4122&eid=1352). If this implementation is not fully compliant with RFC4122 and its errata, please provide further details; I would like to make it follow the standard.
  • porges
    porges almost 11 years
    @BradleyGrainger: I didn't notice that, thanks/sorry! I should always remember to check the errata when reading an RFC... :)
  • Bradley Grainger
    Bradley Grainger almost 11 years
    @Porges: You're welcome/no problem. It boggles the mind that they don't update the RFC in-place with the corrections from the errata. Even a link at the end of the document would be vastly more helpful than relying on the reader to remember to search for errata (hopefully before writing an implementation based on the RFC...).
  • porges
    porges almost 11 years
    @BradleyGrainger: if you use the HTML version it has a link to the errata from the header, e.g. tools.ietf.org/html/rfc4122. I wonder if there's a browser extension to always redirect to the HTML version...
  • FelyAnony
    FelyAnony about 8 years
    Warning! This code does not generate valid Guids / UUIDs (as bacar also mentioned below). Neither the version nor the type field are set correctly.
  • FelyAnony
    FelyAnony about 8 years
    You claim "Hashes have a much higher chance of collisions ... than Guid created via Guid.NewGuid.". Can you elaborate on that? From a mathematical point of view, the number of bits one can set is the same, and both MD5 and SHA1 are cryptographic hashes, specifically designed to lower the probability of (accidental and intentional) hash collisions.
  • Brain2000
    Brain2000 about 8 years
    Wouldn't it be just as effective to use the MD5CryptoServiceProvider instead of the SHA1, since MD5 is already 16 bytes in length?
  • sapphiremirage
    sapphiremirage about 7 years
    You should consider contributing this to .NET. The .NET repo is here: github.com/dotnet/coreclr/tree/master/src/mscorlib/src/System
  • Thai Bui
    Thai Bui over 4 years
    I would say the main difference is that cryptographic hashes map from an infinite input space to a fixed output space using a function. Imagine a hash that maps variable-length strings to 128 bits, whereas Guid generates 128 pseudo-random bits. Pseudo-random generation doesn't rely on an initial input; instead it generates the output uniformly across the output space using randomness seeded from the hardware or other means.
  • angularsen
    angularsen over 3 years
    The github was perfect for me, thanks. This gist is a copy of the modifications I made in order to strip out all the unnecessary parts, unrelated to namespace guids. gist.github.com/angularsen/92a3ba9d9a94d250accd257f9f5a3d54