How is GetHashCode() of C# string implemented?

36,351

Solution 1

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

This is possibly more than you bargained for, I'll annotate the code a bit:

  • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
  • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
  • The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
  • The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
  • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
  • The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
  • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
  • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
  • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

Solution 2

Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.

// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}
Share:
36,351
Louis Rhys
Author by

Louis Rhys

trying to learn and help others learn :)

Updated on July 05, 2022

Comments

  • Louis Rhys
    Louis Rhys almost 2 years

    I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

  • Louis Rhys
    Louis Rhys about 11 years
    yeah, I saw that, but I have no clue what it does :o
  • Louis Rhys
    Louis Rhys about 11 years
    wait. On second reading, it seems that it is different from the code in my ILSpy. Mine doesn't have the for loop over Length. Maybe it is implemented differently in different platform
  • Raymond Chen
    Raymond Chen about 11 years
    Um, it hashes the string. You did say you wanted to know what it does, so there it is. The string hash algorithm is different for different versions of the CLR.
  • Ergwun
    Ergwun about 11 years
    @LouisRhys - that was the one from .NET 2.0 (since I already had that loaded in ILSpy). I've replaced it with the one from .NET 4.0 - looks very similar.
  • user1703401
    user1703401 over 10 years
    The link is in the first sentence of the post.
  • redcalx
    redcalx about 10 years
    "it terminates early on an embedded zero" - That is bizarre. I tested it and sure enough the 64bit version ignores chars after a \0 (32bit doesn't). And since NULL is a valid unicode character, this is technically a bug IMO.
  • redcalx
    redcalx about 10 years
  • data
    data almost 10 years
    @locster the bug was closed with "as By Design". Has anyone found an explanation why the 64-bit version doesn't hash characters after \0?
  • redcalx
    redcalx almost 10 years
    If MS claim it was by design then they need to give an explanation, otherwise the reasonable assumption is that it is an error but perhaps one they can no longer fix without risks. In live systems there is a .Net config setting to specify use of an alternative hashing scheme, hence there is a simple workaround. Still seems dumb to not address this given that's it's such a key class in the system, and 64 bit is becoming the norm.
  • Jon Hanna
    Jon Hanna almost 9 years
    "designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption" It's not going to help you a lot there, a bit but not a lot. It would help if you were hashing strings that came direct from user-input leaving you open to hash-dos attacks.
  • Bahadir Acar
    Bahadir Acar over 8 years
    It seems to me that they used \0 to terminate because they don't have exact length stored anywhere while outside of WIN32 block. So they went for the null termination. String class' fields may differ for platforms. Just a guess.
  • Tamas Ionut
    Tamas Ionut about 8 years
    How come is not computed only once since string is immutable?
  • Drew Noakes
    Drew Noakes about 6 years
    The 64-bit version of the loop is not unrolled, in the usual sense. The two stages operate on hash1 and hash2 respectively, which are then combined when the loop exits. The zero check is to test for the end of the null-terminated string (in odd positions), as is done in the while condition (for even positions).
  • Paul Groke
    Paul Groke over 5 years
    @HansPassant You could store the hash code basically anywhere in the string object. Before the length field, between the length field and the string data or even after the string data.
  • M.kazem Akhgary
    M.kazem Akhgary over 5 years
    hash code could be stored in the string class itself? or maybe its impossible to change this implementation because of backwards compatibility?
  • ShadowRanger
    ShadowRanger over 4 years
    @JonHanna: To my knowledge, that's the actual reason it exists; passwords and encryption "benefits" are accidental, the real goal is ensuring that web-facing applications can't be attacked with inputs (that are stored in a dictionary) that will reliably collide on the hash code (changing O(1) optimal and normal case operations into O(n) worst case performance for every lookup, insertion, etc.).
  • Yawar Murtaza
    Yawar Murtaza over 3 years
    Just out of the topic....the while loop confirms that GetHashCode() is a linear operation.