Consistent String#hash based only on the string's content

28,292

Solution 1

there are lot of such functionality in ruby's digest module: http://ruby-doc.org/stdlib/libdoc/digest/rdoc/index.html

simple example:

require 'digest/sha1'
Digest::SHA1.hexdigest("some string")

Solution 2

The easiest (and consistent) way may be this (and it's fast):

"https://www.example.com/abc/def/123?hij=345".sum % 4

That will always produce an integer 0 - 3, is quite fast, and should be fairly well distributed (though I haven't actually run tests on distribution).

Solution 3

There is tiny library xxHash:

XXhash.xxh32('qwe') #=> 2396643526
XXhash.xxh64('qwe') #=> 9343136760830690622

Maybe it will have more collisions but it is 10x faster than SHA1:

Benchmark.bm do |x|
  n = 100_000
  str = 'qweqweqwe'
  x.report('xxhash32') { n.times { XXhash.xxh32(str) } }
  x.report('xxhash64') { n.times { XXhash.xxh64(str) } }
  x.report('hexadigest') { n.times { Digest::SHA1.hexdigest(str) } }
end;1

#       user     system      total        real
# xxhash32  0.020000   0.000000   0.020000 (  0.021948)
# xxhash64  0.040000   0.000000   0.040000 (  0.036340)
# hexadigest  0.240000   0.030000   0.270000 (  0.276443)

Solution 4

You can try to_i(36).

"Hash me please :(".to_i(36)
=> 807137
Share:
28,292
Rob Davis
Author by

Rob Davis

Updated on August 30, 2020

Comments

  • Rob Davis
    Rob Davis almost 4 years

    GOAL: Map every URL handled by a server to 0, 1, 2, or 3, distributing as uniformly as possible.

    While the documentation for ruby's String#hash method says it will "return a hash based on the string‘s length and content," this clearly isn't the whole story. A given string's hash is not consistent across invocations of the interpreter:

    $ irb
    ruby-1.9.2-p180 :001 > "foo".hash
     => 360517580588231756 
    ruby-1.9.2-p180 :002 > ^D
    
    $ irb
    ruby-1.9.2-p180 :001 > "foo".hash
     => -2716152678666510148 
    

    This means a particular string's hash value may differ across, say, servers. Rails uses String#hash internally to map a URL path to one of four asset hosts (if the app's asset_host is so configured), but this feature is a lot less efficient than it could be because of the cross-machine inconsistencies; different servers may map the same URL to different asset hosts, reducing the effectiveness of caches, clouding skies, cooling cups of tea prematurely, besmirching the reputations of otherwise fine programmers.

    Can you suggest an alternate hash function that could effectively and speedily distribute hashes across a typical app's URL space, preferably one that produces a Fixnum since, in the end, I'll want to map it into one of four asset hosts?