A quick string checksum function in Perl generating values in the 0..2^32-1 range
Solution 1
Any hash function will be sufficient: simply truncate its output to 4 bytes and convert that to a number. Good hash functions produce uniformly distributed output, and that distribution holds no matter which 4 bytes of the digest you keep.
I suggest Digest::MD5 because it is the fastest hash implementation that comes with Perl as standard. String::CRC, as Pim mentions, is also implemented in C and should be faster.
Here's how to calculate the hash and convert it to an integer:
use Digest::MD5 qw(md5);
my $str = substr( md5("String-to-hash"), 0, 4 );
print unpack('L', $str); # Interpret the 4 bytes as an unsigned 32-bit integer (native byte order)
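Because the question asks about Unicode input, here is a minimal sketch of the same idea wrapped in a function. It assumes the string should be hashed as its UTF-8 encoding (md5() works on bytes and croaks on wide characters) and swaps in the 'N' template so the number is the same on every platform. The name checksum is taken from the question's pseudo-code and is only illustrative.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

# Hypothetical helper: 32-bit checksum from the first 4 bytes of an MD5 digest.
sub checksum {
    my ($string) = @_;

    # Assumption: hash the UTF-8 encoding of the string, since md5()
    # operates on bytes and dies on characters above 0xFF.
    my $digest = md5( encode_utf8($string) );

    # 'N' = unsigned 32-bit integer, big-endian, so the value does not
    # depend on the byte order of the machine.
    return unpack( 'N', substr( $digest, 0, 4 ) );
}

printf "%u\n", checksum("String-to-hash \x{263A}");
```

The result always fits in 0..4294967295, so it can be stored in a 4-byte unsigned column. Note that 'N' and 'L' give different numbers on little-endian machines, so pick one and keep it.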
Solution 2
From perldoc -f unpack:
For example, the following computes the same number as the
System V sum program:
$checksum = do {
local $/; # slurp!
unpack("%32W*",<>) % 65535;
};
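The perldoc snippet reads from input files. As a rough sketch, the same template can be applied to a string already in memory; sysv_sum is a hypothetical name, not something perldoc defines.

```perl
use strict;
use warnings;

# Hypothetical helper: System V-style sum of an in-memory string.
sub sysv_sum {
    my ($string) = @_;

    # %32 asks unpack for a 32-bit sum of the items; W* reads every
    # character as an unsigned value, which may be greater than 255.
    return unpack( "%32W*", $string ) % 65535;
}

print sysv_sum("abc"), "\n";    # 97 + 98 + 99 = 294
print sysv_sum("cba"), "\n";    # also 294: character order is ignored
```

Because of the final % 65535, the result only covers 0..65534, a small part of the 0..2^32-1 range asked for, and strings made of the same characters in a different order collide.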
Solution 3
Don't know how quick it is, but you might try String::CRC.
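String::CRC is a CPAN module and may not be installed. As a hedged alternative, not the module this answer names, the sketch below uses the crc32() function from Compress::Zlib, which ships with recent Perl releases. The helper name crc32_checksum and the choice to hash the UTF-8 encoding are assumptions made for illustration.

```perl
use strict;
use warnings;
use Compress::Zlib;              # exports crc32() by default
use Encode qw(encode_utf8);

# Hypothetical helper: CRC-32 of a Unicode string. Not part of String::CRC.
sub crc32_checksum {
    my ($string) = @_;

    # Assumption: crc32() is fed bytes, so encode the string as UTF-8 first.
    return crc32( encode_utf8($string) );
}

printf "%u\n", crc32_checksum("123456789");    # 3421780262 (0xCBF43926)
```

A CRC-32 is already an unsigned 32-bit value, so no truncation step is needed.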
knorv
Updated on June 03, 2022
Comments
-
knorv almost 2 years
I'm looking for a Perl string checksum function with the following properties:
- Input: Unicode string of undefined length ($string)
- Output: Unsigned integer ($hash), for which 0 <= $hash <= 2^32-1 holds (0 to 4294967295, matching the size of a 4-byte MySQL unsigned int)
Pseudo-code:
sub checksum {
    my $string = shift;
    my $hash;
    ... checksum logic goes here ...
    die unless ($hash >= 0);
    die unless ($hash <= 4_294_967_295);
    return $hash;
}
Ideally the checksum function should be quick to run and should generate values somewhat uniformly in the target space (0..2^32-1) to avoid collisions. In this application random collisions are totally non-fatal, but obviously I want to avoid them to the extent that it is possible.
Given these requirements, what is the best way to solve this?
-
rurban about 9 years
B::hash also comes with core perl, uses the internal core hash function, is faster than MD5 and returns a hexified 32-bit integer. But it is not as secure as MD5.
-
rurban about 9 years
This 32-bit sum of all bits is a very bad hash value for random distributions. Any hash function is better, even the most simple ones.
-
Randal Schwartz about 9 years
Sure, but that's the same problem that the System V sum program has. See the paragraph. Or are you arguing that sum is broken? In that case, it's not about Perl.
-
Adam Katz over 8 years
sum is about as quick as you'll get, though as noted above, it isn't terribly robust. You can improve it slightly by using the size, e.g. $_ = <>; unpack("%32W*",$_)%65535 . length($_). Anything that needs to be more robust should use Digest::MD5 or Digest::SHA, etc.