PHP best way to MD5 multi-dimensional array?

Solution 1

(Copy-n-paste-able function at the bottom)

As mentioned prior, the following will work.

md5(serialize($array));

However, it's worth noting that (ironically) json_encode performs noticeably faster:

md5(json_encode($array));

In fact, the speed increase is two-fold here as (1) json_encode alone performs faster than serialize, and (2) json_encode produces a smaller string and therefore less for md5 to handle.
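
To see point (2) concretely, the sketch below compares the length of the two encodings for a small array. The sample data is made up for illustration, so the exact sizes will differ for your own arrays.

```php
<?php
// Hypothetical sample data, chosen only to illustrate the size difference
$sample = array(
    'id'     => 7,
    'tags'   => array('a', 'b'),
    'nested' => array(array(), array(1, 2)),
);

$serialized = serialize($sample);
$json       = json_encode($sample);

// serialize() writes type tags (plus lengths for strings and arrays),
// so its output is more verbose than the JSON form
echo strlen($serialized) . " bytes via serialize()\n";
echo strlen($json) . " bytes via json_encode()\n";
```

A shorter string means fewer 64-byte blocks for md5() to process, although for small arrays the encoding step itself is likely to dominate.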

Edit: Here is evidence to support this claim:

<?php //this is the array I'm using -- it's multidimensional.
$array = unserialize('a:6:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}i:5;a:5:{i:0;a:0:{}i:1;a:4:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}i:3;a:6:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}i:5;a:5:{i:0;a:0:{}i:1;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:3:{i:0;a:0:{}i:1;a:0:{}i:2;a:0:{}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}}}}i:2;s:5:"hello";i:3;a:2:{i:0;a:0:{}i:1;a:0:{}}i:4;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:1:{i:0;a:0:{}}}}}}}}}');

//The serialize test
$b4_s = microtime(1);
for ($i=0;$i<10000;$i++) {
    $serial = md5(serialize($array));
}
echo 'serialize() w/ md5() took: '.($sTime = microtime(1)-$b4_s).' sec<br/>';

//The json test
$b4_j = microtime(1);
for ($i=0;$i<10000;$i++) {
    $serial = md5(json_encode($array));
}
echo 'json_encode() w/ md5() took: '.($jTime = microtime(1)-$b4_j).' sec<br/><br/>';
//Report the ratio of the two timings (a value above 1 means json_encode won)
echo 'json_encode ran <strong>'.( round($sTime/$jTime,2) ).'x</strong> as fast as serialize, with a difference of <strong>'.($sTime-$jTime).' seconds</strong>';

In this test, json_encode is consistently over 2.5x as fast as serialize (often over 3x), which is not a trivial difference. The result depends on the shape of the data, so run the test against your own arrays before relying on it.

Now, one thing to note is that array(1,2,3) will produce a different MD5 than array(3,2,1). If this is NOT what you want, try the following code:

//Optionally make a copy of the array (if you want to preserve the original order)
$original = $array;

array_multisort($array);
$hash = md5(json_encode($array));

Edit: There's been some question as to whether reversing the order would produce the same results. So, I've done that (correctly) here:

As you can see, the results are exactly the same. Here's the (corrected) test originally created by someone related to Drupal:

And for good measure, here's a function/method you can copy and paste (tested in PHP 5.3.3-1ubuntu9.5):

function array_md5(array $array) {
    //the function receives a copy of the array (not a reference),
    //so sorting it here does not change the caller's array
    array_multisort($array);
    return md5(json_encode($array));
}
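
As a usage sketch, the snippet below repeats array_md5() so that it runs on its own, and checks what the sort does and does not cover. It assumes that only top-level order matters to you: array_multisort() reorders the top level only, so nested arrays whose elements appear in a different order still produce different hashes.

```php
<?php
// Same function as above, repeated so this snippet is self-contained
function array_md5(array $array) {
    array_multisort($array);
    return md5(json_encode($array));
}

// Flat arrays with the same values in a different order hash identically
$flatMatches = array_md5(array(1, 2, 3)) === array_md5(array(3, 2, 1));
var_dump($flatMatches);

// Nested arrays are not reordered, so these two hashes differ
$nestedMatches = array_md5(array(array(1, 2))) === array_md5(array(array(2, 1)));
var_dump($nestedMatches);
```

If nested order should be ignored as well, a recursive sort is needed before hashing.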

Solution 2

md5(serialize($array));

Solution 3

I'm joining a very crowded party by answering, but there is an important consideration that none of the extant answers address: the output of both json_encode() and serialize() depends on the order of elements in the array!

Here are the results of not sorting and sorting the arrays, on two arrays with identical values but added in a different order (code at bottom of post):

    serialize()
1c4f1064ab79e4722f41ab5a8141b210
1ad0f2c7e690c8e3cd5c34f7c9b8573a

    json_encode()
db7178ba34f9271bfca3a05c5dddf502
c9661c0852c2bd0e26ef7951b4ca9e6f

    Sorted serialize()
1c4f1064ab79e4722f41ab5a8141b210
1c4f1064ab79e4722f41ab5a8141b210

    Sorted json_encode()
db7178ba34f9271bfca3a05c5dddf502
db7178ba34f9271bfca3a05c5dddf502

Therefore, the two methods that I would recommend to hash an array would be:

// You will need to write your own deep_ksort(), or see
// my example below

md5(   serialize(deep_ksort($array)) );

md5( json_encode(deep_ksort($array)) );

The choice of json_encode() or serialize() should be determined by testing on the type of data that you are using. By my own testing on purely textual and numerical data, if the code is not running a tight loop thousands of times then the difference is not even worth benchmarking. I personally use json_encode() for that type of data.
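
One way to run that kind of test on your own data is a small timing helper like the sketch below. The helper name time_hashing() and the placeholder $data are hypothetical; swap in a representative sample of what you actually hash, and treat the numbers as rough, since microtime() measurements vary between runs.

```php
<?php
// Hypothetical helper: calls $fn($data) $iterations times, returns elapsed seconds
function time_hashing($fn, $data, $iterations = 1000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($data);
    }
    return microtime(true) - $start;
}

// Placeholder data: replace with a representative sample of your real arrays
$data = array('name' => 'example', 'values' => range(1, 50));

$timings = array(
    'serialize'   => time_hashing(function ($d) { return md5(serialize($d)); }, $data),
    'json_encode' => time_hashing(function ($d) { return md5(json_encode($d)); }, $data),
);

foreach ($timings as $name => $seconds) {
    printf("%-12s %.6f sec\n", $name, $seconds);
}
```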

Here is the code used to generate the sorting test above:

$a = array();
$a['aa'] = array( 'aaa'=>'AAA', 'bbb'=>'ooo', 'qqq'=>'fff',);
$a['bb'] = array( 'aaa'=>'BBBB', 'iii'=>'dd',);

$b = array();
$b['aa'] = array( 'aaa'=>'AAA', 'qqq'=>'fff', 'bbb'=>'ooo',);
$b['bb'] = array( 'iii'=>'dd', 'aaa'=>'BBBB',);

echo "    serialize()\n";
echo md5(serialize($a))."\n";
echo md5(serialize($b))."\n";

echo "\n    json_encode()\n";
echo md5(json_encode($a))."\n";
echo md5(json_encode($b))."\n";



$a = deep_ksort($a);
$b = deep_ksort($b);

echo "\n    Sorted serialize()\n";
echo md5(serialize($a))."\n";
echo md5(serialize($b))."\n";

echo "\n    Sorted json_encode()\n";
echo md5(json_encode($a))."\n";
echo md5(json_encode($b))."\n";

My quick deep_ksort() implementation fits this case, but check it before using it on your own projects:

/*
* Sort an array by keys, and additionally sort its array values by keys
*
* Does not try to sort an object, but does iterate its public properties
* to sort arrays in properties. Objects are handles, so the caller's
* object is modified in place.
*/
function deep_ksort($input)
{
    if ( !is_object($input) && !is_array($input) ) {
        return $input;
    }

    foreach ( $input as $k=>$v ) {
        if ( is_object($v) || is_array($v) ) {
            // Array offsets and object properties need different syntax
            if ( is_array($input) ) {
                $input[$k] = deep_ksort($v);
            } else {
                $input->$k = deep_ksort($v);
            }
        }
    }

    if ( is_array($input) ) {
        ksort($input);
    }

    // Do not sort objects

    return $input;
}

Solution 4

The answer depends heavily on the data types of the array values. For big strings use:

md5(serialize($array));

For short strings and integers use:

md5(json_encode($array));

Four built-in PHP functions can transform an array into a string: serialize(), json_encode(), var_export(), print_r().
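
As a minimal sketch, here is each of those functions feeding md5() for the same small array (the sample array is made up for illustration). The last line checks a caveat worth keeping in mind when choosing: print_r() does not show value types, so arrays that differ only in type can produce the same string.

```php
<?php
$array = array('a' => 1, 'b' => array(2, 3));

// print_r() and var_export() return a string only when the second argument is true
$hashes = array(
    'serialize'   => md5(serialize($array)),
    'json_encode' => md5(json_encode($array)),
    'var_export'  => md5(var_export($array, true)),
    'print_r'     => md5(print_r($array, true)),
);

foreach ($hashes as $name => $hash) {
    echo str_pad($name, 12) . $hash . "\n";
}

// The integer 1 and the string '1' look identical in print_r() output
$printRCollides = print_r(array(1), true) === print_r(array('1'), true);
var_dump($printRCollides);
```

The four hashes differ from one another because each function produces a different string, so stick to one function for any set of hashes that will be compared.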

Note: json_encode() slows down when processing associative arrays with strings as values. In this case, consider using serialize() instead.

Test results for multi-dimensional array with md5-hashes (32 char) in keys and values:

Test name       Repeats         Result          Performance     
serialize       10000           0.761195 sec    +0.00%
print_r         10000           1.669689 sec    -119.35%
json_encode     10000           1.712214 sec    -124.94%
var_export      10000           1.735023 sec    -127.93%

Test result for numeric multi-dimensional array:

Test name       Repeats         Result          Performance     
json_encode     10000           1.040612 sec    +0.00%
var_export      10000           1.753170 sec    -68.47%
serialize       10000           1.947791 sec    -87.18%
print_r         10000           9.084989 sec    -773.04%

Associative array test source. Numeric array test source.

Solution 5

Aside from Brock's excellent answer (+1), any decent hashing library allows you to update the hash in increments, so you should be able to update with each string sequentially, instead of having to build up one giant string.

See: hash_update
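
A minimal sketch of that idea using PHP's hash_init(), hash_update() and hash_final() follows. The helper names array_hash_incremental() and array_hash_feed() are hypothetical, and the framing (length prefixes and bracket markers) is just one assumed way to keep different arrays from feeding identical byte streams. The resulting hash will not equal md5(serialize($array)), and like the other approaches it depends on element order.

```php
<?php
// Hypothetical helper: hashes a nested array without building one big string
function array_hash_incremental(array $array, $algo = 'md5')
{
    $ctx = hash_init($algo);
    array_hash_feed($ctx, $array);
    return hash_final($ctx);
}

// Hypothetical helper: walks the array and feeds each key and value to the context
function array_hash_feed($ctx, array $array)
{
    foreach ($array as $key => $value) {
        // Length prefixes keep array('ab', 'c') distinct from array('a', 'bc')
        $k = (string) $key;
        hash_update($ctx, 'k' . strlen($k) . ':' . $k);

        if (is_array($value)) {
            // Bracket markers record where a nested array starts and ends
            hash_update($ctx, '[');
            array_hash_feed($ctx, $value);
            hash_update($ctx, ']');
        } else {
            // serialize() on a single value preserves its type (1 versus '1')
            $v = serialize($value);
            hash_update($ctx, 'v' . strlen($v) . ':' . $v);
        }
    }
}

echo array_hash_incremental(array('a' => array(1, 2), 'b' => 'hello')) . "\n";
```

Whether this beats md5(serialize($array)) in practice depends on the data; the main gain is lower peak memory for very large arrays.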

Author: Peter John

Updated on July 20, 2021

Comments

  • Peter John
    Peter John almost 3 years

    What is the best way to generate an MD5 (or any other hash) of a multi-dimensional array?

    I could easily write a loop which would traverse through each level of the array, concatenating each value into a string, and simply performing the MD5 on the string.

    However, this seems cumbersome at best and I wondered if there was a funky function which would take a multi-dimensional array, and hash it.

  • farinspace
    farinspace about 13 years
    if for some reason you want to match the hash (fingerprint) you may want to consider sorting the array "sort" or "ksort", additionally implementing some sort of scrubbing/cleaning might be needed as well
  • Nathan J.B.
    Nathan J.B. over 12 years
    LOL! Really? I got down voted for "over" optimization? In reality, PHP's serialize is significantly slower. I'll update my answer with evidence...
  • Nathan J.B.
    Nathan J.B. over 12 years
    If anyone is interested in a JSON only test (that doesn't involve MD5), take a look here (ironically, Col. Shrapnel is trolling there as well).
  • SeanDowney
    SeanDowney almost 12 years
    What Nathan has done here is valuable even if one cannot see the value of it. It may be a valuable optimization in some situations that are outside of our context. Micro optimization is a poor decision in some but not all situations
  • wrygiel
    wrygiel almost 12 years
    it's worth noting, that this method is inefficient if you're updating with tiny fragments; it's good for big chunks of huge files though.
  • C. K. Young
    C. K. Young almost 12 years
    @wrygiel That is not true. For MD5, compression is always done in 64-byte blocks (no matter what the size of your "big chunks" are), and, if you haven't yet filled up a block, no processing happens until the block is filled up. (When you finalise the hash, the last block is padded up to a full block, as part of final processing.) For more background, read Merkle-Damgard construction (which MD5, SHA-1, and SHA-2 are all based on).
  • wrygiel
    wrygiel almost 12 years
    You're right. I was totally misled by a comment on some other site.
  • C. K. Young
    C. K. Young almost 12 years
    @wrygiel That's why it pays to do your own research when following an idea "found on the Internet". ;-) In so saying, that last comment was easy for me to write, because I actually implemented MD5 from scratch a few years ago (to practise my Scheme programming skills), so I know its workings very well.
  • bumperbox
    bumperbox over 11 years
    I am not one for micro-optimization for the sake of it, but where there is a documented performance increase for no extra work, then why not use it.
  • samitny
    samitny over 11 years
    Actually, it looks like it depends on how deep the array is. I happen to need something that needs to run as fast as possible, and while your POC shows that json_encode() is ~300% faster, when I changed the $array variable in your code to my use case, it returned: serialize() w/ md5() took: 0.27773594856262 sec; json_encode() w/ md5() took: 0.34809803962708 sec; json_encode is (79.8%) faster with a difference of (-0.070362091064453 seconds) (the percent calculation is obviously incorrect). My array is up to 2 levels deep, so just keep in mind that (as usual) your mileage may vary.
  • s3m3n
    s3m3n about 11 years
    Serialize is soooooooo much slower than json_encode from second answer. Do your server a pleasure and use json_encode! :)
  • Ligemer
    Ligemer over 10 years
    It seems like you need to benchmark your own array in order to figure out if you should use json_encode or serialize. Depending on the array it differs.
  • ReSpawN
    ReSpawN over 10 years
    Okay, I don't see why Nathan's answer is not the top answer. Seriously, use serialize and annoy your users with an immensely slow site. Epic +1 @NathanJ.Brauer!
  • TermiT
    TermiT over 9 years
    i believe it's a wrong way, please check my explanation below.
  • joelpittet
    joelpittet about 9 years
    mikeytown2 showed me this test is flawed, if you reverse the order so the serialize is run second it will be faster. @see drupal.org/node/2503261#comment-10007641
  • Nathan J.B.
    Nathan J.B. about 9 years
    @joelpittet - I noticed you suggested a change yesterday. Take another look at your code from drupal.org/node/2503261#comment-10007269. You have a bug in your code -- you're calculating the completed time for JSON after you've run serialize (in addition to JSON), so you're actually getting the SUM of the two. I'm updating my answer to include reversing JSON and Serialize where you can see JSON is still winning out. :)
  • Nathan J.B.
    Nathan J.B. about 9 years
    @joelpittet - Also, your alternate code is also incorrect. It sets JSON results to the serialize variable and the serialize results to the JSON variable: dl.dropboxusercontent.com/u/4115701/Screenshots/… -- I fixed this bug and edited my answer to include that test which shows JSON winning by ~200ms.
  • Nathan J.B.
    Nathan J.B. about 9 years
    @joelpittet - Nope. Both examples in that drupal link have bugs. See the comments in my answer below. ;) E.g. dl.dropboxusercontent.com/u/4115701/Screenshots/…
  • user956584
    user956584 almost 9 years
    Not so fast, the best option is to use md4; var_export is also slow
  • Jianwu Chen
    Jianwu Chen almost 9 years
    This is exactly what I want. Moving and copying big chunks of data in memory is not acceptable sometimes. So using serialize(), as the other answers do, is a very bad idea in terms of performance. But this API is still lacking if I only want to hash part of a string from a certain offset.
  • A.L
    A.L over 8 years
    Can you please explain what are big and short strings?
  • Alexander Yancharuk
    Alexander Yancharuk over 8 years
    @A.L short strings: strings that contain fewer than 25-30 chars. Big strings: all strings containing more than 25-30 chars.
  • Tarsis
    Tarsis over 8 years
    Very informative Answer. I even get a performance increase of 300-700% in my use case. +1
  • Adam Pietrasiak
    Adam Pietrasiak about 8 years
    This solution has wasted a few hours of my day. json_encode IGNORES objects in an array, while serialize does include them. I have arrays with some different objects inside, but as json_encode was ignoring them, the hashes were the same. Because of that, I had a lot of fun with my cache functions today.
  • Nathan J.B.
    Nathan J.B. about 8 years
    It does help to understand how json_encode and json_decode work: php.net/json_encode php.net/json_decode ;)
  • geilt
    geilt over 5 years
    It is worth noting that json_encode is also faster than http_build_query with MD5. Serialize itself is faster than http_build_query. Thanks for the tip! Time to update my codebase's caching functions... Man, I love json_encode... Here are my results on PHP 7.2: serialize() w/ md5() took: 0.043596029281616 sec; json_encode() w/ md5() took: 0.025765180587769 sec; http_build_query() w/ md5() took: 0.084064960479736 sec
  • Hebe
    Hebe over 3 years
    $sign = sha1(json_encode($data)); I ended up using sha1 as well, as it seems more unique.
  • AlexeyP0708
    AlexeyP0708 over 2 years
    I totally agree. `serialize(new \stdClass()) === serialize(new \stdClass());`