Hashing a Tree Structure

24,286

Solution 1

If I were to do this, I'd probably do something like the following:

For each leaf node, compute the concatenation of 0 and the hash of the node data.

For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade up the tree every time you change anything, but that MAY be low-enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the amount of changes, it may even make sense to go for a cryptographically secure hash.

Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagate a "false" up the tree (or "hash invalid" and propagate "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.

Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.

Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.

However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).

Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).

#! /usr/bin/env  python

import Crypto.Hash.SHA

class Node:
    def __init__ (self, parent=None, contents="", children=[]):
        self.valid = False
        self.hash = False
        self.contents = contents
        self.children = children


    def append_child (self, child):
        self.children.append(child)

        self.invalidate()

    def invalidate (self):
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash (self):
        if self.valid:
            return self.hash

        digester = crypto.hash.SHA.new()

        digester.update(self.contents)

        if self.children:
            for child in self.children:
                digester.update(child.gethash())
            self.hash = "1"+digester.hexdigest()
        else:
            self.hash = "0"+digester.hexdigest()

        return self.hash

    def setcontents (self):
        self.valid = False
        return self.contents

Solution 2

Okay, after your edit where you've introduced a requirement that the hashing result should be different for different tree layouts, you're only left with option to traverse the whole tree and write its structure to a single array.

That's done like this: you traverse the tree and dump the operations you do. For an original tree that could be (for a left-child-right-sibling structure):

[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
 sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]

You may then hash the list (that is, effectively, a string) the way you like. As another option, you may even return this list as a result of hash-function, so it becomes collision-free tree representation.

But adding precise information about the whole structure is not what hash functions usually do. The way proposed should compute hash function of every node as well as traverse the whole tree. So you may consider other ways of hashing, described below.


If you don't want to traverse the whole tree:

One algorithm that immediately came to my mind is like this. Pick a large prime number H (that's greater than maximal number of children). To hash a tree, hash its root, pick a child number H mod n, where n is the number of children of root, and recursively hash the subtree of this child.

This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.

If you want to hash less elements but go through the whole tree:

Instead of hashing subtree, you may want to hash layer-wise. I.e. hash root first, than hash one of nodes that are its children, then one of children of the children etc. So you cover the whole tree instead of one of specific paths. This makes hashing procedure slower, of course.

    --- O  ------- layer 0, n=1
       / \
      /   \
 --- O --- O ----- layer 1, n=2
    /|\    |
   / | \   |
  /  |  \  |
 O - O - O O------ layer 2, n=4
          / \
         /   \
 ------ O --- O -- layer 3, n=2

A node from a layer is picked with H mod n rule.

The difference between this version and previous version is that a tree should undergo quite an illogical transformation to retain the hash function.

Solution 3

The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.

For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):

static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}

It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depend on the more specific aspects of your data - i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus adding their position into play will give you a pretty good algorithm.

One option of taking the position into account is the node's position in an inorder walk of the tree.

Solution 4

Any time you are working with trees recursion should come to mind:

public override int GetHashCode() {
    int hash = 5381;
    foreach(var node in this.BreadthFirstTraversal()) {
        hash = 33 * hash + node.GetHashCode();
    }
}

The hash function should depend on the hash code of every node within the tree as well as its position.

Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.

Reordering the children of a node should distinctly change the resulting hash code.

Check. They will be visited in a different order in the in-order traversal leading to a different hash code. (Note that if there are two children with the same hash code you will end up with the same hash code upon swapping the order of those children.)

Reflecting any part of the tree should distinctly change the resulting hash code

Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)

Solution 5

The collision-free property of this will depend on how collision-free the hash function used for the node data is.

It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.

If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.

Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.

To go for something similar to Java's String hashcode:

Say you have n child nodes.

hash(node) = hash(nodedata) +
             hash(childnode[0]) * 31^(n-1) +
             hash(childnode[1]) * 31^(n-2) +
             <...> +
             hash(childnode[n])

Some more detail on the scheme used above can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Share:
24,286

Related videos on Youtube

Noldorin
Author by

Noldorin

entrepreneur; graduate in mathematics, logic, theoretical computer science, theoretical physics About Me Website Blog LinkedIn

Updated on January 10, 2020

Comments

  • Noldorin
    Noldorin over 4 years

    I've just come across a scenario in my project where it I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.

    Take for example the following tree:

            O
           / \
          /   \
         O     O
        /|\    |
       / | \   |
      O  O  O  O
              / \
             /   \
            O     O
    

    Where each O represents a node of the tree, is an arbitrary object, has has an associated hash function. So the problem reduces to: given the hash code of the nodes of tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?

    A few notes on the properties of the hash function:

    • The hash function should depend on the hash code of every node within the tree as well as its position.
    • Reordering the children of a node should distinctly change the resulting hash code.
    • Reflecting any part of the tree should distinctly change the resulting hash code

    If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.


    UPDATE

    Well, here's my own proposed solution. It has been helped much by several of the answers here.

    Each node (sub-tree/leaf node) has the following hash function:

    public override int GetHashCode()
    {
        int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
            this.Value.GetHashCode()));
        for (int i = 0; i < this.Children.Count; i++)
            hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
        return hashCode;
    }
    

    The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).

    Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.

    • Noldorin
      Noldorin over 14 years
      @Eli Bendersky: Indeed. I've modified the question to imply "as collision-free as possible".
    • Jason Orendorff
      Jason Orendorff over 14 years
      None of these answers do a great job of explaining it, but a tree is just a tuple (node-local data, subtree0, subtree1, ...). Tuples are hashable. Done. See vatine's and pnm's answers for more details.
    • P Shved
      P Shved over 14 years
      @Jason, knowing this, I assumed (at least, before the second edit) that trees are "too large" to be hashed as mere tuples.
    • BlueRaja - Danny Pflughoeft
      BlueRaja - Danny Pflughoeft over 14 years
      @Eli Bendersky: For all practical purposes, collision-free is pretty simple. For instance, SHA1 is 15 years old and only 160 bits, yet even with our best super-computers no one has ever found two values with the same SHA1 hash (though my guess is that will happen pretty soon).
    • San Jacinto
      San Jacinto over 14 years
      @BlueRaja yes, but try to map the output of SHA1 to an serially and linearl-increasing, addressable space that is say, 1,000 elements long. Now try to tell me THAT will be collision-free.
    • Noldorin
      Noldorin over 14 years
      To all: please see my update.
    • William
      William over 14 years
      @San Jacinto considering there has never been a collision yet, and cryptologist's are TRYING to find one, the probability is very low you'll EVER find a collision, even with that architecture.
    • San Jacinto
      San Jacinto over 14 years
      @William you're missing the point entirely. It is nigh impossible to map the entire possible output space of SHA1 to a 1000-element array, for instance. You've got to squash it somehow. I would be flabberghasted if you could hash a couple hundred elements without a collision on a 1000-element table.
    • Vladimir Panteleev
      Vladimir Panteleev about 7 years
      Would Merkle trees help?
  • Noldorin
    Noldorin over 14 years
    Interesting suggestion. Not sure how much of a problem it might turn out to be, but these trees could potentially be quite deep and don't really have a maximal number of nodes (could be 10, 100, 1000, or even a bit larger).
  • Noldorin
    Noldorin over 14 years
    You make a good point in general. However, trees are slightly different to sequences in that they contain greater structure - a tree cannot simply be represent by a sequence (unless you specifically know it is a binary/ternary/etc. tree). Might be able to adapt the algorithm easily enough however...
  • P Shved
    P Shved over 14 years
    A tree can be represented as a sequence. I showed in my answer and example of how.
  • President James K. Polk
    President James K. Polk over 14 years
    The "position" incorporates the path information. For example, for every node assign a position value of 0 for the node itself and 1..n for each of its n children from left to right. When you visit child number i in your traversal, you include i in the hash. When you visit the node itself, include 0 and the node's hash contents. The choice of the constants 0, 1, ..., n was arbitrary and should instead be chosen based on domain-specific knowledge, e.g. maybe "0-mississippi", "1-mississippi", etc. will work better.
  • Noldorin
    Noldorin over 14 years
    In my situation, I'll often be comparing various trees against the same other tree many times. In this case, surely it would be more efficient to compute the hash, since you'll only need to recurse over the common tree's nodes once?
  • Noldorin
    Noldorin over 14 years
    @Pavel Shved: It can indeed, yet a sequence is still an ambiguous representation. See this for example: pastebin.com/m44d5b6b6 (the same applies to a depth-first traversal)
  • Larry Watanabe
    Larry Watanabe over 14 years
    I got that - that's why i modified my answer to include the hash code generator. You can simply write the tree out as a string, then sample it. If you keep every character, you are guaranteed no collisions, but less efficient. every other character, more efficient, possible collisions. You choose the tradeoff based on your application, data, etc.
  • Larry Watanabe
    Larry Watanabe over 14 years
    hmm, why is everyone's answer so complex? You can generate an exact hashcode in 5 lines, and sample it to generate a shorter hashcode (see my answer below).
  • P Shved
    P Shved over 14 years
    @Noldorin, that's why it's important to add aditional symbols to sequence, so it will be of greater length than the original tree.
  • Noldorin
    Noldorin over 14 years
    @Larry: Ah, that's a novel approach. +1 It's certainly simple and effective I would think... though perhaps not the most efficient. WIll have a think about it.
  • Noldorin
    Noldorin over 14 years
    @George: Mind commenting/explaining that code a bit more please?
  • Larry Watanabe
    Larry Watanabe over 14 years
    Well, you can always optimize it easily enough by skipping characters instead of actually generating them. As long as the skip function is consistent, it would still return a valid hash code. Get it right first, optimize later :) This simple approach is easy to optmize and extend. If you look at the other approaches you will see that they are just using a lazy generation method for the hash code, but it is a little more ad hoc than this approach and much less simple. Start with simple and correct, then optimize later.
  • P Shved
    P Shved over 14 years
    @Larry, more tricky solutions are usually more complex than straightforward ones.
  • Noldorin
    Noldorin over 14 years
    @Jason: Thanks for the reply. This indeed a nice simple solution - it's what first game to my mind, yet it fails the condition I propose here: pastebin.com/m44d5b6b6 (Sorry for note stating this in my original question).
  • Jason Orendorff
    Jason Orendorff over 14 years
    -1. All hash code algorithms I know of use all the data. Hash codes are simply cached to make calculating them constant-time in practice.
  • Jason Orendorff
    Jason Orendorff over 14 years
    +1. This is the right answer. Calculating this can be O(1) amortized, because you can just cache at each node the hash of the subtree from there. (When changes are made, you can walk up the tree, just marking each cached hash code as invalid rather than recalculating them. This way when many changes are made one after another, you don't have to walk up the tree every time.)
  • Jason Orendorff
    Jason Orendorff over 14 years
    -1. The proposed approach (walking the tree to build a list) is too complicated. The problem has a simple, direct solution, given by vatine and pnm.
  • Jason Orendorff
    Jason Orendorff over 14 years
    This has a bug. Instead of this.InOrderTraversal() it should say this.ChildNodes. Otherwise each node will be visited 2^(n-1) times where n is the number of ancestors it has...
  • Jason Orendorff
    Jason Orendorff over 14 years
    I don't think this satisfies the requirement of distinguishing trees that have the same preorder traversal but different structures.
  • P Shved
    P Shved over 14 years
    @Jason Orenforff, (a) I proposed three approaches, (b) I don't think it's complicated. Wherther this is my problem, I'm not sure.
  • jason
    jason over 14 years
    @Noldorin: You're right, and a depth-first search would have a similar problem. Consequently I think you're going to need to encode the path that you take into the process.
  • jason
    jason over 14 years
    @Jason Orendorff: Huh? In a tree traversal every node is visited once.
  • Keith Randall
    Keith Randall over 14 years
    I don't think that was listed as a requirement. If you want that, you could add node depth to the hash. I suspect preorder index plus node depth would determine a unique tree.
  • Noldorin
    Noldorin over 14 years
    Good suggestion; at the least, for the proposal of cascading changes up the tree and caching hache codes.
  • CPerkins
    CPerkins over 14 years
    +1, but I have a modification to suggest, assuming that node hashcode calculation has a measurable cost: instead of recalculating the cached hashcodes on change, invalidate them. You have to walk up the tree either way, but there's no need to recalculate your hashcodes until they're called for, so if you get multiple updates between comparisons, you only pay the recalculation cost once per comparison.
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft over 14 years
    +1 for correct answer. Poster asked for the best theoretical solution, and this is exactly how they do it in cryptography papers. If security is not an issue, then you can probably just get away with concatenating all the values (and their numeric positions) and hashing that, for a very-quick, usually-collision-free-assuming-no-malicious-users hash.
  • Noldorin
    Noldorin over 14 years
    @Jason: Yeah, this is reasonably similar to my own suggested solution. I'm wondering, is there any real advantage of initialising hash to a non-zero (prime) value?
  • Noldorin
    Noldorin over 14 years
    Yeah, I'm definitely leaning towards accepting this answer. If you could make any comments/suggestions on the specific algorithm of my own proposed solution, would appreciate that and would definitely give you the answer.
  • Noldorin
    Noldorin over 14 years
    @Keith: Jason is right - order of traversal is not enough - structure needs to be accounted for too.
  • Noldorin
    Noldorin over 14 years
    @vatine: Thanks for the extra information in your edits. This seem like a pretty complete answer now, so the bounty is yours. :)